Lambda special inline treatment is desirable elsewhere

Tue Sep 26 20:33:40 UTC 2023

Thanks, Randall. I'll take a closer look.

You could also try to implement dispatch trees with MethodHandles.
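
A rough sketch of what that could look like, using the expression from your mail (x + 1000 / y - 50 * z / 2). All names here (DispatchTreeSketch, constant, variable, binary, eval) are made up for illustration and are not taken from your repository. Each node is a MethodHandle of type (long,long,long)long, and inner nodes are built with the standard java.lang.invoke combinators, so the composed tree contains no interface calls:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

class DispatchTreeSketch {
    // Hypothetical primitive operations used as inner nodes of the tree.
    static long add(long a, long b) { return a + b; }
    static long sub(long a, long b) { return a - b; }
    static long mul(long a, long b) { return a * b; }
    static long div(long a, long b) { return a / b; }

    static final MethodType BINARY =
            MethodType.methodType(long.class, long.class, long.class);
    // Every node of the tree has this shape: (x, y, z) -> long.
    static final MethodType ENV =
            MethodType.methodType(long.class, long.class, long.class, long.class);

    static MethodHandle op(String name) throws ReflectiveOperationException {
        return MethodHandles.lookup().findStatic(DispatchTreeSketch.class, name, BINARY);
    }

    // Leaf: a constant that ignores x, y and z.
    static MethodHandle constant(long value) {
        return MethodHandles.dropArguments(
                MethodHandles.constant(long.class, value), 0,
                long.class, long.class, long.class);
    }

    // Leaf: selects x (0), y (1) or z (2).
    static MethodHandle variable(int index) {
        return MethodHandles.permuteArguments(
                MethodHandles.identity(long.class), ENV, index);
    }

    // Inner node: op(left(x, y, z), right(x, y, z)).
    static MethodHandle binary(MethodHandle op, MethodHandle left, MethodHandle right) {
        MethodHandle h = MethodHandles.collectArguments(op, 0, left);  // (x,y,z,b)long
        h = MethodHandles.collectArguments(h, 3, right);               // (x,y,z,x,y,z)long
        return MethodHandles.permuteArguments(h, ENV, 0, 1, 2, 0, 1, 2);
    }

    // Builds (x + 1000 / y) - (50 * z / 2), following Java operator precedence.
    static MethodHandle build() throws ReflectiveOperationException {
        MethodHandle x = variable(0), y = variable(1), z = variable(2);
        MethodHandle lhs = binary(op("add"), x, binary(op("div"), constant(1000), y));
        MethodHandle rhs = binary(op("div"), binary(op("mul"), constant(50), z), constant(2));
        return binary(op("sub"), lhs, rhs);
    }

    // Held in a static final field so the JIT can treat the handle as a constant.
    static final MethodHandle EXPR;
    static {
        try {
            EXPR = build();
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    static long eval(long x, long y, long z) {
        try {
            return (long) EXPR.invokeExact(x, y, z);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(eval(7, 10, 4)); // (7 + 1000/10) - (50*4/2)
    }
}
```

Whether this actually gets inlined end to end depends on the handle being a compile-time constant for the JIT (hence the static final field here); a handle loaded from an ordinary instance field would likely not get the same treatment.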

Best regards,
Vladimir Ivanov

On 9/26/23 13:00, Randall Oveson wrote:
> Hi Vladimir,
> 
> Thanks for the reply. I've been poking at this problem for the past couple of weeks but have mostly abandoned my attempts to solve it using the JIT alone in favor of generating composed functions using ASM. However, I would still love to see a future where that isn't necessary and I can just throw together a tower of virtual calls and have them either inlined or staticized by the JIT reliably.
> 
> I've put together an example that I think better illustrates the problem I'm trying to solve[0]. Instead of a JVM patch, here I get the performance increase I want by copy-and-pasting my composing classes and never reusing the same one, causing the JIT to consider all the call sites monomorphic and freely inline or staticize them. The baseline is the "idiomatic" form, and suffers from recursive inline limits as well as virtual call sites in ways that I wish it didn't.
> 
> This problem is more severe the more polymorphism there is (bigger runtime dispatch tables?) and the simpler the logical behavior of the complete invocation is (e.g. an expression compiler working with `x + 1000 / y - 50 * z / 2`, or an encoder going from an Object[] of Long, Long, Int, Double, Long to a compact struct in off-heap memory). These are cases where the hand-written function would have extremely high throughput, so the function composed from interfaces at runtime is dominated by dispatch.
> 
> 0. https://github.com/randalloveson/hotspot-inline-example
> 
> 
> 
> 
> ------- Original Message -------
> On Tuesday, September 26th, 2023 at 11:18 AM, Vladimir Ivanov <vladimir.x.ivanov at oracle.com> wrote:
> 
> 
>>
>>
>> Hi Randall,
>>
>> I don't fully understand what kind of change you experimented with. Do
>> you mind sharing the patch?
>>
>> Compilers have special handling for lambda forms
>> (java.lang.invoke.LambdaForm) which are the crucial piece of performant
>> invokedynamic and java.lang.invoke implementation. Lambda forms are
>> aggressively shared and many distinct MethodHandles share LambdaForm
>> instances. Based on that knowledge, the JVM special-cases them in several
>> places. The check you refer to in InlineTree::try_to_inline() lifts
>> recursive inlining constraints from LambdaForms to MethodHandles, but
>> the constraint is still there.
>>
>> Lambdas (the Java language feature) are implemented on top of invokedynamic
>> and the JVM doesn't do anything in particular to optimize specifically for them.
>>
>> It would be really helpful if you could share a benchmark demonstrating the
>> use case you care about.
>>
>> Best regards,
>> Vladimir Ivanov
>>
>> On 9/7/23 10:49, Randall Oveson wrote:
>>
>>> I'm considering a patch to improve the performance of a common pattern
>>> in my (and plausibly others') application. The pattern relates to
>>> polymorphic processing of records or tuples, e.g. serializing or
>>> deserializing an Avro or CSV record, or evaluating a runtime-constructed
>>> expression tree.
>>>
>>> You have an immutable tree (often a mere list) of objects implementing a
>>> common interface. From a CHA perspective the interface is megamorphic,
>>> but it's always runtime-monomorphic at most call sites (anything within
>>> the immutable tree). The methods themselves are often cheap, sometimes
>>> as simple as reading a single byte from a stream or doing a single
>>> arithmetic operation, so it's imperative that they all be inlined. You
>>> might say these methods are "dominated by their composition with other
>>> methods".
>>>
>>> In practice it is not possible to tune C2's inlining acceptably for this
>>> pattern for a few reasons, but the major one is the recursive inlining
>>> detection. If your tuple-processor has to deal with, say, 15 integer-typed
>>> values (so 15 of the methods in the immutable call tree happen to
>>> be the same method), you won't see any inlining happen because
>>> InlineTree::try_to_inline considers these calls recursive and the
>>> default MaxRecursiveInlineLevel is 1. Intuitively, these calls aren't
>>> really "recursive" in the classic sense; the number of calls to the same
>>> method is statically bounded, and there's nothing significant about them
>>> being the same call anyway; they could just as well have been different
>>> calls if the tuple types at those positions had been different.
>>>
>>> It seems this problem was well-observed with lambdas, because there's an
>>> exception carved out in try_to_inline for lambda-form methods. In those
>>> cases, we check to see if the argument 0 ("receiver") of the method is
>>> the same before considering it recursive.
>>>
>>> One patch I tested is extending that lambda-form detection of recursive
>>> inlining to all non-static methods. That solves my performance problem
>>> and doesn't appear to cause any new performance problems in my project,
>>> but I can imagine cases where it might be problematic. Still, I think
>>> it's worth considering as a solution if it hasn't been already.
>>>
>>> Another patch I've got is one that treats any non-static method that is
>>> also @ForceInline the same as lambda-form methods in the recursive
>>> inline check, along with a change to classFileParser.cpp to allow the
>>> use of @ForceInline outside of privileged code (the latter change I'd
>>> bet has been proposed before). This also solves my problem, but I doubt
>>> it would be acceptable upstream.
>>>
>>> I think my intuition about lambdas--which I'd hesitantly suggest is the
>>> popular intuition about lambdas--being merely "syntactic sugar" for
>>> ad-hoc abstract method implementations is at odds with the current state
>>> of C2. The more considerate and aggressive inlining behavior is
>>> extremely important for any immutable tree of compile-time-polymorphic,
>>> runtime-monomorphic calls. It's unfortunate that the only way to access
>>> that behavior is by using a different syntax, which may not be
>>> appropriate for other reasons.
>>>
>>> I'd appreciate any better ideas than the ones I've proposed here. I only
>>> started digging into this recently and it's my first time on the openjdk
>>> lists, so thanks in advance for your patience.
>>>
>>> Randall