Lambda special inline treatment is desirable elsewhere

Tue Sep 26 23:06:32 UTC 2023

Thanks for your replies. I'm afraid I only get more confused the more I read. I'm looking at MethodHandle, LambdaForm, the various BoundMethodHandle species, etc. LambdaForm in particular looks exactly like what I've been working on with ASM. I just want to take all the higher-order implementors of some interface (e.g. ComposedEncoder, or TernaryXBinaryCompsedFunction) and flatten them into a single, fresh method that statically invokes "leaf" methods and holds constants, and then let the JIT have its way with that. If I had known about LambdaForm and its surrounding infrastructure and it was part of the user-facing standard lib instead of internal, I would certainly be using that instead of ASM.

But what I really want is for that to just be how it works automatically, so you can just write Java methods that *look* like a bunch of different LambdaForm species and get LambdaForm performance.

Higher-order methods that consist of very little besides invoking virtual methods on their own `final` members should be automatically templated/"forked" for each unique combination of their field's classes, and should have high inline priority (at least ignore MaxInline like LambdaForm does). That would solve the problem for existing code, instead of prompting rewrites that use MethodHandle everywhere. What would the heuristic be for a method to be flagged as such? I don't know much about hotspot internals, but I'm imagining something that keys off of Method+arg0 and considers negatively how much static code there is in the method, and positively how many virtual calls there are that are monomorphic for that arg0. Something like the following would be a prime candidate for having its method code "forked" for every unique combination of field types:

    record BinaryXBinaryFunction<A, B, CA, CB, R>(
            BiFunction<A, B, CA> caf,
            BiFunction<A, B, CB> cbf,
            BiFunction<CA, CB, R> f) implements BiFunction<A, B, R> {
        R apply(A a, B b) {
            return f.apply(caf.apply(a, b), cbf.apply(a, b));
        }
    }

It wouldn't be a surprise if there were limitations I'm failing to consider here, or even if there is already something in HotSpot that does this kind of analysis and method duplication.

P.S. Please let me know if this carries out of scope for the list and I'll stop CC'ing it. Thanks.

------- Original Message -------
On Tuesday, September 26th, 2023 at 2:54 PM, Remi Forax <forax at univ-mlv.fr> wrote:

> 
> 
> Hi Randall,
> you can use the destructured visitor pattern for that, i.e. a visitor pattern where for each visit method, you can ask for several inlining caches to do the recursive calls
> 
> Let suppose you have the following hierarchy
> 
> interface Vehicle {}
> record Car(int passenger) implements Vehicle { }
> record CarrierTruck(Vehicle vehicle) implements Vehicle { }
> 
> The idea is to provide a visit method that do the work for a subtype (here CarrierTruck) but also ask to get a dispatch method to visit the rest of the tree.
> The annotation @Signature is the signature of the dispatch method. Here, it returns a String and takes a Vehicle and an int as parameters.
> 
> static String toJSON(CarrierTruck truck, int depth,
> @Signature({String.class, Vehicle.class, int.class}) MethodHandle dispatch) throws Throwable {
> return STR."""
> {
> "vehicle" : \{ (String) dispatch.invokeExact(truck.vehicle, depth + 2) }
> }
> """.indent(depth);
> }
> 
> Then you can create a destructured visitor with all the visit method and ask for a dispatch method
> 
> var lookup = MethodHandles.lookup();
> var methods = Arrays.stream(Demo.class.getDeclaredMethods())
> .filter(method -> method.getName().equals("toJSON"))
> 
> .toList();
> var destructuredVisitor = DestructuredVisitor.of(lookup, methods);
> var mh = destructuredVisitor.createVisitor(String.class, Demo.Vehicle.class, int.class);
> System.out.println("mh " + mh); // the dispatch method
> 
> 
> var truck = new Demo.CarrierTruck(new Demo.Car(4));
> var result = (String) mh.invokeExact((Demo.Vehicle) truck, 0);
> System.out.println(result);
> 
> 
> Here is a gist with a rough sketch showing how to implement it
> 
> https://gist.github.com/forax/176a61cc9707c884bfb059c31811dac9
> 
> I believe the JIT should be able to fully optimize this kind of code.
> 
> regard,
> Rémi
> 
> 
> ----- Original Message -----
> 
> > From: "Randall Oveson" rco at randalloveson.com
> > To: "Vladimir Ivanov" vladimir.x.ivanov at oracle.com
> > Cc: "hotspot-compiler-dev" hotspot-compiler-dev at openjdk.org
> > Sent: Tuesday, September 26, 2023 10:00:13 PM
> > Subject: Re: Lambda special inline treatment is desirable elsewhere
> 
> > Hi Vladimir,
> > 
> > Thanks for the reply. I've been poking at this problem for the past couple weeks
> > but mostly abandoned my attempts to solve it using the JIT alone in favor of
> > generating composed functions using ASM. However, I would still love to see a
> > future where that isn't necessary and I can just throw together a tower of
> > virtual calls and have them either inlined or staticized by the JIT reliably.
> > 
> > I've put together an example that I think better illustrates the problem I'm
> > trying to solve[0]. Instead of a JVM patch, here I get the performance increase
> > I want by copy-and-pasting my composing classes and never reusing the same one,
> > causing the JIT to consider all the call sites monomorphic and freely inline or
> > staticize them. The baseline is the "idiomatic" form, and suffers from
> > recursive inline limits as well as virtual call sites in ways that I wish it
> > didn't.
> > 
> > This problem is more severe the more polymorphism there is (bigger runtime
> > dispatch tables?) and the simpler the logical behavior of the complete
> > invocation is (e.g. an expression compiler working with `x + 1000 / y - 50 * z / 2`, or an encoder going from an Object[] of Long, Long, Int, Double, Long to
> > a compact struct in off-heap memory). Instances where throughput of the
> > hand-written function would be extremely high, so the
> > interface-composed-at-runtime function is dominated by dispatch.
> > 
> > 0. https://github.com/randalloveson/hotspot-inline-example
> > 
> > ------- Original Message -------
> > On Tuesday, September 26th, 2023 at 11:18 AM, Vladimir Ivanov
> > vladimir.x.ivanov at oracle.com wrote:
> > 
> > > Hi Randall,
> > > 
> > > I don't fully understand what kind of change you experimented with. Do
> > > you mind sharing the patch?
> > > 
> > > Compilers have special handling for lambda forms
> > > (java.lang.invoke.LambdaForm) which are the crucial piece of performant
> > > invokedynamic and java.lang.invoke implementation. Lambda forms are
> > > aggressively shared and many distinct MethodHandles share LambdaForm
> > > instances. Based on that knowledge, JVM special case them in several
> > > places. The check you refer to in InlineTree::try_to_inline() lifts
> > > recursive inlining constraints from LambdaForms to MethodHandles, but
> > > the constraint is still there.
> > > 
> > > Lambdas (Java language feature) are implemented on top of invokedynamic
> > > and JVM doesn't do anything particular to optimize specifically for them.
> > > 
> > > It would be really helpful if you share a benchmark demonstrating the
> > > use case you care about.
> > > 
> > > Best regards,
> > > Vladimir Ivanov
> > > 
> > > On 9/7/23 10:49, Randall Oveson wrote:
> > > 
> > > > I'm considering a patch to improve the performance of a common pattern
> > > > in my (and plausibly others') application. The pattern relates to
> > > > polymorphic processing of records or tuples, e.g. serializing or
> > > > deserializing an Avro or CSV record, or evaluating a runtime-constructed
> > > > expression tree.
> > > > 
> > > > You have an immutable tree (often a mere list) of objects implementing a
> > > > common interface. From a CHA perspective the interface is megamorphic,
> > > > but it's always runtime-monomorphic at most call sites (anything within
> > > > the immutable tree). The methods themselves are often cheap, sometimes
> > > > as simple as reading a single byte from a stream or doing a single
> > > > arithmetic operation, so it's imperative that they all be inlined. You
> > > > might say these methods are "dominated by their composition with other
> > > > methods".
> > > > 
> > > > In practice it is not possible to tune C2's inlining acceptably for this
> > > > pattern for a few reasons, but the major one is the recursive inlining
> > > > detection. If your tuple-processor has to deal with, say, 15-integer
> > > > type values (so 15 of the methods in the immutable call tree happen to
> > > > be the same method), you won't see any inlining happen because
> > > > InlineTree::try_to_inline considers these calls recursive and the
> > > > default MaxRecursiveInlineLevel is 1. Intuitively, these calls aren't
> > > > really "recursive" in the classic sense; the number of calls to the same
> > > > method is statically bounded, and there's nothing significant about them
> > > > being the same call anyway; they could just as well have been different
> > > > calls if the tuple types at those positions had been different.
> > > > 
> > > > It seems this problem was well-observed with lambdas, because there's an
> > > > exception carved out in try_to_inline for lambda-form methods. In those
> > > > cases, we check to see if the argument 0 ("receiver") of the method is
> > > > the same before considering it recursive.
> > > > 
> > > > One patch I tested is extending that lambda-form detection of recursive
> > > > inlining to all non-static methods. That solves my performance problem
> > > > and doesn't appear to cause any new performance problems in my project,
> > > > but I can imagine cases where it might be problematic. Still, I think
> > > > it's worth considering as a solution if it hasn't been already.
> > > > 
> > > > Another patch I've got is one that treats any non-static method that is
> > > > also @ForceInline the same as lambda-form methods in the recursive
> > > > inline check, along with a change to classFileParser.cpp to allow the
> > > > use of @ForceInline outside of privileged code (the latter change I'd
> > > > bet has been proposed before). This also solves my problem, but I doubt
> > > > it would be acceptable upstream.
> > > > 
> > > > I think my intuition about lambdas--which I'd hesitantly suggest is the
> > > > popular intuition about lambdas--being merely "syntatic sugar" for
> > > > ad-hoc abstract method implementations is at odds with the current state
> > > > of C2. The more considerate and aggressive inlining behavior is
> > > > extremely important for any immutable tree of compile-time-polymorphic,
> > > > runtime-monomorphic calls. It's unfortunate that the only way to access
> > > > that behavior is by using a different syntax, which may not be
> > > > appropriate for other reasons.
> > > > 
> > > > I'd appreciate any better ideas than the ones I've proposed here. I only
> > > > started digging into this recently and it's my first time on the openjdk
> > > > lists, so thanks in advance for your patience.
> > > > 
> > > > Randall