RFC: Untangle native libraries and the JVM: SVML, SLEEF, and libsimdsort
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Wed Dec 11 01:04:36 UTC 2024
Thanks, Maurizio.
On 12/9/24 03:42, Maurizio Cimadamore wrote:
> Great work Vlad!
>
> The simdsort part seems like a more "classic" FFM binding - where you have a
> method handle per entry point. That seems to fit the design of FFM
> rather well. In the second case (SVML/SLEEF), usage of FFM is limited to
> building a "table of entry points" (e.g. we're just using SymbolLookup +
> MemorySegment here -- the invocation part is intrinsified as part of the
> new VectorSupport methods).
I'd say that both the simdsort and SVML/SLEEF cases are slightly off from
the sweet spot the FFM API is designed for, since all three libraries
heavily rely on CPU dispatching.
> If it helps, it might be possible to define a custom (JDK internal)
> family of value layouts for vector types. Then we could enhance the
> Linker classification to support such layouts. This means you could call
> into native functions with vector parameters and return types using the
> Linker API more directly. Not sure if it will give you the same
> performance, but it's also an approach worth exploring.
FTR I experimented a bit with vector calling convention support, but, as
the Vector API is implemented now, it introduced a significant amount of
complexity on both sides, so I decided to keep the vector intrinsics for
now. Even with those intrinsics in place, the migration already enables
significant simplifications in the Vector API. Still, it would be
convenient to eventually get vector support in FFM.
> Re. support for custom calling conventions to call into hotspot stubs
> from Java, this might be possible - our story for supporting calling
> conventions other than the system calling convention is that there
> should be a dedicated linker instance per calling convention. So, if the
> JVM defines its own calling convention for its stubs there should
> probably be a custom Linker implementation that is used to call into
> such stubs - which uses the machinery in the Linker implementation (e.g.
> Bindings) to classify the incoming function descriptors and determine
> the shuffle sequence for a given particular call. This should all be
> doable (at least inside the JDK) - it's just a matter of "writing more code".
Interesting. Thanks for the details.
> I agree with Paul that, as we move more stuff to use Panama, we will
> need to look more at the avenues available to us to claim back some of
> the additional warm up cost introduced by the use of var/method handles.
> This is probably part of a bigger exploration on warmup and FFM.
In the case of C2 intrinsics it may be less of an issue: the additional
startup cost may be quickly recouped during warmup, because the optimized
implementation becomes available earlier.
Best regards,
Vladimir Ivanov
> On 06/12/2024 23:18, Vladimir Ivanov wrote:
>> Recently, a trend emerged to use native libraries to back intrinsics
>> in the HotSpot JVM. SVML stubs for the Vector API paved the road, and
>> they were soon followed by the SLEEF and simdsort libraries.
>>
>> After examining their support, I must confess that it doesn't look
>> pretty. It introduces significant accidental complexity on the JVM side.
>> HotSpot has to be taught about every entry point in each library in an
>> ad-hoc manner. It's inherently unsafe, error-prone to implement, and
>> hard to maintain: the JVM makes a lot of assumptions about an entry
>> point based solely on its symbolic name, and each library has its own
>> naming conventions. Overall, the current approach doesn't scale well.
>>
>> Fortunately, the new FFI API (java.lang.foreign) was finalized in JDK 22.
>> It provides enough functionality to interact with native libraries from
>> Java in a performant manner.
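>> As a minimal sketch of what that looks like, a plain downcall into
>> libc's strlen needs no JVM-side knowledge of the target library at all:
>>
>>     import java.lang.foreign.*;
>>     import java.lang.invoke.MethodHandle;
>>
>>     static long strlen(String s) throws Throwable {
>>         Linker linker = Linker.nativeLinker();
>>         // Bind the "strlen" symbol from the default (libc) lookup.
>>         MethodHandle handle = linker.downcallHandle(
>>                 linker.defaultLookup().find("strlen").orElseThrow(),
>>                 FunctionDescriptor.of(ValueLayout.JAVA_LONG,
>>                                       ValueLayout.ADDRESS));
>>         try (Arena arena = Arena.ofConfined()) {
>>             MemorySegment cstr = arena.allocateFrom(s);  // NUL-terminated copy
>>             return (long) handle.invokeExact(cstr);
>>         }
>>     }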
>>
>> I did an exercise to migrate all three libraries away from intrinsics,
>> and the results look promising:
>>
>> simdsort: https://github.com/openjdk/jdk/pull/22621
>>
>> SVML/SLEEF: https://github.com/openjdk/jdk/pull/22619
>>
>> As of now, java.lang.foreign lacks vector calling convention support,
>> so the actual calls into SVML/SLEEF are still backed by intrinsics.
>> But it still enables a major cleanup on the JVM side.
>>
>> Also, I coded the library headers and used jextract to produce an
>> initial sketch of the library API in Java, and it worked really well.
>> Eventually, this step can be incorporated into the JDK build process to
>> ensure consistency between the native and Java parts of the library API.
>>
>> Performance-wise, it is on par with the current (intrinsic-based)
>> implementation.
>>
>> One open question relates to CPU dispatching.
>>
>> Each library exposes multiple functions with different requirements on
>> CPU ISA extension support (e.g., no AVX vs. AVX2 vs. AVX512, or NEON
>> vs. SVE). Right now, CPU dispatching is the JVM's responsibility, but
>> once the JVM is out of the loop, the library itself has to make the
>> decision. I experimented with two approaches: (1) perform CPU
>> dispatching from Java code when linking the library (as illustrated in
>> the aforementioned PRs); or (2) call into the native library to query it
>> for the right entry point [1] [2] [3]. In both cases, the dispatch
>> depends on an additional API to sense the JVM/hardware capabilities
>> (exposed on jdk.internal.misc.VM for now).
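>> As a rough illustration of approach (1), the Java-side dispatch boils
>> down to picking a symbol name before linking; the capability flags and
>> symbol names below are made up for the example (the real ones are in
>> the PRs):
>>
>>     import java.lang.foreign.MemorySegment;
>>     import java.lang.foreign.SymbolLookup;
>>
>>     // Approach (1), sketched: choose the entry point in Java while linking.
>>     static MemorySegment resolveSortStub(SymbolLookup simdsort,
>>                                          boolean hasAVX512, boolean hasAVX2) {
>>         String symbol = hasAVX512 ? "avx512_sort_int"
>>                       : hasAVX2   ? "avx2_sort_int"
>>                       : null;                  // fall back to the Java sort
>>         return (symbol == null)
>>                 ? MemorySegment.NULL
>>                 : simdsort.find(symbol).orElse(MemorySegment.NULL);
>>     }
>>
>> Approach (2) replaces the Java-side checks with a single native call
>> that returns the chosen function pointer.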
>>
>> Let me know if you have any questions/suggestions/concerns. Thanks!
>>
>> I plan to eventually start publishing PRs to upstream this work.
>>
>> Best regards,
>> Vladimir Ivanov
>>
>> [1] https://github.com/openjdk/jdk/commit/b6e6f2e20772e86fbf9088bcef01391461c17f11
>>
>> [2] https://github.com/iwanowww/jdk/blob/09234832b6419e54c4fc182e77f6214b36afa4c5/src/java.base/share/classes/java/util/SIMDSortLibrary.java
>>
>> [3] https://github.com/iwanowww/jdk/blob/09234832b6419e54c4fc182e77f6214b36afa4c5/src/java.base/linux/native/libsimdsort/simdsort.c
>>