RFC: Untangle native libraries and the JVM: SVML, SLEEF, and libsimdsort
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Fri Dec 6 23:18:17 UTC 2024
Recently, a trend has emerged of using native libraries to back
intrinsics in the HotSpot JVM. SVML stubs for the Vector API paved the
way, and the SLEEF and simdsort libraries soon followed.
After examining how they are supported, I must confess that it doesn't
look pretty. It introduces significant accidental complexity on the JVM
side. HotSpot has to be taught about every entry point in each library
in an ad-hoc manner. It's inherently unsafe, error-prone to implement,
and hard to maintain: the JVM makes a lot of assumptions about an entry
point based solely on its symbolic name, and each library has its own
naming conventions. Overall, the current approach doesn't scale well.
Fortunately, the new FFI API (java.lang.foreign) was finalized in JDK 22.
It provides enough functionality to interact with native libraries from
Java in a performant manner.
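For readers who have not used java.lang.foreign yet, here is a minimal
sketch of a downcall. It is not taken from the PRs: it binds the C
library's strlen, chosen only because it is available through the
default lookup on common platforms, and the class name StrlenDowncall is
made up for illustration. It assumes JDK 22 or later and a 64-bit
platform where size_t maps to a Java long.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

// Illustrative only: binds libc's strlen, not an SVML/SLEEF/simdsort entry point.
final class StrlenDowncall {
    private static final Linker LINKER = Linker.nativeLinker();

    // Look up the symbol once and describe its C signature: size_t strlen(const char*).
    // Assumption: size_t is 64 bits wide, so it is modeled as JAVA_LONG.
    private static final MethodHandle STRLEN = LINKER.downcallHandle(
            LINKER.defaultLookup().find("strlen").orElseThrow(),
            FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));

    static long strlen(String s) {
        // A confined arena frees the native copy of the string when the block exits.
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment cString = arena.allocateFrom(s); // NUL-terminated UTF-8 copy
            return (long) STRLEN.invokeExact(cString);
        } catch (Throwable t) {
            throw new IllegalStateException("native call failed", t);
        }
    }

    public static void main(String[] args) {
        System.out.println(strlen("libsimdsort"));
    }
}
```

The point of the sketch is that the symbol lookup and the signature
description both live in Java code, so the JVM needs no per-entry-point
knowledge.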
I did an exercise to migrate all 3 libraries away from intrinsics and
the results look promising:
simdsort: https://github.com/openjdk/jdk/pull/22621
SVML/SLEEF: https://github.com/openjdk/jdk/pull/22619
As of now, java.lang.foreign lacks vector calling convention support, so
the actual calls into SVML/SLEEF are still backed by intrinsics. But the
migration still enables a major cleanup on the JVM side.
Also, I wrote library headers and used jextract to produce an initial
sketch of the library API in Java, and it worked really well.
Eventually, this step can be incorporated into the JDK build process to
ensure consistency between the native and Java parts of the library API.
Performance-wise, it is on par with the current (intrinsic-based)
implementation.
One open question relates to CPU dispatching.
Each library exposes multiple functions with different requirements for
CPU ISA extension support (e.g., no AVX vs AVX2 vs AVX512, NEON vs
SVE). Right now, choosing among them is the JVM's responsibility, but
once the JVM is out of the loop, the library itself should make the
decision. I experimented with 2 approaches: (1) perform CPU dispatching
in Java code while linking the library (as illustrated in the
aforementioned PRs); or (2) call into the native library to query it for
the right entry point [1] [2] [3]. In both cases, the decision depends on
an additional API to sense the JVM/hardware capabilities (exposed on
jdk.internal.misc.VM for now).
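To make approach (1) concrete, here is a rough sketch of what
Java-side dispatching could look like. It is not the code from the PRs.
The feature strings, the symbol prefixes, and the EntryPointDispatch
class are all hypothetical, and the CPU features are passed in as a
plain Set<String> because the capability-sensing API is still in flux.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch of Java-side CPU dispatching; all names are illustrative.
final class EntryPointDispatch {
    // One candidate implementation: the ISA feature it requires and the
    // prefix its symbols carry in the native library (both made up here).
    record Variant(String requiredFeature, String symbolPrefix) {}

    // Ordered from most to least capable, so the first match wins.
    private static final List<Variant> X86_VARIANTS = List.of(
            new Variant("avx512dq", "avx512_"),
            new Variant("avx2", "avx2_"));

    // Returns the symbol to link for baseName, or empty if no native variant
    // is usable and the caller should fall back to the pure-Java path.
    static Optional<String> selectSymbol(Set<String> cpuFeatures, String baseName) {
        for (Variant v : X86_VARIANTS) {
            if (cpuFeatures.contains(v.requiredFeature())) {
                return Optional.of(v.symbolPrefix() + baseName);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(selectSymbol(Set.of("avx2", "sse4.2"), "sort"));
    }
}
```

The selected name would then be handed to a symbol lookup and a downcall
handle. In approach (2), the same decision would instead sit behind a
single native query function.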
Let me know if you have any questions/suggestions/concerns. Thanks!
I plan to eventually start publishing PRs to upstream this work.
Best regards,
Vladimir Ivanov
[1]
https://github.com/openjdk/jdk/commit/b6e6f2e20772e86fbf9088bcef01391461c17f11
[2]
https://github.com/iwanowww/jdk/blob/09234832b6419e54c4fc182e77f6214b36afa4c5/src/java.base/share/classes/java/util/SIMDSortLibrary.java
[3]
https://github.com/iwanowww/jdk/blob/09234832b6419e54c4fc182e77f6214b36afa4c5/src/java.base/linux/native/libsimdsort/simdsort.c
More information about the hotspot-compiler-dev
mailing list