[VectorAPI] Enhancement of floating-point math vector API implementation

Thu Jan 5 23:10:46 UTC 2023

Thanks for looking into that, Xiaohong!

> 1. Intrinsify the vector math APIs for AArch64 platform by calling a
>    third party library SLEEF [1][2].
> 2. Fix the interpreter and c2 compiler inconsistent result issue [3].

I'd suggest addressing these problems separately.

Speaking of #2, sharing the implementation across all execution modes
requires 2 native entry points per vector math method: the interpreter
and C1 work with the boxed representation, while C2 is able to perform
vector calls when intrinsification succeeds.

The simplest way to proceed would be to code boxed variants as JNI 
wrappers and put them into a native library bundled with the JDK.
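
As a rough illustration, the Java side of one boxed variant could look
like the sketch below. The class `BoxedVectorMath`, the library name
`vectormathwrappers`, and the native method `sin0` are hypothetical and
are not part of the JDK or the prototype. The sketch assumes lanes are
passed as a plain `float[]` and falls back to scalar `Math.sin` when the
library cannot be loaded.

```java
final class BoxedVectorMath {
    // Hypothetical library name; a real one would be bundled with the JDK image.
    private static final boolean NATIVE_LOADED = tryLoad("vectormathwrappers");

    private static boolean tryLoad(String name) {
        try {
            System.loadLibrary(name);
            return true;
        } catch (UnsatisfiedLinkError | SecurityException e) {
            return false; // library not present: use the Java fallback
        }
    }

    // Hypothetical JNI entry point: lanes go in and come out as a Java array.
    private static native float[] sin0(float[] lanes);

    static float[] sin(float[] lanes) {
        if (NATIVE_LOADED) {
            return sin0(lanes);
        }
        // Scalar fallback, one lane at a time.
        float[] res = new float[lanes.length];
        for (int i = 0; i < lanes.length; i++) {
            res[i] = (float) Math.sin(lanes[i]);
        }
        return res;
    }
}
```

Since the wrapper only serves the interpreter and C1, the array copy on
each call should be tolerable.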

I question the decision to have the JVM generate such wrappers. They
aren't performance-critical, and generating them adds a significant
amount of complexity to the JVM.

For background, the current state of SVML support is transient. What was
there in the original prototype (the vector math stubs were part of the
JVM) was not acceptable for a release. So, as a stopgap solution, the
SVML stubs were put into a separate library which was still tightly
coupled with the JVM. The plan is to eventually migrate to the Foreign
Linker API once it supports vector calling conventions.

Once the Foreign Linker API supports vector calling conventions and
agrees with the Vector API on the way vectors are represented, there'll
be no need to intrinsify vector math routines at all. Foreign linker
support in C2 should do the whole job by itself. But we are not there yet.

I suggest you look in a similar direction: represent vector math
routines as `j.l.i.MethodHandle`s and perform dispatching between Java
and native implementations during class initialization. It won't
immediately address the intrinsification aspect: all vector math routines
called from generated code still have to be well-known to the JVM.
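
To make the idea concrete, here is a minimal sketch of such dispatching.
Everything in it is illustrative: the class `VectorMathDispatch`, the
system property `demo.vectormath.native`, and the `sinNative`
placeholder are hypothetical names, and the "native" variant simply
delegates to the Java one so the example runs without a native library.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

final class VectorMathDispatch {
    // Hypothetical switch standing in for "a native vector math library was found".
    private static final boolean NATIVE_AVAILABLE =
            Boolean.getBoolean("demo.vectormath.native");

    // Chosen exactly once, while the class is being initialized.
    private static final MethodHandle SIN;

    static {
        try {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            MethodType type = MethodType.methodType(float[].class, float[].class);
            String target = NATIVE_AVAILABLE ? "sinNative" : "sinJava";
            SIN = lookup.findStatic(VectorMathDispatch.class, target, type);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Pure Java implementation: scalar math applied lane by lane.
    static float[] sinJava(float[] lanes) {
        float[] res = new float[lanes.length];
        for (int i = 0; i < lanes.length; i++) {
            res[i] = (float) Math.sin(lanes[i]);
        }
        return res;
    }

    // Placeholder for a native-backed implementation (assumption: a real one
    // would call into the bundled library instead of delegating to Java).
    static float[] sinNative(float[] lanes) {
        return sinJava(lanes);
    }

    // Every caller goes through the handle selected at class initialization.
    static float[] sin(float[] lanes) {
        try {
            return (float[]) SIN.invokeExact(lanes);
        } catch (Throwable t) {
            throw new AssertionError(t);
        }
    }
}
```

Because the handle sits in a `static final` field, the choice is fixed
for the lifetime of the class, so every execution mode goes through the
same implementation.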

Speaking of the 3rd party library dependency, I don't have enough
knowledge right now to comment on what would be acceptable for the JDK.
For a prototype, it should be fine to extend the JDK build system with an
additional parameter which points at the SLEEF library location. During
the JDK build, that would allow building the wrapper and optionally
bundling a copy of the SLEEF library in the JDK image, which the JDK then
uses at runtime. That would make it mostly on par with the existing SVML
support. Then both implementations (SVML on x86 and SLEEF on AArch64)
can be improved jointly.

Best regards,
Vladimir Ivanov

> 
> We created a prototype [4] to do the above enhancement as well as
> some cleanup to make the shared code platform independent.
> 
> - Background -
> 
> 1. Optimize the math APIs for AArch64 platform
> 
> Currently the floating-point math APIs like "SIN/COS/TAN..." are not
> intrinsified on the AArch64 platform. This leaves a large performance
> gap for these APIs on AArch64. Note that those APIs are intrinsified
> by the C2 compiler on X86 platforms. They are implemented by calling
> Intel's SVML library [5]. We'd like to optimize these APIs for
> AArch64 by intrinsifying them with a vector library, such as SLEEF.
> 
> 2. Fix the inconsistent result issue from different code paths
> 
> As discussed before [3], there is a potential issue with Vector API's FP math
> operations: the results may differ when the same API is executed with C2
> and with the interpreter on x86 systems. The main reason is that the API's
> default implementation (used by the interpreter) calls a different library
> from the vector intrinsics (used by C2). The default implementation calls the
> scalar java.lang.Math APIs, while the intrinsics call the vector library
> (i.e. SVML) on X86. The same applies to AArch64 if these APIs are intrinsified
> by calling a different vector library like SLEEF.
> 
> The implementations are all guaranteed to be within the 1.0 ulp error bound
> required by the API's specification. And it sounds reasonable that
> different implementations generate different results for the same
> input. Even so, we argue that it's still confusing and problematic to see
> different results on different runs (could be interpreter or C2), and
> it would be better if we can fix this.
> 
> One straightforward fix is to align the implementations used by the
> interpreter and C2. Here we propose to update the APIs' default
> implementation to use the native vector library via JNI if it is supported,
> and to fall back to the scalar implementation if not. Here is the new
> code of the default implementation:
> 
> ```
> @ForceInline
> final
> FloatVector mathUOpTemplate(int opd, VectorMask<Float> m, FUnOp f) {
>      // Use the native vector library when it supports this op and shape.
>      if (VectorSupport.hasNativeImpl(opd, float.class, length())) {
>          float[] vec = vec();
>          Object obj = VectorSupport.nativeImpl(opd, float.class, length(), vec, null);
>          if (obj != null) {
>              float[] res = (float[]) obj;
>              if (m != null) {
>                  // Masked op: unset lanes keep their original input value.
>                  boolean[] mbits = ((AbstractMask<Float>)m).getBits();
>                  for (int i = 0; i < res.length; i++) {
>                      res[i] = mbits[i] ? res[i] : vec[i];
>                  }
>              }
>              return vectorFactory(res);
>          }
>      }
>      // Otherwise fall back to the original scalar implementation.
>      return uOpTemplate(m, f);
> }
> ```
> 
> The first if-statement checks whether the given operation can be vectorized with
> a vector math library on the current hardware. If so, it calls the library via JNI.
> Otherwise, it falls back to the original scalar implementation, i.e. uOpTemplate(m, f).
> It's worth noting that the check and the vector library we use for the default
> implementation are identical to those for the C2 vector intrinsics. In this way, we
> can guarantee that both the interpreter and C2 use the same implementation and
> therefore produce the same result.
> 
> - Prototype -
> 
> We created one prototype [4] to do the above enhancement. Here are the
> main changes:
> 
> 1. Optimize these math APIs by calling SLEEF for AArch64 NEON and SVE:
>    - By default, we choose the 1.0 ULP accuracy versions that use FMA
>     instructions for most of the operations. For those APIs where SLEEF
>     does not support 1.0 ULP, we choose 0.5 ULP instead.
>    - We didn't use the version that can give consistent results across all
>     platforms in the prototype.
>    - We add the vector calling convention for AArch64 NEON and SVE.
>    - The system library (i.e. libsleef-dev [2]) should be installed before using
>     SLEEF. If it is not installed, the vector math call will still use the current
>     default scalar version without error.
> 
> 2. Fix the result difference issue in the following areas:
> 1) Java API
>    - Add two native methods "hasNativeImpl()" and "nativeImpl()". The
>     former determines whether the underlying hardware supports the
>     vectorized native implementation for the given op and vector type.
>     The latter takes the op, vector type and vector inputs, and returns
>     the result by calling the native library.
>    - Change the floating-point math APIs' default implementation.
> 
> 2) Hotspot
>    - Implement the two native methods via JNI.
>    - Add a stub routine (i.e. _vector_math_wrapper). It is a wrapper that
>     calls the native vector implementation and handles the ABI difference:
>     the inputs and output of the native method are array addresses, while
>     the vector implementation takes and returns vector registers. So we
>     need to do a vector load before calling the vector library, and a
>     vector store to save the result.
>    - Add the initial code generation for "_vector_math_wrapper" on X86 and
>     AArch64 platforms.
> 
> 3) Test
>    - Change the existing Jtreg test cases. Each API is run twice with the
>     same input, with and without a loop warmup phase, and the results of
>     the two runs are compared. Without the new changes to the APIs'
>     default implementation, these tests will fail on X86 and AArch64.
> 
> - Performance -
> 
> After enabling SLEEF for AArch64, the performance of these APIs' JMH benchmarks
> [6] improves by about 1.5x to 12x on NEON, and by 3x to 62x on SVE with a
> 512-bit vector size.
> 
> [1] https://sleef.org/
> [2] https://packages.debian.org/bookworm/libsleef-dev
> [3] https://mail.openjdk.org/pipermail/panama-dev/2022-August/017372.html
> [4] https://github.com/XiaohongGong/jdk/tree/vectorapi-fp-math
> [5] https://github.com/openjdk/jdk/pull/3638
> [6] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/FloatMaxVector.java
> 
> Any feedback is appreciated! Thanks in advance!
> 
> Best Regards,
> Xiaohong Gong