[VectorAPI] Enhancement of floating-point math vector API implementation

Vladimir Ivanov vladimir.x.ivanov at oracle.com
Fri Jan 6 19:02:37 UTC 2023


As an idea for an incremental improvement, consider lifting the native 
library linkage code from the JVM into JDK code and refactoring the 
corresponding intrinsics (new ones would be required, I believe) to 
accept the specialized entry point address of the corresponding 
vectorized math routine instead of the operation code. Then C2 could use 
that address to generate a direct leaf call into the stub.

That would make the JVM code library-agnostic and abstract away all 
the differences between the SVML and SLEEF libraries (and, in the future, 
any other library added as a backing implementation).

At the JDK level, multiple plug-in implementations could be supported, 
with the final decision about which implementation to use made at 
runtime, depending on the presence of the required libraries or on user 
choice.
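
As a rough illustration of that JDK-level selection, here is a minimal 
sketch; all names in it (the system property, the library names, the 
loadable() helper) are made up for illustration and are not part of any 
existing API:

```
// Sketch only: pick a backing vector math library at runtime, honoring an
// explicit user choice first, then probing for whatever library is present,
// and otherwise falling back to the scalar Java implementation.
enum VectorMathBackend { SVML, SLEEF, JAVA_FALLBACK }

final class VectorMathLibraries {
    static final VectorMathBackend BACKEND = select();

    private static VectorMathBackend select() {
        String pref = System.getProperty("jdk.incubator.vector.mathLibrary", "");
        if (pref.equalsIgnoreCase("sleef") && loadable("sleef")) return VectorMathBackend.SLEEF;
        if (pref.equalsIgnoreCase("svml")  && loadable("svml"))  return VectorMathBackend.SVML;
        if (loadable("svml"))  return VectorMathBackend.SVML;   // e.g. x86 with a bundled SVML stub library
        if (loadable("sleef")) return VectorMathBackend.SLEEF;  // e.g. AArch64 with libsleef installed
        return VectorMathBackend.JAVA_FALLBACK;
    }

    private static boolean loadable(String name) {
        try {
            System.loadLibrary(name);   // looks for e.g. libsleef.so / libsvml.so on the library path
            return true;
        } catch (UnsatisfiedLinkError e) {
            return false;
        }
    }
}
```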

Best regards,
Vladimir Ivanov

On 1/5/23 15:10, Vladimir Ivanov wrote:
> Thanks for looking into that, Xiaohong!
> 
>> 1. Intrinsify the vector math APIs for the AArch64 platform by calling a
>>    third-party library, SLEEF [1][2].
>> 2. Fix the inconsistent result issue between the interpreter and the C2
>>    compiler [3].
> 
> I'd suggest addressing these problems separately.
> 
> Speaking of #2, sharing the implementation across all execution modes 
> requires two native entry points per vector math method: the interpreter 
> and C1 work with the boxed representation, while C2 is able to perform 
> vector calls when intrinsification succeeds.
> 
> The simplest way to proceed would be to code boxed variants as JNI 
> wrappers and put them into a native library bundled with the JDK.
> 
> I question the decision to have the JVM generate such wrappers. They 
> aren't performance-critical, and doing so adds a significant amount of 
> complexity to the JVM.
> 
> For background, the current state of SVML support is transient. What was 
> there in the original prototype (the vector math stubs were part of the 
> JVM) was not acceptable for a release. So, as a stopgap solution, the 
> SVML stubs were put into a separate library which was still tightly 
> coupled with the JVM. The plan is to eventually migrate to the Foreign 
> Linker API once it supports vector calling conventions.
> 
> Once the Foreign Linker API supports vector calling conventions and agrees 
> with the Vector API on the way vectors are represented, there'll be no need 
> to intrinsify vector math routines at all. Foreign Linker support in C2 
> should do all the work by itself. But we are not there yet.
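> 
> To make that end state concrete, here is a minimal sketch of a Foreign 
> Linker downcall to a scalar SLEEF routine. It uses the java.lang.foreign 
> API as finalized in later JDKs (the preview API differs slightly), the 
> Linux library name, and assumes libsleef is installed; only a scalar call 
> is shown because vector calling conventions aren't supported yet.
> 
> ```
> import java.lang.foreign.*;
> import java.lang.invoke.MethodHandle;
> 
> public class SleefDowncall {
>     public static void main(String[] args) throws Throwable {
>         Linker linker = Linker.nativeLinker();
>         // Load SLEEF and bind its scalar 1.0-ULP sine: double Sleef_sin_u10(double)
>         SymbolLookup sleef = SymbolLookup.libraryLookup("libsleef.so", Arena.global());
>         MethodHandle sin = linker.downcallHandle(
>                 sleef.find("Sleef_sin_u10").orElseThrow(),
>                 FunctionDescriptor.of(ValueLayout.JAVA_DOUBLE, ValueLayout.JAVA_DOUBLE));
>         System.out.println((double) sin.invokeExact(0.5));
>     }
> }
> ```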
> 
> I suggest you look in a similar direction: represent vector math 
> routines as `j.l.i.MethodHandle`s and perform the dispatching between Java 
> and native implementations during class initialization. It won't 
> immediately address the intrinsification aspect: all vector math routines 
> called from generated code still have to be well-known to the JVM.
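> 
> A minimal sketch of that shape (all names are illustrative, and the 
> "native" branch is a placeholder for the real JNI/downcall entry point): 
> choose the implementation once, in a static initializer, and expose it 
> as a MethodHandle.
> 
> ```
> import java.lang.invoke.MethodHandle;
> import java.lang.invoke.MethodHandles;
> import java.lang.invoke.MethodType;
> 
> final class FloatSinDispatch {
>     static final MethodHandle SIN;
> 
>     static {
>         try {
>             MethodHandles.Lookup l = MethodHandles.lookup();
>             MethodType t = MethodType.methodType(float[].class, float[].class);
>             // vectorLibraryPresent() stands in for the real capability check
>             SIN = vectorLibraryPresent()
>                     ? l.findStatic(FloatSinDispatch.class, "nativeSin", t)
>                     : l.findStatic(FloatSinDispatch.class, "scalarSin", t);
>         } catch (ReflectiveOperationException e) {
>             throw new ExceptionInInitializerError(e);
>         }
>     }
> 
>     private static boolean vectorLibraryPresent() { return false; }  // placeholder
> 
>     private static float[] nativeSin(float[] v) {  // placeholder for the native path
>         throw new UnsupportedOperationException();
>     }
> 
>     private static float[] scalarSin(float[] v) {  // scalar Java fallback
>         float[] r = new float[v.length];
>         for (int i = 0; i < v.length; i++) r[i] = (float) Math.sin(v[i]);
>         return r;
>     }
> }
> ```
> 
> Callers would then do `(float[]) FloatSinDispatch.SIN.invokeExact(vec)`.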
> 
> Speaking of the 3rd party library dependency, I don't have enough knowledge 
> right now to comment on what would be acceptable for the JDK. For a prototype, 
> it should be fine to extend the JDK build system with an additional parameter 
> which points at the SLEEF library location. That would allow the JDK build 
> to build the wrapper and optionally bundle a copy of the SLEEF library in 
> the JDK image, which is then used at runtime by the JDK. That would make 
> it mostly on par with the existing SVML support. Then both implementations 
> (SVML on x86 and SLEEF on AArch64) can be improved jointly.
> 
> Best regards,
> Vladimir Ivanov
> 
>>
>> We created a prototype [4] to do the above enhancement, as well as
>> some cleanup to make the shared code platform-independent.
>>
>> - Background -
>>
>> 1. Optimize the math APIs for AArch64 platform
>>
>> Currently the floating-point math APIs like "SIN/COS/TAN..." are not
>> intrinsified on the AArch64 platform, which leaves these APIs with a large
>> performance gap on AArch64. Note that those APIs are intrinsified by the
>> C2 compiler on X86 platforms, where they are implemented by calling
>> Intel's SVML library [5]. We'd like to optimize these APIs for
>> AArch64 by intrinsifying them with a vector library, such as SLEEF.
>>
>> 2. Fix the inconsistent result issue from different code paths
>>
>> As discussed before [3], there is a potential issue with the Vector API's
>> FP math operations: the results may be different when the same API is
>> executed with C2 and with the interpreter on x86 systems. The main reason
>> is that the API's default implementation (used by the interpreter) calls a
>> different library than the vector intrinsics (used by C2): the default
>> implementation calls the scalar java.lang.Math APIs, while the intrinsics
>> call the vector library (i.e. SVML) on X86. The same applies to AArch64 if
>> these APIs are intrinsified by calling a different vector library like SLEEF.
>>
>> The implementations are all guaranteed to stay within a 1.0 ulp error
>> bound, which follows the API's specification, so it is not unreasonable
>> for different implementations to generate different results for the same
>> input. Even so, we argue that it's still confusing and problematic to see
>> different results on different runs (whether interpreter or C2), and it
>> would be better if we could fix this.
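>>
>> As a tiny illustration of the slack a 1.0 ulp bound leaves: adjacent floats
>> differ by exactly one ulp, so when the exact result falls between two
>> representable values, either neighbor can be a conforming answer (the sine
>> example here is only for concreteness, not taken from either library).
>>
>> ```
>> float a = (float) Math.sin(0.5);   // one candidate answer
>> float b = Math.nextUp(a);          // the adjacent float, one ulp away
>> System.out.println(a + " vs " + b + ", ulp = " + Math.ulp(a));
>> ```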
>>
>> One straightforward fix is to align the implementations used by the
>> interpreter and C2. Here we propose to update the APIs' default
>> implementation to use the native vector library via JNI if it is supported,
>> falling back to the scalar implementation if not. The new code of the
>> default implementation looks like this:
>>
>> ```
>> @ForceInline
>> final
>> FloatVector mathUOpTemplate(int opd, VectorMask<Float> m, FUnOp f) {
>>     if (VectorSupport.hasNativeImpl(opd, float.class, length())) {
>>         float[] vec = vec();
>>         Object obj = VectorSupport.nativeImpl(opd, float.class, length(), vec, null);
>>         if (obj != null) {
>>             float[] res = (float[]) obj;
>>             if (m != null) {
>>                 boolean[] mbits = ((AbstractMask<Float>)m).getBits();
>>                 for (int i = 0; i < res.length; i++) {
>>                     res[i] = mbits[i] ? res[i] : vec[i];
>>                 }
>>             }
>>             return vectorFactory(res);
>>         }
>>     }
>>     return uOpTemplate(m, f);
>> }
>> ```
>>
>> The first if statement checks whether the given operation can be vectorized
>> with a vector math library on the current hardware. If so, it calls the
>> library via JNI; otherwise, it falls back to the original scalar
>> implementation, i.e. uOpTemplate(m, f). It's worth noting that the check
>> and the vector library we use for the default implementation are identical
>> to those used for the C2 vector intrinsics. In this way, we can guarantee
>> that both the interpreter and C2 use the same implementation and therefore
>> produce the same result.
>>
>> - Prototype -
>>
>> We created a prototype [4] to do the above enhancement. Here are the
>> main changes:
>>
>> 1. Optimize these math APIs by calling SLEEF for AArch64 NEON and SVE:
>>    - By default we choose the 1.0 ULP accuracy versions that use FMA
>>     instructions for most of the operations. For those APIs where SLEEF
>>     does not offer a 1.0 ULP version, we choose the 0.5 ULP version instead.
>>    - In the prototype we didn't use the versions that give consistent
>>     results across all platforms.
>>    - We add the vector calling convention for AArch64 NEON and SVE.
>>    - The system library (i.e. libsleef-dev [2]) should be installed before
>>     using SLEEF. If it is not installed, the vector math calls still use
>>     the current default scalar version without error.
>>
>> 2. Fix the result difference issue on the following fronts:
>> 1) Java API
>>    - Add two native methods, "hasNativeImpl()" and "nativeImpl()". The
>>     former determines whether the underlying hardware supports a vectorized
>>     native implementation for the given op and vector type. Given the op,
>>     vector type and vector inputs, the latter returns the result by calling
>>     the native library. (Possible declarations are sketched below.)
>>    - Change the floating point math APIs' default implementation.
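>>
>> A sketch of what those declarations might look like; the signatures are
>> inferred from the call sites in the code above, and the prototype's actual
>> declarations may differ:
>>
>> ```
>> // Hypothetical additions to jdk.internal.vm.vector.VectorSupport
>> // (illustrative only):
>> public class VectorSupport {
>>     // true if a vectorized native implementation exists for this op/type/length
>>     public static native boolean hasNativeImpl(int opd, Class<?> elemType, int length);
>>
>>     // computes the op over 'vec' via the native library and returns the result
>>     // array, or null if the call could not be performed; the last argument is
>>     // passed as null in the snippet above
>>     public static native Object nativeImpl(int opd, Class<?> elemType, int length,
>>                                            Object vec, Object mask);
>> }
>> ```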
>>
>> 2) Hotspot
>>    - Add the implementations of the two native methods via JNI.
>>    - Add a stub routine (i.e. _vector_math_wrapper). It is a wrapper that
>>     calls the native vector implementation and handles the ABI difference:
>>     the inputs and output of the native method are array addresses, while
>>     the vector implementation works on vector registers. So we need a
>>     vector load before calling the vector library, and a vector store to
>>     save the result.
>>    - Add the initial code generation for "_vector_math_wrapper" on the X86
>>     and AArch64 platforms.
>>
>> 3) Test
>>    - Change the existing jtreg test cases. Each API is run twice with the
>>     same input, once without and once with a loop warmup phase, and the
>>     computation results of the two runs are compared (see the sketch
>>     below). Without the new changes to the APIs' default implementation,
>>     these tests will fail on X86 and AArch64.
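>>
>> Roughly, the comparison the tests perform looks like this (an illustrative
>> standalone sketch, not the actual jtreg change; run with
>> --add-modules jdk.incubator.vector):
>>
>> ```
>> import jdk.incubator.vector.FloatVector;
>> import jdk.incubator.vector.VectorOperators;
>> import jdk.incubator.vector.VectorSpecies;
>> import java.util.Arrays;
>>
>> public class SinConsistencyCheck {
>>     static final VectorSpecies<Float> S = FloatVector.SPECIES_PREFERRED;
>>
>>     static float[] sinOnce(float[] in) {
>>         float[] out = new float[in.length];
>>         for (int i = 0; i < in.length; i += S.length()) {
>>             FloatVector.fromArray(S, in, i)
>>                        .lanewise(VectorOperators.SIN)
>>                        .intoArray(out, i);
>>         }
>>         return out;
>>     }
>>
>>     public static void main(String[] args) {
>>         float[] in = new float[S.length() * 64];
>>         for (int i = 0; i < in.length; i++) in[i] = i * 0.1f;
>>
>>         float[] cold = sinOnce(in);                    // interpreter/C1 path
>>         for (int i = 0; i < 20_000; i++) sinOnce(in);  // warm up so C2 compiles/intrinsifies
>>         float[] warm = sinOnce(in);                    // compiled path
>>
>>         if (!Arrays.equals(cold, warm))
>>             throw new AssertionError("results differ between execution modes");
>>     }
>> }
>> ```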
>>
>> - Performance -
>>
>> After enabling SLEEF for AArch64, the performance of these APIs' JMH
>> benchmarks [6] improves by about 1.5x ~ 12x on NEON, and by 3x ~ 62x on SVE
>> with a 512-bit vector size.
>>
>> [1] https://sleef.org/
>> [2] https://packages.debian.org/bookworm/libsleef-dev
>> [3] https://mail.openjdk.org/pipermail/panama-dev/2022-August/017372.html
>> [4] https://github.com/XiaohongGong/jdk/tree/vectorapi-fp-math
>> [5] https://github.com/openjdk/jdk/pull/3638
>> [6] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/FloatMaxVector.java
>>
>> Any feedback is appreciated! Thanks in advance!
>>
>> Best Regards,
>> Xiaohong Gong

