[VectorAPI] Enhancement of floating-point math vector API implementation
Xiaohong Gong
Xiaohong.Gong at arm.com
Fri Dec 16 02:50:02 UTC 2022
Hi,
This is a proposal to make the following enhancements to the Vector API
floating-point vector math operations:
1. Intrinsify the vector math APIs on the AArch64 platform by calling a
third-party library, SLEEF [1][2].
2. Fix the issue of inconsistent results between the interpreter and the
C2 compiler [3].
We created a prototype [4] implementing the above enhancements, along
with some cleanup to make the shared code platform-independent.
- Background -
1. Optimize the math APIs for AArch64 platform
Currently, the floating-point math APIs like "SIN/COS/TAN..." are not
intrinsified on the AArch64 platform, which leaves these APIs with a large
performance gap on AArch64. Note that those APIs are intrinsified by the
C2 compiler on X86 platforms, where they are implemented by calling
Intel's SVML library [5]. We'd like to optimize these APIs for AArch64
by intrinsifying them with a vector library, such as SLEEF.
2. Fix the inconsistent result issue from different code paths
As discussed before [3], it is a potential issue of Vector API's FP math
operations: the results may be different when the same API is executed
with C2 and interpreter on x86 systems. The main reason is the API's
default implementation (used by interpreter) calls different library with
the vector intrinsics (used by C2). The default implementation calls the
scalar java.lang.Math APIs, while the intrinsics calls the vector library
(i.e. SVML) on X86. The same to AArch64 if these APIs are intrinsified
by calling a different vector library like SLEEF.
All the implementations are guaranteed to stay within the 1.0 ulp error
bound required by the API's specification, so it is not unreasonable for
different implementations to produce different results for the same input.
Even so, we argue that it is confusing and problematic to see different
results on different runs (interpreter or C2), and it would be better if
we can fix this.
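To make the ulp point concrete, here is a small self-contained sketch in
plain scalar Java (not Vector API code; the two values are illustrative
stand-ins for results from two different libraries) showing that two
results can both satisfy a 1.0 ulp error bound and still compare unequal:

```java
public class UlpDemo {
    public static void main(String[] args) {
        // Result from one hypothetical implementation:
        double a = Math.sin(1.0);
        // The adjacent representable double, standing in for the result
        // of a different library that also meets the error bound:
        double b = Math.nextUp(a);

        System.out.println(a == b);                         // false: bitwise different
        System.out.println(Math.abs(b - a) <= Math.ulp(a)); // true: within 1.0 ulp
    }
}
```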
One straightforward fix is to align the implementations used by the
interpreter and C2. We propose to update the APIs' default implementation
to call the native vector library via JNI when it is supported, falling
back to the scalar implementation otherwise. The new default
implementation looks like this:
```
@ForceInline
final
FloatVector mathUOpTemplate(int opd, VectorMask<Float> m, FUnOp f) {
    // Check whether a native vector implementation exists for this op.
    if (VectorSupport.hasNativeImpl(opd, float.class, length())) {
        float[] vec = vec();
        // Call the native vector library via JNI on the unmasked input.
        Object obj = VectorSupport.nativeImpl(opd, float.class, length(), vec, null);
        if (obj != null) {
            float[] res = (float[]) obj;
            if (m != null) {
                // Apply the mask: unset lanes keep the original input value.
                boolean[] mbits = ((AbstractMask<Float>)m).getBits();
                for (int i = 0; i < res.length; i++) {
                    res[i] = mbits[i] ? res[i] : vec[i];
                }
            }
            return vectorFactory(res);
        }
    }
    // Fall back to the original scalar implementation.
    return uOpTemplate(m, f);
}
```
The first if-statement checks whether the given operation can be vectorized
with a vector math library on the current hardware. If so, the library is
called via JNI; otherwise, we fall back to the original scalar
implementation, i.e. uOpTemplate(m, f).
It's worth noting that the check and the vector library used by the default
implementation are identical to those used by the C2 vector intrinsics.
This guarantees that the interpreter and C2 use the same implementation
and therefore produce the same results.
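For clarity, the masked-merge loop in the snippet above is a plain per-lane
blend. The following standalone sketch (with made-up lane values) shows
its semantics:

```java
import java.util.Arrays;

public class MaskMergeDemo {
    public static void main(String[] args) {
        float[] vec   = {1f, 2f, 3f, 4f};       // original input lanes
        float[] res   = {10f, 20f, 30f, 40f};   // native library result
        boolean[] mbits = {true, false, true, false};
        // Lanes where the mask bit is false keep the original input value:
        for (int i = 0; i < res.length; i++) {
            res[i] = mbits[i] ? res[i] : vec[i];
        }
        System.out.println(Arrays.toString(res)); // [10.0, 2.0, 30.0, 4.0]
    }
}
```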
- Prototype -
We created a prototype [4] implementing the above enhancements. Here are
the main changes:
1. Optimize these math APIs by calling SLEEF for AArch64 NEON and SVE:
- By default we choose the 1.0 ULP accuracy versions with FMA instructions
for most of the operations. For the APIs where SLEEF does not provide a
1.0 ULP version, we choose 0.5 ULP instead.
- In the prototype, we did not use the SLEEF versions that give consistent
results across all platforms.
- We add the vector calling convention for AArch64 NEON and SVE.
- The system library (i.e. libsleef-dev [2]) must be installed before SLEEF
can be used. If it is not installed, the math vector calls still use the
current default scalar version without error.
2. Fix the result-inconsistency issue on the following fronts:
1) Java API
- Add two native methods, "hasNativeImpl()" and "nativeImpl()". The
former determines whether the underlying hardware supports a
vectorized native implementation for the given op and vector type.
Given the op, vector type and vector inputs, the latter returns the
result by calling the native library.
- Change the floating-point math APIs' default implementation.
2) Hotspot
- Implement the two native methods via JNI.
- Add a stub routine (i.e. _vector_math_wrapper) that wraps the call to
the native vector implementation to bridge the ABI difference: the native
method's inputs and output are array addresses, while the vector
implementation works on vector registers. The wrapper performs a vector
load before calling the vector library, and a vector store afterwards to
save the result.
- Add the initial code generation for "_vector_math_wrapper" on the X86
and AArch64 platforms.
3) Test
- Change the existing jtreg test cases. Each API is run twice with the
same input, with and without a warmup loop, and the computation results
of the two runs are compared. Without the new changes to the APIs'
default implementation, these tests fail on X86 and AArch64.
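The test pattern described above can be sketched as follows. This is a
minimal stand-alone illustration, not the actual jtreg test: scalar
StrictMath.sin stands in for the Vector API lanewise op (StrictMath is
chosen here because it is specified to be bit-reproducible, so the sketch
always passes; the real tests exercise the vector operations, where the
comparison only holds with the new default implementation):

```java
import java.util.Arrays;

public class WarmupConsistencyTest {
    // Stand-in for the math op under test (assumption: the real tests
    // call a Vector API lanewise operation such as SIN here).
    static double[] compute(double[] in) {
        double[] out = new double[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = StrictMath.sin(in[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] input = {0.5, 1.0, 1.5, 2.0};
        double[] cold = compute(input);          // likely interpreted
        for (int i = 0; i < 20_000; i++) {       // warmup: may trigger C2 compilation
            compute(input);
        }
        double[] hot = compute(input);           // possibly C2-compiled
        if (!Arrays.equals(cold, hot)) {
            throw new AssertionError("results differ between cold and warm runs");
        }
        System.out.println("consistent");
    }
}
```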
- Performance -
After enabling SLEEF on AArch64, these APIs' JMH benchmarks [6] improve
by about 1.5x ~ 12x on NEON, and 3x ~ 62x on SVE with a 512-bit vector
size.
[1] https://sleef.org/
[2] https://packages.debian.org/bookworm/libsleef-dev
[3] https://mail.openjdk.org/pipermail/panama-dev/2022-August/017372.html
[4] https://github.com/XiaohongGong/jdk/tree/vectorapi-fp-math
[5] https://github.com/openjdk/jdk/pull/3638
[6] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/FloatMaxVector.java
Any feedback is appreciated! Thanks in advance!
Best Regards,
Xiaohong Gong