[VectorAPI] Enhancement of floating-point math vector API implementation

Wed Jan 4 08:47:40 UTC 2023

Hi Paul,

I'm sorry that I misunderstood your comment about the following 0.5 ULP issue.

>> In your code here you set 0.5 ULP for VECTOR_OP_HYPOT, but it's not reset for the other ops:

>> https://github.com/XiaohongGong/jdk/blob/e728a8f420a3927c3b9ea9dc857a717057dc2f6e/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#L8225

>> It just so happens VECTOR_OP_HYPOT is the last math library op! But it's probably best not to assume that.

> The reason I use 0.5 ULP for VECTOR_OP_HYPOT is that SLEEF doesn't have a 1.0 ULP implementation for it. For the other ops, we use 1.0 ULP by default.

That is indeed an issue. Thanks for pointing it out! It would be better to add an "else" block that handles the other ops, like:

```
if (vop == VECTOR_OP_HYPOT) {
  ulf = "u05";  // SLEEF has no 1.0 ULP version of hypot
} else {
  ulf = "u10";  // default for all other ops: 1.0 ULP
}
```
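
For illustration only, here is a minimal Java sketch of the same per-op selection. The real code is C++ in the stub generator; the enum, the method names, and the loop below are hypothetical stand-ins. Because the suffix is computed afresh for each op, the result no longer depends on VECTOR_OP_HYPOT being the last op:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class UlpSuffixSketch {
    // Hypothetical stand-ins for the VECTOR_OP_* constants. The order is chosen
    // deliberately so that HYPOT is NOT the last op.
    enum MathOp { SIN, HYPOT, COS, TAN }

    // Returns the accuracy suffix for a single op: "u05" (0.5 ULP) for HYPOT,
    // "u10" (1.0 ULP) for every other op.
    static String ulpSuffix(MathOp op) {
        return op == MathOp.HYPOT ? "u05" : "u10";
    }

    // Walks all ops in declaration order, as a stub generator loop might,
    // and records the suffix chosen for each one.
    static Map<MathOp, String> suffixes() {
        Map<MathOp, String> result = new LinkedHashMap<>();
        for (MathOp op : MathOp.values()) {
            result.put(op, ulpSuffix(op));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(suffixes());
    }
}
```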
I will fix it. Thanks again!

Best Regards,
Xiaohong

-----Original Message-----
From: Xiaohong Gong 
Sent: Wednesday, January 4, 2023 4:10 PM
To: Paul Sandoz <paul.sandoz at oracle.com>
Cc: panama-dev at openjdk.java.net; nd <nd at arm.com>
Subject: RE: [VectorAPI] Enhancement of floating-point math vector API implementation

Hi Paul,

Thanks so much for looking at the prototype!

> We will likely need a hotspot -XX option declaring the sleef library path/name, rather than hardcoding it, that when declared enables the functionality. This is about a native external dependency that might need to be supported for many, many years. It's one reason why the SVML stubs were brought into HotSpot, but that is likely impractical for the sleef library. This will require some discussion with the HotSpot reviewers.

The default SLEEF library I used in the prototype is the system library installed by the distro, which can be found through the standard shared-library search mechanism. And yes, it's better to use a HotSpot option to declare the library path/name, with the system library path as its default value.
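
As a rough sketch of that defaulting behaviour, written in Java purely for illustration: the actual option would be a HotSpot -XX flag handled in C++, and the class name, method name, and the library base name "sleef" used here are all assumptions.

```java
public class SleefLibraryPathSketch {
    // Assumed base name of the system library. System.mapLibraryName turns it
    // into the platform-specific file name for the running platform.
    static final String DEFAULT_BASE_NAME = "sleef";

    // Returns the user-supplied path/name if one was given; otherwise returns
    // the platform file name, to be located by the standard shared-library search.
    static String resolveLibrary(String configuredPath) {
        if (configuredPath != null && !configuredPath.isEmpty()) {
            return configuredPath;
        }
        return System.mapLibraryName(DEFAULT_BASE_NAME);
    }

    public static void main(String[] args) {
        // The first command-line argument stands in for the value of the option.
        String configured = args.length > 0 ? args[0] : null;
        System.out.println(resolveLibrary(configured));
    }
}
```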

And I totally agree that the external dependency needs some discussion. There may be some legal issues here. So far we have just chosen SLEEF as one option for a reference implementation, and we haven't spent much time investigating its long-term maintainability and stability. We also need to know whether this kind of third-party library is acceptable to OpenJDK. Once those investigations are done, I will create a PR against the JDK mainline for further discussion.

> In your code here you set 0.5 ULP for VECTOR_OP_HYPOT, but it's not reset for the other ops:

> https://github.com/XiaohongGong/jdk/blob/e728a8f420a3927c3b9ea9dc857a717057dc2f6e/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#L8225

> It just so happens VECTOR_OP_HYPOT is the last math library op! But it's probably best not to assume that.

The reason I use 0.5 ULP for VECTOR_OP_HYPOT is that SLEEF doesn't have a 1.0 ULP implementation for it. For the other ops, we use 1.0 ULP by default.

Thanks,
Xiaohong

-----Original Message-----
From: Paul Sandoz <paul.sandoz at oracle.com>
Sent: Wednesday, January 4, 2023 9:15 AM
To: Xiaohong Gong <Xiaohong.Gong at arm.com>
Cc: panama-dev at openjdk.java.net; nd <nd at arm.com>
Subject: Re: [VectorAPI] Enhancement of floating-point math vector API implementation

Hi Xiaohong,

That looks like some nice work. This solution should also work for tiered compilation.

We will likely need a hotspot -XX option declaring the sleef library path/name, rather than hardcoding it, that when declared enables the functionality. This is about a native external dependency that might need to be supported for many, many years. It's one reason why the SVML stubs were brought into HotSpot, but that is likely impractical for the sleef library. This will require some discussion with the HotSpot reviewers.

I think the approach you have taken in the fallback method is a good one. We could generalize the query to any kind of vector operation, and then differentiate between operations that have and don’t have stubs. This would allow us to surface up the more general query as a Java API point.
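
A minimal sketch of what such a generalized query might look like, in plain Java with no dependency on jdk.incubator.vector. All names here are hypothetical, arrays stand in for vectors, and a lambda stands in for a native stub:

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.function.UnaryOperator;

public class StubQuerySketch {
    // Hypothetical operation set: some ops get a stub registered, others do not.
    enum Op { SIN, NEG }

    // Scalar fallback applied lane by lane when no stub exists.
    interface ScalarOp { float apply(float x); }

    // Stand-in registry of "native stubs"; a real implementation would ask the VM.
    private final Map<Op, UnaryOperator<float[]>> stubs = new EnumMap<>(Op.class);

    void registerStub(Op op, UnaryOperator<float[]> stub) { stubs.put(op, stub); }

    // The general query: does this operation have a stub?
    boolean hasStub(Op op) { return stubs.containsKey(op); }

    // Applies op to every lane, via the stub if present, else via the fallback.
    // Lanes whose mask bit is false keep their input value.
    float[] apply(Op op, float[] in, boolean[] mask, ScalarOp fallback) {
        float[] res;
        if (hasStub(op)) {
            res = stubs.get(op).apply(in.clone()); // stub works on a copy
        } else {
            res = new float[in.length];
            for (int i = 0; i < in.length; i++) res[i] = fallback.apply(in[i]);
        }
        if (mask != null) {
            for (int i = 0; i < res.length; i++) {
                if (!mask[i]) res[i] = in[i];
            }
        }
        return res;
    }

    public static void main(String[] args) {
        StubQuerySketch q = new StubQuerySketch();
        q.registerStub(Op.NEG, a -> { for (int i = 0; i < a.length; i++) a[i] = -a[i]; return a; });
        System.out.println("NEG has stub: " + q.hasStub(Op.NEG));
        System.out.println("SIN has stub: " + q.hasStub(Op.SIN));
    }
}
```

The point of the sketch is only that callers ask one question (hasStub) for any operation, and the stub/no-stub distinction stays behind that query.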

—

In your code here you set 0.5 ULP for VECTOR_OP_HYPOT, but it's not reset for the other ops:

https://github.com/XiaohongGong/jdk/blob/e728a8f420a3927c3b9ea9dc857a717057dc2f6e/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#L8225

It just so happens VECTOR_OP_HYPOT is the last math library op! But it's probably best not to assume that.

Paul.

> On Dec 15, 2022, at 6:50 PM, Xiaohong Gong <Xiaohong.Gong at arm.com> wrote:
> 
> Hi,
> 
> This is a proposal to make the following enhancements to the Vector API's
> floating-point vector math operations:
> 
> 1. Intrinsify the vector math APIs for the AArch64 platform by calling a
> third-party library, SLEEF [1][2].
> 2. Fix the inconsistent results between the interpreter and the C2 compiler [3].
> 
> We created a prototype [4] that implements the above enhancements, along with some
> cleanup to make the shared code platform independent.
> 
> - Background -
> 
> 1. Optimize the math APIs for AArch64 platform
> 
> Currently the floating-point math APIs like "SIN/COS/TAN..." are not
> intrinsified on the AArch64 platform, which leaves these APIs with a large
> performance gap on AArch64. Note that these APIs are intrinsified by the
> C2 compiler on X86 platforms, where they are implemented by calling
> Intel's SVML library [5]. We'd like to optimize these APIs for
> AArch64 by intrinsifying them with a vector library, such as SLEEF.
> 
> 2. Fix the inconsistent result issue from different code paths
> 
> As discussed before [3], there is a potential issue with the Vector API's FP
> math operations: the results may differ when the same API is executed
> with C2 and with the interpreter on x86 systems. The main reason is that the
> API's default implementation (used by the interpreter) calls a different
> library from the vector intrinsics (used by C2). The default implementation
> calls the scalar java.lang.Math APIs, while the intrinsics call the
> vector library (i.e. SVML) on X86. The same would apply to AArch64 if these
> APIs were intrinsified by calling a different vector library like SLEEF.
> 
> The implementations are all guaranteed to stay within the 1.0 ulp error bound
> required by the API's specification, so it is reasonable for
> different implementations to produce different results for the same
> input. Even so, we argue that it is still confusing and problematic to
> see different results on different runs (which could be in the interpreter
> or in C2), and it would be better if we can fix this.
> 
> One straightforward fix is to align the implementations used by the
> interpreter and C2. Here we propose to update the APIs'
> default implementation to use the native vector library via JNI if
> it is supported, and to fall back to the scalar implementation if not.
> Here is the new code of the default implementation:
> 
> ```
> @ForceInline
> final
> FloatVector mathUOpTemplate(int opd, VectorMask<Float> m, FUnOp f) {
>    if (VectorSupport.hasNativeImpl(opd, float.class, length())) {
>        float[] vec = vec();
>        Object obj = VectorSupport.nativeImpl(opd, float.class, length(), vec, null);
>        if (obj != null) {
>            float[] res = (float[]) obj;
>            if (m != null) {
>                boolean[] mbits = ((AbstractMask<Float>)m).getBits();
>                for (int i = 0; i < res.length; i++) {
>                    res[i] = mbits[i] ? res[i] : vec[i];
>                }
>            }
>            return vectorFactory(res);
>        }
>    }
>    return uOpTemplate(m, f);
> }
> ```
> 
> The first if-statement checks whether the given operation can be vectorized
> with a vector math library on the current hardware. If so, it calls the library via JNI.
> Otherwise, it falls back to the original scalar implementation, i.e. uOpTemplate(m, f).
> It's worth noting that the check and the vector library used for the
> default implementation are identical to those used for the C2 vector
> intrinsics. In this way, we can guarantee that both the interpreter and C2
> use the same implementation and therefore produce the same result.
> 
> - Prototype -
> 
> We created a prototype [4] that implements the above enhancements. Here are the
> main changes:
> 
> 1. Optimize these math APIs by calling SLEEF for AArch64 NEON and SVE:
>  - By default we choose the 1.0 ULP accuracy versions that use FMA
>   instructions for most of the operations. For those APIs where SLEEF does
>   not provide a 1.0 ULP version, we choose 0.5 ULP instead.
>  - In the prototype we didn't use the versions that give consistent results
>   across all platforms.
>  - We add the vector calling convention for AArch64 NEON and SVE.
>  - The system library (i.e. libsleef-dev [2]) must be installed before SLEEF
>   can be used. If it is not installed, the vector math call still uses the
>   current default scalar version without error.
> 
> 2. Fix the result difference issue in the following areas:
> 1) Java API
>  - Add two native methods, "hasNativeImpl()" and "nativeImpl()". The
>   former determines whether the underlying hardware supports a
>   vectorized native implementation for the given op and vector type.
>   The latter, given the op, vector type and vector inputs, returns the
>   result by calling the native library.
>  - Change the floating point math APIs' default implementation.
> 
> 2) Hotspot
>  - Add the implementation of the two native methods via JNI.
>  - Add a stub routine (i.e. _vector_math_wrapper). It is a wrapper that calls
>   the native vector implementation and handles the ABI difference: the
>   inputs and output of the native method are array addresses, while they are
>   vector registers in the vector implementation. We need to do a vector
>   load before calling the vector library, and a vector store to save the result.
>  - Add the initial code generation for "_vector_math_wrapper" on X86 and
>   AArch64 platforms.
> 
> 3) Test
>  - Change the existing Jtreg test cases. Each API is run twice with the
>   same input, once with and once without a loop warmup phase, and the
>   computation results of the two runs are compared. Without the
>   new changes to the APIs' default implementation, these tests fail on
>   X86 and AArch64.
> 
> - Performance -
> 
> After enabling SLEEF for AArch64, the performance of these APIs' JMH
> benchmarks [6] improves by about 1.5x ~ 12x on NEON, and by 3x ~ 62x on SVE
> with 512-bit vector size.
> 
> [1] https://sleef.org/
> [2] https://packages.debian.org/bookworm/libsleef-dev
> [3] https://mail.openjdk.org/pipermail/panama-dev/2022-August/017372.html
> [4] https://github.com/XiaohongGong/jdk/tree/vectorapi-fp-math
> [5] https://github.com/openjdk/jdk/pull/3638
> [6] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/FloatMaxVector.java
> 
> Any feedback is appreciated! Thanks in advance!
> 
> Best Regards,
> Xiaohong Gong