RFR: 8258932: AArch64: Enhance floating-point Min/MaxReductionV with fminp/fmaxp [v3]

Sat Jan 9 08:16:55 UTC 2021

On Fri, 8 Jan 2021 09:52:02 GMT, Andrew Haley <aph at openjdk.org> wrote:

>> Hi,
>> 
>> As the experiment shows, we do not gain improvements with these.
>> 
>> For `Math.max(max, Math.abs(floatsA[i] - floatsB[i]))`, the code is not vectorized if `COUNT < 12`.
>> When `COUNT == 12`, the node we have is not `Max2F` but `Max4F`.
>> `Math.max(max, floatsA[i] - floatsB[i])` suffers from the same problem: it does not match `Max2F` with a small `COUNT`.
>> 
>> For `Math.max(max, floatsA[i])`, it is not auto-vectorized even with `COUNT = 512`.
>> I think auto-vectorization of this pattern is disabled by JDK-8078563 [1].
>> 
>> One place where `Max2F` with `fmaxp` does pay off is the Vector API; the test code is available in [2].
>> We saw about a `12%` improvement for `reduceLanes(VectorOperators.MAX)` with `FloatVector.SPECIES_64`:
>> Benchmark                         (COUNT)  (seed)  Mode  Cnt    Score   Error  Units
>> # Kunpeng 916, default
>> VectorReduction2FMinMax.maxRed2F      512       0  avgt   10  667.173 ± 0.576  ns/op
>> VectorReduction2FMinMax.minRed2F      512       0  avgt   10  667.172 ± 0.649  ns/op
>> # Kunpeng 916, with fmaxp/fminp
>> VectorReduction2FMinMax.maxRed2F      512       0  avgt   10  592.404 ± 0.885  ns/op
>> VectorReduction2FMinMax.minRed2F      512       0  avgt   10  592.293 ± 0.607  ns/op
>> 
>> I agree the test code for `floats` in `VectorReductionFloatingMinMax.java` is contrived.
>> Do you think we should replace the tests for `MinMaxF` in `VectorReductionFloatingMinMax` with tests in [2]?
>> 
>> [1] https://bugs.openjdk.java.net/browse/JDK-8078563
>> [2] [VectorReduction2FMinMax.java.txt](https://github.com/openjdk/jdk/files/5784948/VectorReduction2FMinMax.java.txt)
>
> I don't think the real problem is only the tests, it's that common cases don't get vectorized.
> Can we fix this code so that it works with `Math.abs()`?
> Are there any examples of plausible Java code that benefit from this optimization?

According to the results of `JMH perfasm`, `Math.max(max, Math.abs(floatsA[i] - floatsB[i]))` is vectorized when `COUNT = 8` on an x86 platform.
On AArch64, however, `floatsB[i] = Math.abs(floatsA[i])` is not vectorized when `COUNT = 10`, and we cannot match `Max2F` for `AbsF` either.
I am going to investigate the failed vectorization and see if we can have `Max2F` matched. Thanks.
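
For reference, the three scalar reduction shapes discussed in this thread can be sketched roughly as below. This is only an illustration: the class name, method names, and the initial value of `max` are hypothetical and are not taken from `VectorReductionFloatingMinMax.java`, and the array length stands in for `COUNT`.

```java
// Hypothetical sketch of the loop shapes discussed above; not the actual test code.
public class MaxReductionShapes {

    // Shape 1: Math.max(max, Math.abs(floatsA[i] - floatsB[i]))
    static float maxAbsDiff(float[] floatsA, float[] floatsB) {
        float max = Float.NEGATIVE_INFINITY; // assumed initial value
        for (int i = 0; i < floatsA.length; i++) {
            max = Math.max(max, Math.abs(floatsA[i] - floatsB[i]));
        }
        return max;
    }

    // Shape 2: Math.max(max, floatsA[i] - floatsB[i])
    static float maxDiff(float[] floatsA, float[] floatsB) {
        float max = Float.NEGATIVE_INFINITY; // assumed initial value
        for (int i = 0; i < floatsA.length; i++) {
            max = Math.max(max, floatsA[i] - floatsB[i]);
        }
        return max;
    }

    // Shape 3: Math.max(max, floatsA[i]), a plain max reduction over one array
    static float maxPlain(float[] floatsA) {
        float max = Float.NEGATIVE_INFINITY; // assumed initial value
        for (int i = 0; i < floatsA.length; i++) {
            max = Math.max(max, floatsA[i]);
        }
        return max;
    }

    public static void main(String[] args) {
        float[] a = {1.0f, -4.0f, 2.5f, 0.0f};
        float[] b = {0.5f, 1.0f, 2.5f, -3.0f};
        System.out.println(maxAbsDiff(a, b)); // prints 5.0
        System.out.println(maxDiff(a, b));    // prints 3.0
        System.out.println(maxPlain(a));      // prints 2.5
    }
}
```

Whether C2 turns any of these loops into a vector reduction depends on the platform and on the trip count, as described above; the sketch only pins down the source-level patterns being compared.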

-------------

PR: https://git.openjdk.java.net/jdk/pull/1925