RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v2]

Galder Zamarreño galder at openjdk.org
Fri Sep 27 14:18:41 UTC 2024


On Wed, 17 Jul 2024 22:48:04 GMT, Jasmine Karthikeyan <jkarthikeyan at openjdk.org> wrote:

>>> The C2 changes look nice! I just added one comment here about style. It would also be good to add some IR tests checking that the intrinsic is creating `MaxL`/`MinL` nodes before macro expansion, and a microbenchmark to compare results.
>> 
>> Thanks for the review. +1 to the IR tests, I'll work on those.
>> 
>> Re: microbenchmark - what do you have exactly in mind? For vectorization performance there is `ReductionPerf` though it's not a microbenchmark per se. Do you want a microbenchmark for the performance of vectorized max/min long? For non-vectorization performance there is `MathBench`.
>> 
>> I would not expect performance differences in `MathBench` because the backend is still the same and this change really benefits vectorization. I've run the min/max long tests on darwin/aarch64 and linux/x64 and indeed I see no difference:
>> 
>> linux/x64
>> 
>> Benchmark          (seed)   Mode  Cnt        Score       Error   Units
>> MathBench.maxLong       0  thrpt    8  1464197.164 ± 27044.205  ops/ms # base
>> MathBench.minLong       0  thrpt    8  1469917.328 ± 25397.401  ops/ms # base
>> MathBench.maxLong       0  thrpt    8  1469615.250 ± 17950.429  ops/ms # patched
>> MathBench.minLong       0  thrpt    8  1456290.514 ± 44455.727  ops/ms # patched
>> 
>> 
>> darwin/aarch64
>> 
>> Benchmark          (seed)   Mode  Cnt        Score        Error   Units
>> MathBench.maxLong       0  thrpt    8  1739341.447 ± 210983.444  ops/ms # base
>> MathBench.minLong       0  thrpt    8  1659547.649 ± 260554.159  ops/ms # base
>> MathBench.maxLong       0  thrpt    8  1660449.074 ± 254534.725  ops/ms # patched
>> MathBench.minLong       0  thrpt    8  1729728.021 ±  16327.575  ops/ms # patched
>
>> Do you want a microbenchmark for the performance of vectorized max/min long?
> 
> Yeah, I think a simple benchmark that tests for long min/max vectorization and reduction would be good. I worry that checking performance manually like in `ReductionPerf` can lead to harder-to-interpret results than a microbenchmark would, especially with VM warmup 😅 Thanks for looking into this!

Following the advice from @jaskarth, I've worked on a JMH benchmark for this intrinsic.

The benchmarks are pretty straightforward, but the data in the arrays is distributed in a way that allows the branch probability to be controlled. The code uses a random increment/decrement algorithm to distribute the data for testing max; to test min, the values are negated (a sketch of the idea follows). Controlling the branching is important because the IR/assembly C2 emits can vary depending on the branch characteristics.
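
Roughly, the idea looks like the following sketch (my illustrative code, not the benchmark's; the class name, method names, increment bound and seed handling are all made up):

import java.util.Random;

// Illustrative sketch only; not the benchmark's actual code.
class MinMaxDataSketch {
    // Fill an array so that a linear scan for the maximum takes the
    // "new max found" branch with roughly branchPercentage probability.
    static long[] fillForMax(int size, int branchPercentage, long seed) {
        Random rnd = new Random(seed);
        long[] data = new long[size];
        long runningMax = 0;
        for (int i = 0; i < size; i++) {
            if (rnd.nextInt(100) < branchPercentage) {
                // Step strictly above the running max: the branch is taken.
                runningMax += 1 + rnd.nextInt(100);
                data[i] = runningMax;
            } else {
                // Stay below the running max: the branch is not taken.
                data[i] = runningMax - (1 + rnd.nextInt(100));
            }
        }
        return data;
    }

    // To test min, negate every element; the same positions then drive
    // the "new min found" branch instead.
    static long[] fillForMin(int size, int branchPercentage, long seed) {
        long[] data = fillForMax(size, branchPercentage, seed);
        for (int i = 0; i < data.length; i++) {
            data[i] = -data[i];
        }
        return data;
    }
}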

First, the non-AVX512 results (only posting the max results for brevity; the same is seen with min):


Benchmark                         (probability)   (size)  Mode  Cnt    Score   Error   Units
MinMaxLoopBench.longReductionMax             50   10000  thrpt    8  107.609 ± 0.149  ops/ms (non-AVX512, base)
MinMaxLoopBench.longReductionMax             80   10000  thrpt    8  107.627 ± 0.150  ops/ms (non-AVX512, base)
MinMaxLoopBench.longReductionMax            100   10000  thrpt    8  238.799 ± 5.028  ops/ms (non-AVX512, base)
MinMaxLoopBench.longReductionMax             50   10000  thrpt    8  107.575 ± 0.088  ops/ms (non-AVX512, patch)
MinMaxLoopBench.longReductionMax             80   10000  thrpt    8  107.594 ± 0.072  ops/ms (non-AVX512, patch)
MinMaxLoopBench.longReductionMax            100   10000  thrpt    8  107.514 ± 0.067  ops/ms (non-AVX512, patch)


The only situation where this PR regresses compared to the current code is when one side of the branch is always taken. Why is this? At 50% and 80% the base code uses cmovlq, but at 100% it uses cmp+jump (the jump goes to an uncommon trap). With the intrinsic this patch adds, a MinL/MaxL node is always created, and macro expansion always turns it into cmovlq.
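
For context, the reduction loop under test has roughly the following shape (an illustrative sketch, not the exact JMH method):

long max = Long.MIN_VALUE;
for (int i = 0; i < data.length; i++) {
    // With the patch this always becomes a MaxL node, which macro expansion
    // lowers to cmovlq. Without it, C2 profiles the inlined comparison and,
    // at 100%, emits cmp+jump with an uncommon trap for the never-taken side.
    max = Math.max(max, data[i]);
}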

Next, the AVX-512 results (note that the results were taken on different machines, so the non-AVX-512 and AVX-512 numbers are not comparable):


Benchmark                         (probability)  (size)   Mode  Cnt    Score   Error   Units
MinMaxLoopBench.longReductionMax             50   10000  thrpt    8  492.327 ± 0.106  ops/ms (AVX512, base)
MinMaxLoopBench.longReductionMax             80   10000  thrpt    8  492.515 ± 0.044  ops/ms (AVX512, base)
MinMaxLoopBench.longReductionMax            100   10000  thrpt    8  232.861 ± 5.859  ops/ms (AVX512, base)
MinMaxLoopBench.longReductionMax             50   10000  thrpt    8  492.563 ± 0.452  ops/ms (AVX512, patch)
MinMaxLoopBench.longReductionMax             80   10000  thrpt    8  492.478 ± 0.105  ops/ms (AVX512, patch)
MinMaxLoopBench.longReductionMax            100   10000  thrpt    8  492.365 ± 0.220  ops/ms (AVX512, patch)


Here we see the same thing as on non-AVX512 systems, but the other way around. In the base JDK at 50-80%, the CmpL+Bool gets converted into a CMoveL, which `CMoveNode::Ideal_minmax` then converts to MinL/MaxL nodes, so it behaves just like the patched version. At 100%, base emits a cmp+jump (for the uncommon trap branch), and because of the control flow, vectorization is not applied. The patched version behaves the same regardless of the branch probability.

For completeness, here are the numbers from `longLoopMax`, which tests vectorization of min/max without reduction on AVX-512 (a sketch of the loop shape follows the table). The pattern is the same:


Benchmark                         (probability)  (size)   Mode  Cnt    Score   Error   Units
MinMaxLoopBench.longLoopMax                  50   10000  thrpt    8   66.959 ± 0.426  ops/ms (AVX512, base)
MinMaxLoopBench.longLoopMax                  80   10000  thrpt    8   66.783 ± 0.342  ops/ms (AVX512, base)
MinMaxLoopBench.longLoopMax                 100   10000  thrpt    8   55.923 ± 0.390  ops/ms (AVX512, base)
MinMaxLoopBench.longLoopMax                  50   10000  thrpt    8   67.044 ± 0.535  ops/ms (AVX512, patch)
MinMaxLoopBench.longLoopMax                  80   10000  thrpt    8   66.600 ± 0.176  ops/ms (AVX512, patch)
MinMaxLoopBench.longLoopMax                 100   10000  thrpt    8   66.672 ± 0.205  ops/ms (AVX512, patch)
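
As an illustrative sketch (not the exact JMH code; a, b and c are assumed to be equally sized long arrays), the non-reduction loop shape is roughly:

for (int i = 0; i < a.length; i++) {
    c[i] = Math.max(a[i], b[i]);  // no loop-carried dependency, so it can vectorize
}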


Finally, note that the reduction benchmarks compute the result from a single array. Coming up with a random increment algorithm such that a combination of values from multiple arrays would be higher/lower than the previous one proved quite complex.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2379386872

