RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v2]
Galder Zamarreño
galder at openjdk.org
Fri Sep 27 14:18:41 UTC 2024
On Wed, 17 Jul 2024 22:48:04 GMT, Jasmine Karthikeyan <jkarthikeyan at openjdk.org> wrote:
>>> The C2 changes look nice! I just added one comment here about style. It would also be good to add some IR tests checking that the intrinsic is creating `MaxL`/`MinL` nodes before macro expansion, and a microbenchmark to compare results.
>>
>> Thanks for the review. +1 to the IR tests, I'll work on those.
>>
>> Re: microbenchmark - what do you have exactly in mind? For vectorization performance there is `ReductionPerf` though it's not a microbenchmark per se. Do you want a microbenchmark for the performance of vectorized max/min long? For non-vectorization performance there is `MathBench`.
>>
>> I would not expect performance differences in `MathBench` because the backend is still the same and this change really benefits vectorization. I've run the min/max long tests on darwin/aarch64 and linux/x64 and indeed I see no difference:
>>
>> linux/x64
>>
>> Benchmark          (seed)   Mode  Cnt        Score       Error   Units
>> MathBench.maxLong       0  thrpt    8  1464197.164 ± 27044.205  ops/ms  # base
>> MathBench.minLong       0  thrpt    8  1469917.328 ± 25397.401  ops/ms  # base
>> MathBench.maxLong       0  thrpt    8  1469615.250 ± 17950.429  ops/ms  # patched
>> MathBench.minLong       0  thrpt    8  1456290.514 ± 44455.727  ops/ms  # patched
>>
>>
>> darwin/aarch64
>>
>> Benchmark          (seed)   Mode  Cnt        Score        Error   Units
>> MathBench.maxLong       0  thrpt    8  1739341.447 ± 210983.444  ops/ms  # base
>> MathBench.minLong       0  thrpt    8  1659547.649 ± 260554.159  ops/ms  # base
>> MathBench.maxLong       0  thrpt    8  1660449.074 ± 254534.725  ops/ms  # patched
>> MathBench.minLong       0  thrpt    8  1729728.021 ±  16327.575  ops/ms  # patched
>
>> Do you want a microbenchmark for the performance of vectorized max/min long?
>
> Yeah, I think a simple benchmark that tests for long min/max vectorization and reduction would be good. I worry that checking performance manually like in `ReductionPerf` can lead to harder-to-interpret results than with a microbenchmark, especially with VM warmup 😅 Thanks for looking into this!
Following the advice from @jaskarth, I've worked on a JMH benchmark for this intrinsic.
The benchmarks themselves are pretty straightforward, but the data in the arrays is distributed so that the branch percentage can be controlled. The code uses a random increment/decrement algorithm to distribute the data for testing max; to test min, the values are negated. Controlling the branching is important because the IR/assembly C2 emits can vary depending on the branch characteristics.
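To illustrate the idea, here is a minimal sketch of such a generator (the method name, seed handling, and increment sizes are mine, not the actual benchmark code): by tracking the running maximum, the generator decides per element whether the "new max" side of the branch will be taken.

```java
import java.util.Random;

// Hypothetical sketch, not the benchmark's actual code: fills an array so
// that a running-max reduction over it takes the "new max" branch on
// roughly `probability` percent of the elements. Negate the values to
// obtain the equivalent distribution for testing min.
static long[] fillForMax(int size, int probability, long seed) {
    Random rnd = new Random(seed);
    long[] a = new long[size];
    long runningMax = 0;
    for (int i = 0; i < size; i++) {
        if (rnd.nextInt(100) < probability) {
            runningMax += 1 + rnd.nextInt(16); // exceeds all earlier elements: branch taken
            a[i] = runningMax;
        } else {
            a[i] = runningMax - (1 + rnd.nextInt(16)); // below the running max: branch not taken
        }
    }
    return a;
}
```

By construction, each element either exceeds every earlier element or sits below the running maximum, so the taken/not-taken ratio of the branch inside a max-reduction loop roughly tracks `probability`.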
First, the non-AVX512 results (only posting max results for brevity; min shows the same pattern):
Benchmark                         (probability)  (size)   Mode  Cnt    Score   Error   Units
MinMaxLoopBench.longReductionMax             50   10000  thrpt    8  107.609 ± 0.149  ops/ms  (non-AVX512, base)
MinMaxLoopBench.longReductionMax             80   10000  thrpt    8  107.627 ± 0.150  ops/ms  (non-AVX512, base)
MinMaxLoopBench.longReductionMax            100   10000  thrpt    8  238.799 ± 5.028  ops/ms  (non-AVX512, base)
MinMaxLoopBench.longReductionMax             50   10000  thrpt    8  107.575 ± 0.088  ops/ms  (non-AVX512, patch)
MinMaxLoopBench.longReductionMax             80   10000  thrpt    8  107.594 ± 0.072  ops/ms  (non-AVX512, patch)
MinMaxLoopBench.longReductionMax            100   10000  thrpt    8  107.514 ± 0.067  ops/ms  (non-AVX512, patch)
The only situation where this PR regresses compared to the current code is when one side of the branch is always taken. Why is this? At 50% and 80%, the base code uses cmovlq, but at 100% it uses cmp+jump (the branch leads to an uncommon trap). With the intrinsic this patch adds, a MinL/MaxL node is always created, and macro expansion always turns it into cmovlq.
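To make that concrete, the reduction loop under test has roughly this shape (a sketch with hypothetical class and field names, not the benchmark's exact code); the `Math.max` call is where C2 chooses between cmovlq and cmp+jump:

```java
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class LongMaxReductionSketch {
    @Param({"50", "80", "100"})
    int probability;

    long[] longs;

    @Setup
    public void setup() {
        // fillForMax is the hypothetical generator sketched earlier.
        longs = fillForMax(10_000, probability, 0L);
    }

    @Benchmark
    public long longReductionMax() {
        long max = Long.MIN_VALUE;
        for (int i = 0; i < longs.length; i++) {
            // Base JDK: compiled from profiled bytecode, cmovlq at mixed
            // probabilities, cmp+jump plus an uncommon trap at 100%.
            // Patched: always a MaxL node, macro-expanded to cmovlq when
            // the loop is not vectorized.
            max = Math.max(max, longs[i]);
        }
        return max;
    }
}
```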
Next, the AVX-512 results (note that the results were taken on different machines, so the non-AVX-512 and AVX-512 numbers cannot be compared):
Benchmark                         (probability)  (size)   Mode  Cnt    Score   Error   Units
MinMaxLoopBench.longReductionMax             50   10000  thrpt    8  492.327 ± 0.106  ops/ms  (AVX512, base)
MinMaxLoopBench.longReductionMax             80   10000  thrpt    8  492.515 ± 0.044  ops/ms  (AVX512, base)
MinMaxLoopBench.longReductionMax            100   10000  thrpt    8  232.861 ± 5.859  ops/ms  (AVX512, base)
MinMaxLoopBench.longReductionMax             50   10000  thrpt    8  492.563 ± 0.452  ops/ms  (AVX512, patch)
MinMaxLoopBench.longReductionMax             80   10000  thrpt    8  492.478 ± 0.105  ops/ms  (AVX512, patch)
MinMaxLoopBench.longReductionMax            100   10000  thrpt    8  492.365 ± 0.220  ops/ms  (AVX512, patch)
Here we see the same effect as on non-AVX512 systems, but the other way around. In the base JDK at 50-80%, the CmpL+Bool gets converted into a CMoveL, which `CMoveNode::Ideal_minmax` then converts into MinL/MaxL nodes, so base behaves just like the patched version. At 100%, base emits a cmp+jump (for the uncommon trap branch), and because of the control flow, vectorization is not applied. The patched version behaves the same regardless of the branch probability.
For completeness, here are the numbers for `longLoopMax`, which tests vectorization of min/max without a reduction, on AVX-512. The pattern is the same (a rough sketch of the loop shape follows the numbers):
Benchmark                    (probability)  (size)   Mode  Cnt   Score   Error   Units
MinMaxLoopBench.longLoopMax             50   10000  thrpt    8  66.959 ± 0.426  ops/ms  (AVX512, base)
MinMaxLoopBench.longLoopMax             80   10000  thrpt    8  66.783 ± 0.342  ops/ms  (AVX512, base)
MinMaxLoopBench.longLoopMax            100   10000  thrpt    8  55.923 ± 0.390  ops/ms  (AVX512, base)
MinMaxLoopBench.longLoopMax             50   10000  thrpt    8  67.044 ± 0.535  ops/ms  (AVX512, patch)
MinMaxLoopBench.longLoopMax             80   10000  thrpt    8  66.600 ± 0.176  ops/ms  (AVX512, patch)
MinMaxLoopBench.longLoopMax            100   10000  thrpt    8  66.672 ± 0.205  ops/ms  (AVX512, patch)
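The non-reduction loop would look roughly like this (again a sketch; `a`, `b`, `out`, and `size` are hypothetical `long[]`/`int` fields on the state class above, not names from the actual benchmark):

```java
    @Benchmark
    public void longLoopMax() {
        // Element-wise max with no loop-carried dependency: with the
        // intrinsic in place the auto-vectorizer can emit vector max
        // instructions regardless of the branch probability.
        for (int i = 0; i < size; i++) {
            out[i] = Math.max(a[i], b[i]);
        }
    }
```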
Finally, note that the reduction benchmarks use only one array to compute the value. Coming up with a random increment algorithm in which a combination of values from multiple arrays would be higher/lower than the previous result turned out to be quite complex.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2379386872