RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11]

Emanuel Peter epeter at openjdk.org
Thu Feb 13 11:49:18 UTC 2025


On Mon, 10 Feb 2025 09:26:32 GMT, Galder Zamarreño <galder at openjdk.org> wrote:

>> @eastig is helping with the results on aarch64, so I will verify the numbers in the same way as done below for x86_64 once he provides me with the results.
>> 
>> Here is a summary of the benchmarking results I'm seeing on x86_64 (I will push an update that just merges the latest master shortly).
>> 
>> First I will go through the results of `MinMaxVector`. This benchmark computes throughput by default so the higher the number the better.
>> 
>> # MinMaxVector AVX-512
>> 
>> Following are results with AVX-512 instructions:
>> 
>> Benchmark                       (probability)  (range)  (seed)  (size)   Mode  Cnt   Baseline     Patch   Units
>> MinMaxVector.longClippingRange            N/A       90       0    1000  thrpt    4    834.127  3688.961  ops/ms
>> MinMaxVector.longClippingRange            N/A      100       0    1000  thrpt    4   1147.010  3687.721  ops/ms
>> MinMaxVector.longLoopMax                   50      N/A     N/A    2048  thrpt    4   1126.718  1072.812  ops/ms
>> MinMaxVector.longLoopMax                   80      N/A     N/A    2048  thrpt    4   1070.921  1070.538  ops/ms
>> MinMaxVector.longLoopMax                  100      N/A     N/A    2048  thrpt    4    510.483  1073.081  ops/ms
>> MinMaxVector.longLoopMin                   50      N/A     N/A    2048  thrpt    4    935.658  1016.910  ops/ms
>> MinMaxVector.longLoopMin                   80      N/A     N/A    2048  thrpt    4   1007.410   933.774  ops/ms
>> MinMaxVector.longLoopMin                  100      N/A     N/A    2048  thrpt    4    536.582  1017.337  ops/ms
>> MinMaxVector.longReductionMax              50      N/A     N/A    2048  thrpt    4    967.288   966.945  ops/ms
>> MinMaxVector.longReductionMax              80      N/A     N/A    2048  thrpt    4    967.327   967.382  ops/ms
>> MinMaxVector.longReductionMax             100      N/A     N/A    2048  thrpt    4    849.689   967.327  ops/ms
>> MinMaxVector.longReductionMin              50      N/A     N/A    2048  thrpt    4    966.323   967.275  ops/ms
>> MinMaxVector.longReductionMin              80      N/A     N/A    2048  thrpt    4    967.340   967.228  ops/ms
>> MinMaxVector.longReductionMin             100      N/A     N/A    2048  thrpt    4    880.921   967.233  ops/ms
>> 
>> 
>> ### `longReduction[Min|Max]` performance improves slightly when probability is 100
>> 
>> Without the patch the code uses compare instructions:
>> 
>> 
>>    7.83%  ││││ │││↗  │           0x00007f4f700fb305:   imulq		$0xb, 0x20(%r14, %r8, 8), %rdi
>>           ││││ │││...
>
>> At 100% probability the baseline fails to vectorize because it observes control flow. This control flow is not the one you see in min/max implementations; rather, it is one added by HotSpot as a result of JIT profiling. It observes that one branch is always taken, so it optimizes for that and adds a branch for the uncommon case where the usual branch is not taken.
> 
> I've dug further into this to try to understand how the baseline hotspot code works, and the explanation above is not entirely correct. Let's look at the IR differences between say 100% vs 80% branch situations.
> 
> At branch 80% you see:
> 
>  1115  CountedLoop  === 1115 598 463  [[ 1101 1115 1116 1118 451 594 ]] inner stride: 2 main of N1115 strip mined !orig=[599],[590],[307] !jvms: MinMaxVector::longLoopMax @ bci:10 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124)
> 
>   692  LoadL  === 1083 1101 393  [[ 747 ]]  @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=9; #long (does not depend only on test, unknown control) !orig=[395] !jvms: MinMaxVector::longLoopMax @ bci:26 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124)
>   651  LoadL  === 1095 1101 355  [[ 747 ]]  @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=9; #long (does not depend only on test, unknown control) !orig=[357] !jvms: MinMaxVector::longLoopMax @ bci:20 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124)
>   747  MaxL  === _ 651 692  [[ 451 ]]  !orig=[608],[416] !jvms: Math::max @ bci:11 (line 2037) MinMaxVector::longLoopMax @ bci:27 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124)
> 
>   451  StoreL  === 1115 1101 449 747  [[ 1116 454 911 ]]  @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=9;  Memory: @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=9; !orig=1124 !jvms: MinMaxVector::longLoopMax @ bci:30 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124)
> 
>   594  CountedLoopEnd  === 1115 593  [[ 1123 463 ]] [lt] P=0.999731, C=780799.000000 !orig=[462] !jvms: MinMaxVector::longLoopMax @ bci:7 (line 235) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124)
> 
> 
> You see the counted loop with the LoadL for array loads and MaxL consuming those. The StoreL is for array assignment (I think).
> 
> At branch 100% you see:
> 
> 
>  ...

@galderz How sure are we that intrinsifying directly is really the right approach?

Maybe the approach via `PhaseIdealLoop::conditional_move`, where we know the branching probability, is a better one. Of course, knowing the branching probability is no perfect heuristic for how good branch prediction is going to be, but it is at least something.
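For context, `Math.max(long, long)` inlines to a plain branch, and a CMoveL lowering replaces that branch with a data-dependent select. A minimal sketch of the two forms (the branch-free variant is an illustration of what a cmov computes, not C2's actual lowering; `maxBranchy`/`maxBranchless` are hypothetical names):

```java
public class MaxForms {
    // What C2 sees after inlining java.lang.Math.max(long, long):
    static long maxBranchy(long a, long b) {
        return (a >= b) ? a : b;
    }

    // Branch-free equivalent, roughly what a CMoveL/cmov computes:
    // mask is all-ones when a >= b, all-zeros otherwise, so the
    // result selects a or b without a control-flow split.
    static long maxBranchless(long a, long b) {
        long mask = -((a >= b) ? 1L : 0L);
        return b ^ ((a ^ b) & mask);
    }

    public static void main(String[] args) {
        long[] xs = {Long.MIN_VALUE, -1L, 0L, 1L, Long.MAX_VALUE};
        for (long x : xs)
            for (long y : xs)
                if (maxBranchy(x, y) != maxBranchless(x, y))
                    throw new AssertionError(x + ", " + y);
        System.out.println("ok");
    }
}
```

Whether the branchy or the select form wins at runtime depends on how predictable the branch is, which is exactly the information `conditional_move` consults.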

So I'm wondering if there could be a different approach that sees all the wins you get here, without any of the regressions?

If we are just interested in better vectorization: the current issue is that the auto-vectorizer cannot handle control flow in the loop body, i.e. we do not yet do if-conversion. If we had if-conversion, the inlined CFG of min/max would simply be converted to vector CMove (or vector min/max where available) at that point. We could take the branching probabilities into account, just like `PhaseIdealLoop::conditional_move` does, if that is necessary. Of course, if-conversion is far away, and we will encounter a lot of issues with branch prediction etc., so I'm scared we might never get there - but I want to try ;)
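For illustration, the kind of loop at stake looks like the clamp pattern below (a sketch; the exact `MinMaxVector.longClippingRange` benchmark body is an assumption, and `clip` is a hypothetical name). With MaxL/MinL as branch-free IR nodes the loop can vectorize to vector min/max (e.g. vpmaxsq/vpminsq on AVX-512); without them, the inlined branches block the auto-vectorizer unless if-conversion rewrites them into selects:

```java
import java.util.Arrays;

public class ClipLoop {
    // Clamp each element of src into [lo, hi].
    static void clip(long[] src, long[] dst, long lo, long hi) {
        for (int i = 0; i < src.length; i++) {
            // Two branches after inlining, or two MaxL/MinL nodes
            // when the intrinsic applies.
            dst[i] = Math.min(Math.max(src[i], lo), hi);
        }
    }

    public static void main(String[] args) {
        long[] src = {-5, 0, 3, 42, 1000};
        long[] dst = new long[src.length];
        clip(src, dst, 0, 100);
        System.out.println(Arrays.toString(dst));
    }
}
```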

Do we see any other wins with your patch that are not due to vectorization, but just to scalar code?

@galderz Maybe we can discuss this offline at some point as well :)

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2656350896
PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2656351785


More information about the core-libs-dev mailing list