RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11]

Emanuel Peter epeter at openjdk.org
Thu Feb 13 11:39:15 UTC 2025


On Mon, 10 Feb 2025 09:26:32 GMT, Galder Zamarreño <galder at openjdk.org> wrote:

>> @eastig is helping with the results on aarch64, so I will verify the numbers in the same way as done below for x86_64 once he provides me with the results.
>> 
>> Here is a summary of the benchmarking results I'm seeing on x86_64 (I will push an update that just merges the latest master shortly).
>> 
>> First I will go through the results of `MinMaxVector`. This benchmark computes throughput by default so the higher the number the better.
>> 
>> # MinMaxVector AVX-512
>> 
>> Following are results with AVX-512 instructions:
>> 
>> Benchmark                       (probability)  (range)  (seed)  (size)   Mode  Cnt   Baseline     Patch   Units
>> MinMaxVector.longClippingRange            N/A       90       0    1000  thrpt    4    834.127  3688.961  ops/ms
>> MinMaxVector.longClippingRange            N/A      100       0    1000  thrpt    4   1147.010  3687.721  ops/ms
>> MinMaxVector.longLoopMax                   50      N/A     N/A    2048  thrpt    4   1126.718  1072.812  ops/ms
>> MinMaxVector.longLoopMax                   80      N/A     N/A    2048  thrpt    4   1070.921  1070.538  ops/ms
>> MinMaxVector.longLoopMax                  100      N/A     N/A    2048  thrpt    4    510.483  1073.081  ops/ms
>> MinMaxVector.longLoopMin                   50      N/A     N/A    2048  thrpt    4    935.658  1016.910  ops/ms
>> MinMaxVector.longLoopMin                   80      N/A     N/A    2048  thrpt    4   1007.410   933.774  ops/ms
>> MinMaxVector.longLoopMin                  100      N/A     N/A    2048  thrpt    4    536.582  1017.337  ops/ms
>> MinMaxVector.longReductionMax              50      N/A     N/A    2048  thrpt    4    967.288   966.945  ops/ms
>> MinMaxVector.longReductionMax              80      N/A     N/A    2048  thrpt    4    967.327   967.382  ops/ms
>> MinMaxVector.longReductionMax             100      N/A     N/A    2048  thrpt    4    849.689   967.327  ops/ms
>> MinMaxVector.longReductionMin              50      N/A     N/A    2048  thrpt    4    966.323   967.275  ops/ms
>> MinMaxVector.longReductionMin              80      N/A     N/A    2048  thrpt    4    967.340   967.228  ops/ms
>> MinMaxVector.longReductionMin             100      N/A     N/A    2048  thrpt    4    880.921   967.233  ops/ms
>> 
>> 
>> ### `longReduction[Min|Max]` performance improves slightly when probability is 100
>> 
>> Without the patch the code uses compare instructions:
>> 
>> 
>>    7.83%  ││││ │││↗  │           0x00007f4f700fb305:   imulq		$0xb, 0x20(%r14, %r8, 8), %rdi
>>           ││││ │││...
>
>> At 100% probability the baseline fails to vectorize because it observes control flow. This control flow is not the one you see in min/max implementations, but one added by HotSpot as a result of JIT profiling. It observes that one branch is always taken, so it optimizes for that and adds a branch for the uncommon case where the branch is not taken.
> 
> I've dug further into this to try to understand how the baseline hotspot code works, and the explanation above is not entirely correct. Let's look at the IR differences between say 100% vs 80% branch situations.
> 
> At branch 80% you see:
> 
>  1115  CountedLoop  === 1115 598 463  [[ 1101 1115 1116 1118 451 594 ]] inner stride: 2 main of N1115 strip mined !orig=[599],[590],[307] !jvms: MinMaxVector::longLoopMax @ bci:10 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124)
> 
>   692  LoadL  === 1083 1101 393  [[ 747 ]]  @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=9; #long (does not depend only on test, unknown control) !orig=[395] !jvms: MinMaxVector::longLoopMax @ bci:26 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124)
>   651  LoadL  === 1095 1101 355  [[ 747 ]]  @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=9; #long (does not depend only on test, unknown control) !orig=[357] !jvms: MinMaxVector::longLoopMax @ bci:20 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124)
>   747  MaxL  === _ 651 692  [[ 451 ]]  !orig=[608],[416] !jvms: Math::max @ bci:11 (line 2037) MinMaxVector::longLoopMax @ bci:27 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124)
> 
>   451  StoreL  === 1115 1101 449 747  [[ 1116 454 911 ]]  @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=9;  Memory: @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=9; !orig=1124 !jvms: MinMaxVector::longLoopMax @ bci:30 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124)
> 
>   594  CountedLoopEnd  === 1115 593  [[ 1123 463 ]] [lt] P=0.999731, C=780799.000000 !orig=[462] !jvms: MinMaxVector::longLoopMax @ bci:7 (line 235) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124)
> 
> 
> You see the counted loop with the LoadL for array loads and MaxL consuming those. The StoreL is for array assignment (I think).
> 
> At branch 100% you see:
> 
> 
>  ...

@galderz Thanks for all the explanations, that's really helpful 😊 

**Discussion**

- AVX512: only improvements.
  - Especially with probability 100, where before we used the bytecode, which would then create an `unstable_if` with an uncommon trap. That meant we could not re-discover the CMove / Max later in the IR. Now that we never inline the bytecode, and instead intrinsify directly, we can use `vpmax`, and that is faster.
  - Ah, maybe that was all incorrect, though it sounded reasonable. You seem to suggest that we actually did inline both branches, but that the issue was that `PhaseIdealLoop::conditional_move` does not like extreme probabilities: it did not convert the 100% cases to CMove, and so they did not vectorize (see the sketch after this list). Right. Getting the probability cutoff just right is a little tricky there, and the precise number can seem strange. But that's a discussion for another day.
  - The reduction case only improves slightly... at least it improves. Maybe we can further improve the throughput with [JDK-8345245](https://bugs.openjdk.org/browse/JDK-8345245) later on.
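
For reference, this is the plain-Java shape of `Math.max(long, long)` that C2 sees when it inlines the bytecode instead of intrinsifying (this is the `java.lang.Math` source, not new code):

```java
// java.lang.Math.max(long, long): when inlined as bytecode, C2 profiles this
// branch. At probability 100 the profile is completely one-sided, and
// PhaseIdealLoop::conditional_move declines such extreme probabilities, so no
// CMoveL is formed and the loop does not vectorize. The intrinsic instead
// emits a MaxL node directly, with no profile-dependent control flow.
public static long max(long a, long b) {
    return (a >= b) ? a : b;
}
```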

- AVX2: mixed results
  - `longReductionMax/Min`: vector max / min is not implemented. We should investigate why.
  - It seems the `MaxVL` and `MinVL` operations (e.g. `vpmaxsq`) only have direct instructions with AVX-512, see [this](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#ig_expand=4669,2611&text=max_epi64).
  - As you suggested, @galderz, we could consider implementing it via `cmove` in the backend for AVX2 and maybe lower (a scalar model is sketched below). Maybe we can talk with @jatin-bhateja about this. That would probably already be worth it on its own, in a separate RFE, because I suspect it could give a speedup in the non-100% cases as well. Maybe that RFE would even have to go in first, so we don't have regressions here?
  - But even so: intrinsifying alone should not cause a regression, because there will always be cases where the auto-vectorizer fails, and the scalar code should not be slower with your patch than on master, right? So we need to investigate this scalar issue as well.
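
On the `cmove` idea: AVX2 does have a 64-bit vector compare (`vpcmpgtq`), so `MaxVL`/`MinVL` could plausibly be lowered to a compare plus blend. A scalar model of what that would compute per 64-bit lane (illustrative only, not actual backend code):

```java
// Per-lane max via compare + blend, the AVX2-style lowering discussed above:
// vpcmpgtq produces an all-ones/all-zeros mask per lane, and a blend
// (e.g. vpblendvb) selects between the two inputs with it.
static long laneMax(long a, long b) {
    long mask = (a > b) ? -1L : 0L;   // vpcmpgtq: all-ones where a > b
    return (a & mask) | (b & ~mask);  // blend: take a where mask set, else b
}
```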

- `VectorReduction2.WithSuperword` on an AVX-512 machine
  - `long[Min|Max]Simple` performance drops considerably. Yes, this case is not yet supposed to vectorize; I'm working on that. It is the issue with "simple" reductions, i.e. those that do no work other than reduce (see the sketch below). Our current reduction heuristic thinks these are not profitable to vectorize, but that is wrong in almost all cases. You even filed an issue for that a while back ;) see https://bugs.openjdk.org/browse/JDK-8345044 and related issues. We could bite the bullet on this, knowing that I'm working on it and that it will probably fix the issue, or we just wait a little here. Let's discuss.
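
To make "simple" concrete, a hedged sketch of the two reduction shapes (the actual benchmark kernels may differ):

```java
// "Simple" reduction: the loop body does nothing besides the reduction itself.
// The current reduction heuristic judges this unprofitable to vectorize.
static long longMaxSimple(long[] a) {
    long m = Long.MIN_VALUE;
    for (int i = 0; i < a.length; i++) {
        m = Math.max(m, a[i]);
    }
    return m;
}

// Non-simple reduction: extra vectorizable work in the loop body; the
// heuristic accepts this shape, and it does vectorize.
static long longMaxWithWork(long[] a) {
    long m = Long.MIN_VALUE;
    for (int i = 0; i < a.length; i++) {
        m = Math.max(m, a[i] * 11);   // extra work besides the reduction
    }
    return m;
}
```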

- `VectorReduction2.NoSuperword` on an AVX-512 machine
  - Hmm, ok. So the scalar case is slower with your patch in some cases, because now we have a `cmove` on the critical path, whereas previously we could just predict the branches, which was faster (illustrated below). It is interesting that the amount of other work in the loop has an effect here as well: you seem to see a speedup with the "big" benchmarks, but the "small" and "dot" benchmarks are slower. This is surprising. It would be great if we understood why it behaves this way.
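
To illustrate the critical-path point, a sketch (not the benchmark code) of the two scalar shapes:

```java
// On master, the inlined bytecode keeps a branch. At extreme probabilities the
// predictor is almost always right, so the rare update of m is the only
// loop-carried work and iterations can overlap.
static long maxBranch(long[] a) {
    long m = Long.MIN_VALUE;
    for (int i = 0; i < a.length; i++) {
        if (a[i] > m) m = a[i];   // predictable branch when one-sided
    }
    return m;
}

// With the intrinsic, MaxL lowers to a cmove/max instruction: no branch to
// mispredict, but each iteration now has a data dependency on the previous m,
// so the loop runs at the latency of that chain, regardless of probability.
static long maxCmove(long[] a) {
    long m = Long.MIN_VALUE;
    for (int i = 0; i < a.length; i++) {
        m = Math.max(m, a[i]);
    }
    return m;
}
```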

**Summary**

Wow, things are more complicated than I would have thought; I hope you are not too discouraged 🙈 

We seem to have these issues, maybe there are more:
- AVX2 does not have long vector min/max implemented. That can be done in a separate RFE.
- Simple reductions do not vectorize; this is a known issue (see https://bugs.openjdk.org/browse/JDK-8345044) that I'm working on.
- Scalar reductions are slower with your patch at extreme probabilities. Before, they were done with branches, and branch prediction was fast. Now, with cmove or max instructions, the critical path is longer, and that makes things slow. Maybe this could be alleviated by reordering / reassociating the reduction path, see [JDK-8345245](https://bugs.openjdk.org/browse/JDK-8345245) and the sketch after this list. Alternatively, we could convert the `cmove` back to a branch, but for that we would probably need to know the branching probability, which we no longer have, right? Tricky. This seems to be the real issue we need to address and discuss.
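
For the reassociation idea: max is associative and commutative, so the single dependency chain can be split into independent accumulators that are merged at the end. A hedged sketch of what a JDK-8345245-style transform corresponds to at the source level:

```java
// Two independent max chains instead of one: the two Math.max calls in each
// iteration do not depend on each other, halving the critical-path length.
static long maxReassociated(long[] a) {
    long m0 = Long.MIN_VALUE, m1 = Long.MIN_VALUE;
    int i = 0;
    for (; i + 1 < a.length; i += 2) {
        m0 = Math.max(m0, a[i]);
        m1 = Math.max(m1, a[i + 1]);
    }
    for (; i < a.length; i++) {   // tail for odd lengths
        m0 = Math.max(m0, a[i]);
    }
    return Math.max(m0, m1);
}
```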

@galderz What do you think?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2656328729

