RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v2]

Emanuel Peter epeter at openjdk.org
Mon Feb 26 10:34:54 UTC 2024


On Thu, 22 Feb 2024 17:08:01 GMT, Jasmine Karthikeyan <jkarthikeyan at openjdk.org> wrote:

>> Maybe it would be nice if the min/max nodes had some internal probability, so that the backend could decide again in what circumstances it is profitable to turn it into a cmove, or max, or even branch?
>
> Hey @eme64, thank you for leaving feedback on the PR! I've incorporated your suggestions to make the code cleaner, and hopefully it's easier to follow now. I would appreciate it if you could take a look at the new changes!
> 
>> Feel free to make this change (hack), and see if that would get you the desired results:
> 
> I applied this patch and it seems to have worked, here's the latest performance results with the change and changes from master:
> 
>                                                 Baseline                     Patch           Improvement
> Benchmark                   Mode   Cnt  Score      Error  Units    Score       Error  Units
> IfMinMax.testReductionInt   avgt   15  124.153 ±   0.963  ns/op     12.793 ±  1.171  ns/op   + 9.7x
> IfMinMax.testReductionLong  avgt   15  123.241 ±  17.596  ns/op    126.727 ±  0.702  ns/op   (no change)*
> IfMinMax.testSingleInt      avgt   15    0.916 ±   0.009  ns/op      0.913 ±  0.008  ns/op   (no change)
> IfMinMax.testSingleLong     avgt   15    0.911 ±   0.007  ns/op      0.905 ±  0.004  ns/op   (no change)
> IfMinMax.testVectorInt      avgt   15   83.884 ±   1.861  ns/op     14.766 ±  1.166  ns/op   + 5.7x
> IfMinMax.testVectorLong     avgt   15   81.426 ±   0.517  ns/op     28.515 ±  0.480  ns/op   + 2.9x
> 
> 
> I can now measure an improvement with reductions. The difference in the long reduction test seems to be due to the fact that my (Zen 3) device doesn't support AVX-512, which long min/max reductions require on x86 because of `vpminsq`/`vpmaxsq`.
> 
>> From your description and the benchmark results, I think this issue is applied to a general `CMove`. While a `CMove` would simplify the control flow, enable more optimisations, especially vectorisation; a `CMove` may deteriorate the performance if the corresponding branch is predictable. As a result, I don't think only applying the transformations to min/max patterns is a sound solution.
> 
> I definitely agree that this transformation could be further reaching in the patterns it accepts, but I wanted to focus on expanding the set of code that our existing Min/Max optimizations could cover for this patch. I've experimented with `CMove` optimizations before but I've ran into cases where the performance regresses, so I think they need to be done with a lot of care and research. Did you have ideas on what other kinds of patterns you would like to see covered? Thanks!

@jaskarth the code looks cleaner already, good job 😊 

@merykitty @shipilev @jaskarth it seems to me that attacking `CMove` is much more tricky. The branch-predictor is very good on some platforms, and not so good on others. In some cases, misprediction has quite a penalty.

Min/Max is a much simpler pattern, so it could make sense to treat it differently.
- CMove: `(a cmp b) ? c : d`. Assume that `a` and `b` are expensive and take a while to compute. But `c` and `d` are cheap. Assuming the branch is predicatable enough, then we can already speculatively branch and continue without knowing `a` and `b`. A bit later, we can confirm the speculation, or flush the pipeline if it was a misprediction.
- Max/Min: `(a cmp b) ? a : b`. You already need at least one of `a` and `b` to speculatively continue. Hence, the benefit of speculation is less clear here.

Just FYI: I am slowly refactoring SuperWord / AutoVectorizer, and my hope is that I could eventually do if-conversion for loops. That means I would convert CFG to CMove (with min/max as special cases). But that is probably many months in the future. If we can vectorize, then CMove and min/max is often more clearly profitable: we probably execute both branches anyway. For straight-line code, we have tradeoffs with latency vs misprediction costs.

@jaskarth can you construct a benchmark with a `(a cmp b) ? a : b` where you now convert to min/max, and we very often take the `a` branch, and where the `b` value is quite expensive, and the `a` value is very cheap? This would be an example where it might be more profitable to speculate on `a` (cheap), rather than to force computation of `b` before the min/max can be computed. Not sure how easy it is to write such a benchmark, but I'd love to see it ;)

@jaskarth what would it take to do the same for `float/double`? Maybe it is not so easy, I don't remember exactly, but the comparators may have a different semantics compared to min/max, especially for NaN values. But it would be nice if you explained this here.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/17574#issuecomment-1963791996
PR Comment: https://git.openjdk.org/jdk/pull/17574#issuecomment-1963796439


More information about the hotspot-compiler-dev mailing list