RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v2]
Emanuel Peter
epeter at openjdk.org
Mon Feb 26 10:34:54 UTC 2024
On Thu, 22 Feb 2024 17:08:01 GMT, Jasmine Karthikeyan <jkarthikeyan at openjdk.org> wrote:
>> Maybe it would be nice if the min/max nodes had some internal probability, so that the backend could decide again in what circumstances it is profitable to turn it into a cmove, or max, or even branch?
>
> Hey @eme64, thank you for leaving feedback on the PR! I've incorporated your suggestions to make the code cleaner, and hopefully it's easier to follow now. I would appreciate it if you could take a look at the new changes!
>
>> Feel free to make this change (hack), and see if that would get you the desired results:
>
> I applied this patch and it seems to have worked, here's the latest performance results with the change and changes from master:
>
> Baseline Patch Improvement
> Benchmark Mode Cnt Score Error Units Score Error Units
> IfMinMax.testReductionInt avgt 15 124.153 ± 0.963 ns/op 12.793 ± 1.171 ns/op + 9.7x
> IfMinMax.testReductionLong avgt 15 123.241 ± 17.596 ns/op 126.727 ± 0.702 ns/op (no change)*
> IfMinMax.testSingleInt avgt 15 0.916 ± 0.009 ns/op 0.913 ± 0.008 ns/op (no change)
> IfMinMax.testSingleLong avgt 15 0.911 ± 0.007 ns/op 0.905 ± 0.004 ns/op (no change)
> IfMinMax.testVectorInt avgt 15 83.884 ± 1.861 ns/op 14.766 ± 1.166 ns/op + 5.7x
> IfMinMax.testVectorLong avgt 15 81.426 ± 0.517 ns/op 28.515 ± 0.480 ns/op + 2.9x
>
>
> I can now measure an improvement with reductions. The difference in the long reduction test seems to be due to the fact that my (Zen 3) device doesn't support AVX-512, which long min/max reductions require on x86 because of `vpminsq`/`vpmaxsq`.
>
>> From your description and the benchmark results, I think this issue is applied to a general `CMove`. While a `CMove` would simplify the control flow, enable more optimisations, especially vectorisation; a `CMove` may deteriorate the performance if the corresponding branch is predictable. As a result, I don't think only applying the transformations to min/max patterns is a sound solution.
>
> I definitely agree that this transformation could be further reaching in the patterns it accepts, but I wanted to focus on expanding the set of code that our existing Min/Max optimizations could cover for this patch. I've experimented with `CMove` optimizations before but I've ran into cases where the performance regresses, so I think they need to be done with a lot of care and research. Did you have ideas on what other kinds of patterns you would like to see covered? Thanks!
@jaskarth the code looks cleaner already, good job 😊
@merykitty @shipilev @jaskarth it seems to me that attacking `CMove` is much more tricky. The branch-predictor is very good on some platforms, and not so good on others. In some cases, misprediction has quite a penalty.
Min/Max is a much simpler pattern, so it could make sense to treat it differently.
- CMove: `(a cmp b) ? c : d`. Assume that `a` and `b` are expensive and take a while to compute. But `c` and `d` are cheap. Assuming the branch is predicatable enough, then we can already speculatively branch and continue without knowing `a` and `b`. A bit later, we can confirm the speculation, or flush the pipeline if it was a misprediction.
- Max/Min: `(a cmp b) ? a : b`. You already need at least one of `a` and `b` to speculatively continue. Hence, the benefit of speculation is less clear here.
Just FYI: I am slowly refactoring SuperWord / AutoVectorizer, and my hope is that I could eventually do if-conversion for loops. That means I would convert CFG to CMove (with min/max as special cases). But that is probably many months in the future. If we can vectorize, then CMove and min/max is often more clearly profitable: we probably execute both branches anyway. For straight-line code, we have tradeoffs with latency vs misprediction costs.
@jaskarth can you construct a benchmark with a `(a cmp b) ? a : b` where you now convert to min/max, and we very often take the `a` branch, and where the `b` value is quite expensive, and the `a` value is very cheap? This would be an example where it might be more profitable to speculate on `a` (cheap), rather than to force computation of `b` before the min/max can be computed. Not sure how easy it is to write such a benchmark, but I'd love to see it ;)
@jaskarth what would it take to do the same for `float/double`? Maybe it is not so easy, I don't remember exactly, but the comparators may have a different semantics compared to min/max, especially for NaN values. But it would be nice if you explained this here.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/17574#issuecomment-1963791996
PR Comment: https://git.openjdk.org/jdk/pull/17574#issuecomment-1963796439
More information about the hotspot-compiler-dev
mailing list