RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v3]

Emanuel Peter epeter at openjdk.org
Mon Mar 4 09:02:58 UTC 2024


On Mon, 4 Mar 2024 08:18:41 GMT, Jasmine Karthikeyan <jkarthikeyan at openjdk.org> wrote:

>> @jaskarth It seems we were aware of such issues a long time ago:
>> https://bugs.openjdk.org/browse/JDK-8039104: Don't use Math.min/max intrinsic on x86
>> So we may actually have use `if` for min/max instead of CMove, at least on some platforms.
>> But some platforms may have worse branch predictors, and then we should use CMove more often.
>
> Hey @eme64, first of all, I want to thank you for your detailed analysis, and the added benchmark! I hope to answer your questions below.
> 
>> I would like to see a benchmark where you get a regression with your patch if you removed the `PROB_UNLIKELY_MAG(2);` check, or at least make it much smaller. I would like to see if there is some breaking-point where branch prediction is actually faster.
> 
> I think this is a good point as well, I'll try to design a benchmark for this.
> 
>> You seem to have discivered that your last example was already converted to CMove. What cases does your code cover that is not already covered by the `PhaseIdealLoop::conditional_move` logic?
> 
> In my benchmark, I found that `testSingleInt` wasn't turned into a `CMove`, but after some more investigation I think this is because of a mistake in my benchmark. I mistakenly select the 0th element of the arrays every time, when I should be randomly selecting indices to prevent a side of the branch from being optimized out. When that change is made, it produces a CMove. I also recall finding cases earlier where CMoves in loops weren't created, but I think this must have been before [JDK-8319451](https://bugs.openjdk.org/browse/JDK-8319451) was integrated. I'll keep searching for more cases, but I tried a few examples and couldn't really find any where a minmax was made but the CMove wasn't.
> 
>> I think that the CMove logic kicks in for most loops, though maybe not all cases? Would be interesting to know which of your cases were already done by CMove, and which not. And why.
> 
> I think looking at the code in `PhaseIdealLoop::conditional_move`, the primary difference is that CMove code has an additional cost metric for loops, whereas is_minmax only has the `PROB_UNLIKELY_MAG(2)` check that the CMove logic uses when not in a loop. I think this might potentially lead to minmax transforming cases in loops that `CMove` might not have- but that may not necessarily be desireable.
> 
>> One more general issue: So far you have only shown that your optimization leads to speedups in conjunction with auto-vectorization. Do you have any exmamples which get speedups without auto-vectorization?
> The thing is: I do hope to do if-conversion in auto-vectorization. Hence, it would be nice to know that your optimization has benefits in cases where if-conversion does not apply.
> 
> I think the primary benefit of this optimization in straight-line code is the tightened bounds of the Min/Max node as compared to the equivalent Ph...

@jaskarth 
> With a bit of further reflection on all this, I think it might be best if this patch was changed so that it acts on CMove directly

You mean you would be matching for a `Cmp -> CMove` node pattern that is equivalent for `Min/Max`, rather than matching a `Cmp -> If -> Phi` pattern? I guess that would allow you to get better types, without having to deal with all the CMove-vs-branch-prediction heuristics.

BTW, I watched a fascinating talk about branch-predictors / branchless code yesterday:
`Branchless Programming in C++ - Fedor Pikus - CppCon 2021`
https://www.youtube.com/watch?v=g-WPhYREFjk

My conclusion from that: it is really hard to say ahead of time if the branch-predictor is successful. It depends on how predictable a condition is. The branch-predictor can see patterns (like alternating true-false). So even if a probability is 50% on a branch, it may be fully predictable, and branching code is much more efficient than branchless code. But in totally random cases, branchless code may be faster because you will have a large percentage of mispredictions, and mispredictions are expensive. But in both cases you would see `iff->_prob = 0.5`.
Really what we would need is profiling that checks how much a branch was `mispredicted`, and not how much it was `taken`. But not sure if we can even get that profiling data.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/17574#issuecomment-1976054500


More information about the hotspot-compiler-dev mailing list