RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v3]

Mon Mar 4 08:21:55 UTC 2024

On Sat, 2 Mar 2024 10:55:41 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> @eme64 Sure, here is the assembly for the baseline: https://gist.github.com/jaskarth/1fe6f00a5b37fe3efb0dd6a2d24840e0
>> And after: https://gist.github.com/jaskarth/99c56e2f081f996987b96d7e866aca6c
>> 
>> I must have missed this originally when evaluating the benchmark, but looking at the assembly it seems like the baseline JDK creates a `CMove` for that ternary already. I made a quick patch to disable where `PhaseIdealLoop::conditional_move` is called, and the performance still stays the same on the benchmark. I've also attached that assembly if it's of interest: https://gist.github.com/jaskarth/7b12b688f82a3b8e854785f1827b0c20
>
> @jaskarth It seems we were aware of such issues a long time ago:
> https://bugs.openjdk.org/browse/JDK-8039104: Don't use Math.min/max intrinsic on x86
> So we may actually have use `if` for min/max instead of CMove, at least on some platforms.
> But some platforms may have worse branch predictors, and then we should use CMove more often.

Hey @eme64, first of all, I want to thank you for your detailed analysis, and the added benchmark! I hope to answer your questions below.

> I would like to see a benchmark where you get a regression with your patch if you removed the `PROB_UNLIKELY_MAG(2);` check, or at least make it much smaller. I would like to see if there is some breaking-point where branch prediction is actually faster.

I think this is a good point as well, I'll try to design a benchmark for this.

> You seem to have discivered that your last example was already converted to CMove. What cases does your code cover that is not already covered by the `PhaseIdealLoop::conditional_move` logic?

In my benchmark, I found that `testSingleInt` wasn't turned into a `CMove`, but after some more investigation I think this is because of a mistake in my benchmark. I mistakenly select the 0th element of the arrays every time, when I should be randomly selecting indices to prevent a side of the branch from being optimized out. When that change is made, it produces a CMove. I also recall finding cases earlier where CMoves in loops weren't created, but I think this must have been before [JDK-8319451](https://bugs.openjdk.org/browse/JDK-8319451) was integrated. I'll keep searching for more cases, but I tried a few examples and couldn't really find any where a minmax was made but the CMove wasn't.

> I think that the CMove logic kicks in for most loops, though maybe not all cases? Would be interesting to know which of your cases were already done by CMove, and which not. And why.

I think looking at the code in `PhaseIdealLoop::conditional_move`, the primary difference is that CMove code has an additional cost metric for loops, whereas is_minmax only has the `PROB_UNLIKELY_MAG(2)` check that the CMove logic uses when not in a loop. I think this might potentially lead to minmax transforming cases in loops that `CMove` might not have- but that may not necessarily be desireable.

> One more general issue: So far you have only shown that your optimization leads to speedups in conjunction with auto-vectorization. Do you have any exmamples which get speedups without auto-vectorization?
The thing is: I do hope to do if-conversion in auto-vectorization. Hence, it would be nice to know that your optimization has benefits in cases where if-conversion does not apply.

I think the primary benefit of this optimization in straight-line code is the tightened bounds of the Min/Max node as compared to the equivalent Phi or `CMove`. If we have `CMove(0, int_bottom)` then its type would be `int_bottom`, as it does a meet over the operands. But if it were a Max instead, its type would be `[0, int_max]`, which is a sharper type. As an example:

int b = ...; // int_bottom
int c = b < 0 ? 0 : b;

if (c < 0) {
  ...; // dead code
}

This example is a bit contrived, but previously that branch would not have been pruned. I found this kind of optimization hard to look for, so I added a temporary field to MaxNode that would only be set to true when MaxNodes were created by is_minmax, and dumped the results of `MaxINode::add_ring` when it was called with the field as true. When running the test suite, I saw there were many cases where this transform was able to create a better type for its operands than an equivalent cmove or phi, and in some cases it was even able to statically determine the operation to be a constant value.

> When I ran `make test TEST="micro:IfMinMax" CONF=linux-x64 MICRO="OPTIONS=-prof perfasm"` and checked the generated assembly, I did not find any vector instructions. Could it be that `SIZE=300` is too small? I generally use vector sizes in the range of `10_000`, just to make sure it vectorizes. Maybe it is because I have a avx512 machine with 64byte registers, compared to 32byte registers for AVX2? Not sure.

That is interesting, as when running that command I see vectorization on my machine, at least with `testVector*`. `testReduction*` still needs that patch you linked earlier to work. I will increase the iteration count as suggested though, in case that is the cause of the discrepancy.

> Let me know what you think. Not sure if this regression is important enough, but we need to consider what to do about your patch, as well as the CMove logic that already exists.

I think it is definitely worth it considering this regression, as I want to ideally minimize regressions at all. With a bit of further reflection on all this, I think it might be best if this patch was changed so that it acts on `CMove` directly, as @merykitty suggested earlier. This would mean we wouldn't need to approximate the `CMove` heuristic in `is_minmax`, and that we would see benefits in tandem with improvements to our `CMove` heuristic. That way if the `CMove` heuristic was changed later to take into account the cost behind the cmp, it would also fix this case. Do you have any thoughts on this @eme64 (and @merykitty)?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/17574#issuecomment-1975971570