RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v3]
Emanuel Peter
epeter at openjdk.org
Fri Mar 1 12:43:55 UTC 2024
On Tue, 27 Feb 2024 18:23:41 GMT, Jasmine Karthikeyan <jkarthikeyan at openjdk.org> wrote:
>> @jaskarth
>>> I've designed this benchmark
>>
>> Nice. Can you also post the generated assembly for Baseline/Patch?
>> I'm just worried that there is some method call, or something else that does not get cleanly inlined and could mess with the benchmark.
>
> @eme64 Sure, here is the assembly for the baseline: https://gist.github.com/jaskarth/1fe6f00a5b37fe3efb0dd6a2d24840e0
> And after: https://gist.github.com/jaskarth/99c56e2f081f996987b96d7e866aca6c
>
> I must have missed this originally when evaluating the benchmark, but looking at the assembly it seems like the baseline JDK creates a `CMove` for that ternary already. I made a quick patch to disable where `PhaseIdealLoop::conditional_move` is called, and the performance still stays the same on the benchmark. I've also attached that assembly if it's of interest: https://gist.github.com/jaskarth/7b12b688f82a3b8e854785f1827b0c20
@jaskarth Thanks for trying such a benchmark!
I have a few ideas and questions now.
1. I would like to see a benchmark where you get a regression with your patch if you removed the `PROB_UNLIKELY_MAG(2);` check, or at least make it much smaller. I would like to see if there is some breaking-point where branch prediction is actually faster.
2. You seem to have discivered that your last example was already converted to CMove. What cases does your code cover that is not already covered by the `PhaseIdealLoop::conditional_move` logic?
3. I think you want some code on the `a` path that does not require inlining, just some arithmetic. The longer the chain the better, as it creates large latency. But then you also want something after the if/min/max which has a high latency, so that branch speculation can actually make progress on something, whereas max/min would have to wait until it is finished computing.
I actually have a r**egression case for the current CMove logic**, but it **would apply to your logic in some way I think as well**. See my `testCostDifference` below.
Clean master:
`IfMinMax.testCostDifference avgt 15 889118.284 ± 10638.421 ns/op`
When I disable `PhaseIdealLoop::conditional_move`, without your patch:
`IfMinMax.testCostDifference avgt 15 710629.583 ± 3232.237 ns/op`
Your patch, with `PhaseIdealLoop::conditional_move` disabled:
`IfMinMax.testCostDifference avgt 15 886518.663 ± 1801.308 ns/op`
I think that the CMove logic kicks in for most loops, though maybe not all cases? Would be interesting to know which of your cases were already done by CMove, and which not. And why.
So I suspect you could now take my benchmark, and convert it into non-loop code, and then CMove would not kick in, but your conversion to Max/Min would apply instead. And then you could observe the same regression.
Let me know what you think. Not sure if this regression is important enough, but we need to consider what to do about your patch, as well as the CMove logic that already exists.
@Benchmark
public void testCostDifference(Blackhole blackhole, BenchState state) {
//int hits = 0;
int x = 0xf0f0f0f0; // maybe instead use a random source that is different with every method call?
for (int i = 0; i < 10_000; i++) {
int a = (x ^ 0xffffffff) & 0x07ffffff; // cheap (note: mask affects probability)
int h = x;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
int b = (h & 0x7fffffff); // expensive hashing sequence (node: mask affects probability)
int m = (a < b) ? a : b; // The Min/Max
//hits += (a > b) ? 1 : 0;
//System.out.println("i: " + i + " hits: " + hits + " m: " + m + " a: " + a + " b: " + b);
// Note: the hit probability can be adjusted by changing the masks
// adding or removing the most significant bit has a change of
// about a factor of 2.
// The hashing sequences are there to be expensive, and to randomize the values a bit.
h = m;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
h = (h << 6) + (h << 16) + (h >>> 18) - h;
x = h; // another expensive hashing sequence
}
//System.out.println(10_000 / hits);
blackhole.consume(x);
}
-------------
PR Comment: https://git.openjdk.org/jdk/pull/17574#issuecomment-1973121951
More information about the hotspot-compiler-dev
mailing list