RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v3]

Fri Mar 1 12:43:55 UTC 2024

On Tue, 27 Feb 2024 18:23:41 GMT, Jasmine Karthikeyan <jkarthikeyan at openjdk.org> wrote:

>> @jaskarth 
>>> I've designed this benchmark
>> 
>> Nice. Can you also post the generated assembly for Baseline/Patch?
>> I'm just worried that there is some method call, or something else that does not get cleanly inlined and could mess with the benchmark.
>
> @eme64 Sure, here is the assembly for the baseline: https://gist.github.com/jaskarth/1fe6f00a5b37fe3efb0dd6a2d24840e0
> And after: https://gist.github.com/jaskarth/99c56e2f081f996987b96d7e866aca6c
> 
> I must have missed this originally when evaluating the benchmark, but looking at the assembly it seems like the baseline JDK creates a `CMove` for that ternary already. I made a quick patch to disable where `PhaseIdealLoop::conditional_move` is called, and the performance still stays the same on the benchmark. I've also attached that assembly if it's of interest: https://gist.github.com/jaskarth/7b12b688f82a3b8e854785f1827b0c20

@jaskarth Thanks for trying such a benchmark!

I have a few ideas and questions now.

1. I would like to see a benchmark where you get a regression with your patch if you removed the `PROB_UNLIKELY_MAG(2);` check, or at least make it much smaller. I would like to see if there is some breaking-point where branch prediction is actually faster.
2. You seem to have discivered that your last example was already converted to CMove. What cases does your code cover that is not already covered by the `PhaseIdealLoop::conditional_move` logic?
3. I think you want some code on the `a` path that does not require inlining, just some arithmetic. The longer the chain the better, as it creates large latency. But then you also want something after the if/min/max which has a high latency, so that branch speculation can actually make progress on something, whereas max/min would have to wait until it is finished computing.

I actually have a r**egression case for the current CMove logic**, but it **would apply to your logic in some way I think as well**. See my `testCostDifference` below.

Clean master:
`IfMinMax.testCostDifference      avgt   15  889118.284 ± 10638.421  ns/op`
When I disable `PhaseIdealLoop::conditional_move`, without your patch:
`IfMinMax.testCostDifference      avgt   15  710629.583 ± 3232.237  ns/op`
Your patch, with `PhaseIdealLoop::conditional_move` disabled:
`IfMinMax.testCostDifference      avgt   15  886518.663 ± 1801.308  ns/op`

I think that the CMove logic kicks in for most loops, though maybe not all cases? Would be interesting to know which of your cases were already done by CMove, and which not. And why.

So I suspect you could now take my benchmark, and convert it into non-loop code, and then CMove would not kick in, but your conversion to Max/Min would apply instead. And then you could observe the same regression.

Let me know what you think. Not sure if this regression is important enough, but we need to consider what to do about your patch, as well as the CMove logic that already exists.

    @Benchmark
    public void testCostDifference(Blackhole blackhole, BenchState state) {
        //int hits = 0;
        int x = 0xf0f0f0f0; // maybe instead use a random source that is different with every method call?
        for (int i = 0; i < 10_000; i++) {
            int a = (x ^ 0xffffffff) & 0x07ffffff; // cheap (note: mask affects probability)

            int h = x;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            int b = (h & 0x7fffffff); // expensive hashing sequence (node: mask affects probability)

            int m = (a < b) ? a : b; // The Min/Max
            //hits += (a > b) ? 1 : 0;
            //System.out.println("i: " + i + " hits: " + hits + " m: " + m + " a: " + a + " b: " + b);

            // Note: the hit probability can be adjusted by changing the masks
            //       adding or removing the most significant bit has a change of
            //       about a factor of 2.

            // The hashing sequences are there to be expensive, and to randomize the values a bit.

            h = m;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            x = h; // another expensive hashing sequence
        }
        //System.out.println(10_000 / hits);

        blackhole.consume(x);
    }

-------------

PR Comment: https://git.openjdk.org/jdk/pull/17574#issuecomment-1973121951