RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long)
Galder Zamarreño
galder at openjdk.org
Mon Sep 9 05:10:07 UTC 2024
On Tue, 3 Sep 2024 07:37:33 GMT, Francesco Nigro <duke at openjdk.org> wrote:
>> Working on it
>
> @galderz in the benchmark, did you collect the mispredicts/branches?
@franz1981 No, I hadn't done so until now, but I will be tracking those more closely.
Context:
I have been running some reduction JMH benchmarks and saw a big drop in non-AVX-512 performance compared to the unpatched code. E.g.:
@Benchmark
public long reductionSingleLongMax() {
    long result = 0;
    for (int i = 0; i < size; i++) {
        final long v = 11 * aLong[i];
        result = Math.max(result, v);
    }
    return result;
}
This is caused by keeping the Max/Min nodes in the IR, which get translated into `cmpq+cmovlq` instructions (via the macro expansion). The loop gets unrolled, but there is still a data dependency chain on the current max value. In the unpatched code the intrinsic does not kick in and Math.max falls back to a standard ternary comparison, which gets translated into normal control flow. The system handles this better thanks to branch prediction. @franz1981's comment is precisely about this. I need to enhance the benchmark to control the branchiness of the test (e.g. how often it goes one side or the other of a max/min call) and measure the mispredictions, branches, etc.
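For illustration, here is a rough sketch of what I mean by a branchiness-controlled variant; the `branchProbability` knob and the data shaping in `setup()` are hypothetical, not part of the current benchmark:

@Param({"0.5", "0.9", "1.0"})
public double branchProbability; // hypothetical knob: how often a new max appears

@Setup
public void setup() {
    java.util.Random r = new java.util.Random(42);
    aLong = new long[size];
    for (int i = 0; i < size; i++) {
        // Ascending positives always update the running max; negatives
        // never do, so the max "branch" is taken with roughly
        // branchProbability.
        aLong[i] = r.nextDouble() < branchProbability ? i : -i;
    }
}

@Benchmark
public long reductionSingleLongMaxBranchy() {
    long result = Long.MIN_VALUE;
    for (int i = 0; i < size; i++) {
        final long v = 11 * aLong[i];
        // Plain ternary: C2 keeps this as control flow, so the CPU can
        // speculate past the comparison instead of waiting on a cmov
        // data dependency every iteration.
        result = result > v ? result : v;
    }
    return result;
}

Running something like that with JMH's `-prof perfnorm` should report branches and branch-misses per operation, which is what I plan to track.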
FYI: A similar situation can be replicated with reduction benchmarks that use integer max/min, but for the code to fall back to `cmov`, both AVX and SSE have to be turned off.
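For completeness, the integer counterpart would look something like this (assuming an `anInt` array set up analogously to `aLong`):

@Benchmark
public int reductionSingleIntMax() {
    int result = 0;
    for (int i = 0; i < size; i++) {
        final int v = 11 * anInt[i];
        // Only compiles to cmov once the vectorized paths are disabled.
        result = Math.max(result, v);
    }
    return result;
}

run with vector support disabled via flags along the lines of `-XX:UseAVX=0 -XX:UseSSE=0` (the exact accepted levels may vary by platform).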
I also need to see what the performance looks like on a system with AVX-512, and also look at how non-reduction JMH benchmarks behave on systems with and without AVX-512.
Finally, I'm also looking at an experiment to see what would happen if cmovl were implemented with branch+mov instead.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2337131179