RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long)
Galder Zamarreño
galder at openjdk.org
Mon Sep 9 05:10:07 UTC 2024
On Tue, 3 Sep 2024 07:37:33 GMT, Francesco Nigro <duke at openjdk.org> wrote:
>> Working on it
>
> @galderz in the benchmark, did you collect the mispredicts/branches?
@franz1981 No, I hadn't done so until now, but I will be tracking those more closely.
Context:
I have been running some reduction JMH benchmarks and saw a big drop in non-AVX-512 performance compared to the unpatched code. E.g.:
@Benchmark
public long reductionSingleLongMax() {
    long result = 0;
    for (int i = 0; i < size; i++) {
        final long v = 11 * aLong[i];
        result = Math.max(result, v);
    }
    return result;
}
This is caused by keeping the Max/Min nodes in the IR, which get translated into `cmpq+cmovlq` instructions (via the macro expansion). The loop gets unrolled, but there is still a data dependency chain on the current max value. In the unpatched code the intrinsic does not kick in and Math.max falls back to a standard ternary comparison, which gets translated into normal control flow. The system handles this better thanks to branch prediction. @franz1981's comment is precisely about this. I need to enhance the benchmark to control the branchiness of the test (e.g. how often it goes one side or the other of a max/min call) and measure the mispredictions, branches, etc.
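For illustration, here is a rough sketch of what I mean by a branchiness-controlled variant; the `branchProbability` knob and the data shaping in `setup()` are hypothetical, not part of the current benchmark:

@Param({"0.5", "0.9", "1.0"})
public double branchProbability; // hypothetical knob: how often a new max appears

@Setup
public void setup() {
    java.util.Random r = new java.util.Random(42);
    aLong = new long[size];
    for (int i = 0; i < size; i++) {
        // Ascending positives always update the running max; negatives
        // never do, so the max "branch" is taken with roughly
        // branchProbability.
        aLong[i] = r.nextDouble() < branchProbability ? i : -i;
    }
}

@Benchmark
public long reductionSingleLongMaxBranchy() {
    long result = Long.MIN_VALUE;
    for (int i = 0; i < size; i++) {
        final long v = 11 * aLong[i];
        // Plain ternary: C2 keeps this as control flow, so the CPU can
        // speculate past the comparison instead of waiting on a cmov
        // data dependency every iteration.
        result = result > v ? result : v;
    }
    return result;
}

Running something like that with JMH's `-prof perfnorm` should report branches and branch-misses per operation, which is what I plan to track.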
FYI: A similar situation can be replicated with reduction benchmarks that use integer max/min, but for the code to fall back to `cmov`, both AVX and SSE have to be turned off.
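For completeness, the integer counterpart would look something like this (assuming an `anInt` array set up analogously to `aLong`):

@Benchmark
public int reductionSingleIntMax() {
    int result = 0;
    for (int i = 0; i < size; i++) {
        final int v = 11 * anInt[i];
        // Only compiles to cmov once the vectorized paths are disabled.
        result = Math.max(result, v);
    }
    return result;
}

run with vector support disabled via flags along the lines of `-XX:UseAVX=0 -XX:UseSSE=0` (the exact accepted levels may vary by platform).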
I also need to see what the performance looks like on a system with AVX-512, and also look at how non-reduction JMH benchmarks behave on systems with and without AVX-512.
Finally, I'm also looking at an experiment to see what would happen if cmovl were implemented with branch+mov instead.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2337131179