RFR: 8296545: C2 Blackholes should allow load optimizations [v2]
Vladimir Ivanov
vlivanov at openjdk.org
Wed Nov 9 20:33:29 UTC 2022
On Wed, 9 Nov 2022 11:57:45 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:
>> If you look at generated code for the JMH benchmark like:
>>
>>
>> public class ArrayRead {
>>     @Param({"1", "100", "10000", "1000000"})
>>     int size;
>>
>>     int[] is;
>>
>>     @Setup
>>     public void setup() {
>>         is = new int[size];
>>         for (int c = 0; c < size; c++) {
>>             is[c] = c;
>>         }
>>     }
>>
>>     @Benchmark
>>     public void test(Blackhole bh) {
>>         for (int i = 0; i < is.length; i++) {
>>             bh.consume(is[i]);
>>         }
>>     }
>> }
>>
>>
>> ...then you would notice that the loop always re-reads `is`, `is.length`, does the range check, etc. -- all the things we would otherwise expect to be hoisted out of the loop.
>>
>> This is because C2 blackholes are modeled as membars that pinch both control and memory slices (like you would expect from the opaque non-inlined call), therefore every iteration has to re-read the referenced memory contents and recompute everything dependent on those loads. This behavior is not new -- the old, non-compiler blackholes were doing the same thing, accidentally -- but it was drowned in blackhole overheads. Now, these effects are clearly visible.
>>
>> We can try to do this a bit better: allow load optimizations to work across the blackholes, leaving only the "prevent dead code elimination" part, as minimally required by blackhole semantics.
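In plain Java terms, the transformation this enables is roughly equivalent to hoisting the invariant loads out of the loop by hand. A sketch (not the actual compiler transformation; the class and method names are illustrative):

```java
public class HoistSketch {
    static int[] is = {0, 1, 2, 3};

    // Before: with the blackhole modeled as a full membar, every iteration
    // conceptually re-reads the 'is' field and its length.
    static int sumRereading() {
        int s = 0;
        for (int i = 0; i < is.length; i++) {
            s += is[i];
        }
        return s;
    }

    // After: the loads of 'is' and 'is.length' may float above the loop,
    // as if the code had been written like this by hand.
    static int sumHoisted() {
        int[] a = is;      // field load hoisted out of the loop
        int n = a.length;  // length load hoisted out of the loop
        int s = 0;
        for (int i = 0; i < n; i++) {
            s += a[i];
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(sumRereading() + " " + sumHoisted());
    }
}
```

Both versions compute the same result; the difference is only in how much work is redone per iteration.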
>>
>> Motivational improvements on the test above:
>>
>>
>> Benchmark              (size)  Mode  Cnt        Score      Error  Units
>>
>> # Before, full Java blackholes
>> ArrayRead.test              1  avgt    9        5.422 ±    0.023  ns/op
>> ArrayRead.test            100  avgt    9      460.619 ±    0.421  ns/op
>> ArrayRead.test          10000  avgt    9    44697.909 ± 1964.787  ns/op
>> ArrayRead.test        1000000  avgt    9  4332723.304 ± 2791.324  ns/op
>>
>> # Before, compiler blackholes
>> ArrayRead.test              1  avgt    9        1.791 ±    0.007  ns/op
>> ArrayRead.test            100  avgt    9      114.103 ±    1.677  ns/op
>> ArrayRead.test          10000  avgt    9     8528.544 ±   52.010  ns/op
>> ArrayRead.test        1000000  avgt    9  1005139.070 ± 2883.011  ns/op
>>
>> # After, compiler blackholes
>> ArrayRead.test              1  avgt    9        1.686 ±    0.006  ns/op  ; ~1.1x better
>> ArrayRead.test            100  avgt    9       16.249 ±    0.019  ns/op  ; ~7.0x better
>> ArrayRead.test          10000  avgt    9     1375.265 ±    2.420  ns/op  ; ~6.2x better
>> ArrayRead.test        1000000  avgt    9   136862.574 ± 1057.100  ns/op  ; ~7.3x better
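The quoted improvement factors follow directly from the "compiler blackholes" score columns. A quick arithmetic check (class name is illustrative; the numbers are copied from the tables above):

```java
import java.util.Locale;

public class SpeedupCheck {
    public static void main(String[] args) {
        // Scores (ns/op) for sizes 1, 100, 10000, 1000000.
        double[] before = {1.791, 114.103, 8528.544, 1005139.070};
        double[] after  = {1.686,  16.249, 1375.265,  136862.574};
        for (int i = 0; i < before.length; i++) {
            // Locale.ROOT keeps the decimal point independent of the host locale.
            System.out.printf(Locale.ROOT, "~%.1fx better%n", before[i] / after[i]);
        }
    }
}
```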
>>
>>
>> `-prof perfasm` shows the reason for these improvements clearly:
>>
>> Before:
>>
>>
>> ↗ 0x00007f0b54498360: mov 0xc(%r12,%r10,8),%edx ; range check 1
>> 7.97% │ 0x00007f0b54498365: cmp %edx,%r11d
>> 1.27% │ 0x00007f0b54498368: jae 0x00007f0b5449838f
>> │ 0x00007f0b5449836a: shl $0x3,%r10
>> 0.03% │ 0x00007f0b5449836e: mov 0x10(%r10,%r11,4),%r10d ; get "is[i]"
>> 7.76% │ 0x00007f0b54498373: mov 0x10(%r9),%r10d ; restore "is"
>> 0.24% │ 0x00007f0b54498377: mov 0x3c0(%r15),%rdx ; safepoint poll, part 1
>> 17.48% │ 0x00007f0b5449837e: inc %r11d ; i++
>> 0.17% │ 0x00007f0b54498381: test %eax,(%rdx) ; safepoint poll, part 2
>> 53.26% │ 0x00007f0b54498383: mov 0xc(%r12,%r10,8),%edx ; loop index check
>> 4.84% │ 0x00007f0b54498388: cmp %edx,%r11d
>> 0.31% ╰ 0x00007f0b5449838b: jl 0x00007f0b54498360
>>
>>
>> After:
>>
>>
>>
>> ↗ 0x00007fa06c49a8b0: mov 0x2c(%rbp,%r10,4),%r9d ; stride read
>> 19.66% │ 0x00007fa06c49a8b5: mov 0x28(%rbp,%r10,4),%edx
>> 0.14% │ 0x00007fa06c49a8ba: mov 0x10(%rbp,%r10,4),%ebx
>> 22.09% │ 0x00007fa06c49a8bf: mov 0x14(%rbp,%r10,4),%ebx
>> 0.21% │ 0x00007fa06c49a8c4: mov 0x18(%rbp,%r10,4),%ebx
>> 20.19% │ 0x00007fa06c49a8c9: mov 0x1c(%rbp,%r10,4),%ebx
>> 0.04% │ 0x00007fa06c49a8ce: mov 0x20(%rbp,%r10,4),%ebx
>> 24.02% │ 0x00007fa06c49a8d3: mov 0x24(%rbp,%r10,4),%ebx
>> 0.21% │ 0x00007fa06c49a8d8: add $0x8,%r10d ; i += 8
>> │ 0x00007fa06c49a8dc: cmp %esi,%r10d
>> 0.07% ╰ 0x00007fa06c49a8df: jl 0x00007fa06c49a8b0
>>
>>
>> Additional testing:
>> - [x] Eyeballing JMH Samples `-prof perfasm`
>> - [x] Linux x86_64 fastdebug, `compiler/blackhole`, `compiler/c2/irTests/blackhole`
>> - [x] Linux x86_64 fastdebug, JDK benchmark corpus
>
> Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision:
>
> Do not touch memory at all
src/hotspot/share/opto/library_call.cpp line 7784:
> 7782: // side effects like breaking the optimizations across the blackhole.
> 7783:
> 7784: MemBarNode* mb = MemBarNode::make(C, Op_Blackhole);
One thing to clear up if you decide to keep modeling it as a `MemBar`: pass `AliasIdxTop` as `alias_idx`.
-------------
PR: https://git.openjdk.org/jdk/pull/11041