RFR: 8296545: C2 Blackholes should allow load optimizations
Vladimir Ivanov
vlivanov at openjdk.org
Wed Nov 9 00:47:16 UTC 2022
On Tue, 8 Nov 2022 15:48:01 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:
> If you look at the generated code for a JMH benchmark like:
>
>
> public class ArrayRead {
>     @Param({"1", "100", "10000", "1000000"})
>     int size;
>
>     int[] is;
>
>     @Setup
>     public void setup() {
>         is = new int[size];
>         for (int c = 0; c < size; c++) {
>             is[c] = c;
>         }
>     }
>
>     @Benchmark
>     public void test(Blackhole bh) {
>         for (int i = 0; i < is.length; i++) {
>             bh.consume(is[i]);
>         }
>     }
> }
>
>
> ...then you would notice that the loop always re-reads `is`, `is.length`, does the range check, etc. -- all the things we would otherwise expect to be hoisted out of the loop.
>
> This is because C2 blackholes are modeled as membars that pinch both control and memory slices (as you would expect from an opaque non-inlined call), therefore every iteration has to re-read the referenced memory contents and recompute everything dependent on those loads. This behavior is not new -- the old, non-compiler blackholes were doing the same thing, accidentally -- but it was drowned in blackhole overheads. Now, these effects are clearly visible.
>
> We can try to do this a bit better: allow load optimizations to work across the blackholes, leaving only the "prevent dead code elimination" part, as minimally required by blackhole semantics.
>
> Motivational improvements on the test above:
>
>
> Benchmark        (size)  Mode  Cnt        Score      Error  Units
>
> # Before, full Java blackholes
> ArrayRead.test        1  avgt    9        5.422 ±    0.023  ns/op
> ArrayRead.test      100  avgt    9      460.619 ±    0.421  ns/op
> ArrayRead.test    10000  avgt    9    44697.909 ± 1964.787  ns/op
> ArrayRead.test  1000000  avgt    9  4332723.304 ± 2791.324  ns/op
>
> # Before, compiler blackholes
> ArrayRead.test        1  avgt    9        1.791 ±    0.007  ns/op
> ArrayRead.test      100  avgt    9      114.103 ±    1.677  ns/op
> ArrayRead.test    10000  avgt    9     8528.544 ±   52.010  ns/op
> ArrayRead.test  1000000  avgt    9  1005139.070 ± 2883.011  ns/op
>
> # After, compiler blackholes
> ArrayRead.test        1  avgt    9        1.686 ±    0.006  ns/op ; ~1.1x better
> ArrayRead.test      100  avgt    9       16.249 ±    0.019  ns/op ; ~7.0x better
> ArrayRead.test    10000  avgt    9     1375.265 ±    2.420  ns/op ; ~6.2x better
> ArrayRead.test  1000000  avgt    9   136862.574 ± 1057.100  ns/op ; ~7.3x better
>
>
> `-prof perfasm` shows the reason for these improvements clearly:
>
> Before:
>
>
>          ↗  0x00007f0b54498360:   mov    0xc(%r12,%r10,8),%edx      ; range check 1
>   7.97%  │  0x00007f0b54498365:   cmp    %edx,%r11d
>   1.27%  │  0x00007f0b54498368:   jae    0x00007f0b5449838f
>          │  0x00007f0b5449836a:   shl    $0x3,%r10
>   0.03%  │  0x00007f0b5449836e:   mov    0x10(%r10,%r11,4),%r10d    ; get "is[i]"
>   7.76%  │  0x00007f0b54498373:   mov    0x10(%r9),%r10d            ; restore "is"
>   0.24%  │  0x00007f0b54498377:   mov    0x3c0(%r15),%rdx           ; safepoint poll, part 1
>  17.48%  │  0x00007f0b5449837e:   inc    %r11d                      ; i++
>   0.17%  │  0x00007f0b54498381:   test   %eax,(%rdx)                ; safepoint poll, part 2
>  53.26%  │  0x00007f0b54498383:   mov    0xc(%r12,%r10,8),%edx      ; loop index check
>   4.84%  │  0x00007f0b54498388:   cmp    %edx,%r11d
>   0.31%  ╰  0x00007f0b5449838b:   jl     0x00007f0b54498360
>
>
> After:
>
>
>
>          ↗  0x00007fa06c49a8b0:   mov    0x2c(%rbp,%r10,4),%r9d     ; stride read
>  19.66%  │  0x00007fa06c49a8b5:   mov    0x28(%rbp,%r10,4),%edx
>   0.14%  │  0x00007fa06c49a8ba:   mov    0x10(%rbp,%r10,4),%ebx
>  22.09%  │  0x00007fa06c49a8bf:   mov    0x14(%rbp,%r10,4),%ebx
>   0.21%  │  0x00007fa06c49a8c4:   mov    0x18(%rbp,%r10,4),%ebx
>  20.19%  │  0x00007fa06c49a8c9:   mov    0x1c(%rbp,%r10,4),%ebx
>   0.04%  │  0x00007fa06c49a8ce:   mov    0x20(%rbp,%r10,4),%ebx
>  24.02%  │  0x00007fa06c49a8d3:   mov    0x24(%rbp,%r10,4),%ebx
>   0.21%  │  0x00007fa06c49a8d8:   add    $0x8,%r10d                 ; i += 8
>          │  0x00007fa06c49a8dc:   cmp    %esi,%r10d
>   0.07%  ╰  0x00007fa06c49a8df:   jl     0x00007fa06c49a8b0
>
>
> Additional testing:
> - [x] Eyeballing JMH Samples `-prof perfasm`
> - [x] Linux x86_64 fastdebug, `compiler/blackhole`, `compiler/c2/irTests/blackhole`
> - [x] Linux x86_64 fastdebug, JDK benchmark corpus
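
For readers who want a source-level picture of the difference between the two listings quoted above, here is a rough sketch, not taken from the PR. It assumes that hoisting the `is` and `is.length` loads and unrolling by 8 can be approximated by hand in plain Java. The class name `HoistedLoopSketch`, its method names, and the use of `IntConsumer` as a stand-in for `Blackhole.consume` are all hypothetical, and the sketch says nothing about what C2 actually emits.

```java
import java.util.function.IntConsumer;

// Hypothetical stand-alone sketch; not part of the PR or of JMH.
class HoistedLoopSketch {
    int[] is;

    HoistedLoopSketch(int size) {
        is = new int[size];
        for (int c = 0; c < size; c++) {
            is[c] = c;
        }
    }

    // Source shape of the benchmark loop: every iteration names the
    // field `is` and `is.length` again.
    public void asWritten(IntConsumer sink) {
        for (int i = 0; i < is.length; i++) {
            sink.accept(is[i]);
        }
    }

    // Loads hoisted by hand: the field and its length are read once,
    // before the loop.
    public void hoistedByHand(IntConsumer sink) {
        int[] local = is;
        int len = local.length;
        for (int i = 0; i < len; i++) {
            sink.accept(local[i]);
        }
    }

    // Hoisted and unrolled by 8 by hand, with a tail loop for the
    // remaining 0..7 elements.
    public void unrolledByHand(IntConsumer sink) {
        int[] local = is;
        int len = local.length;
        int i = 0;
        for (; i <= len - 8; i += 8) {
            sink.accept(local[i]);
            sink.accept(local[i + 1]);
            sink.accept(local[i + 2]);
            sink.accept(local[i + 3]);
            sink.accept(local[i + 4]);
            sink.accept(local[i + 5]);
            sink.accept(local[i + 6]);
            sink.accept(local[i + 7]);
        }
        for (; i < len; i++) {
            sink.accept(local[i]);
        }
    }

    public static void main(String[] args) {
        HoistedLoopSketch s = new HoistedLoopSketch(100);
        long[] totals = new long[3];
        // The lambdas stand in for a consumer that keeps each value alive.
        s.asWritten(v -> totals[0] += v);
        s.hoistedByHand(v -> totals[1] += v);
        s.unrolledByHand(v -> totals[2] += v);
        System.out.println(totals[0] + " " + totals[1] + " " + totals[2]);
    }
}
```

All three methods feed the same values to the sink in the same order; they differ only in how often the field and the array length are read and in how many elements each loop iteration handles.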
src/hotspot/share/opto/library_call.cpp line 7790:

> 7788: MemBarNode* mb = MemBarNode::make(C, Op_Blackhole);
> 7789: mb->init_req(TypeFunc::Control, control());
> 7790: mb->init_req(TypeFunc::Memory, mem);

Does it need memory at all? In other words, is `Blackhole` still a `MemBar` or can it become a pure control node now?

-------------
PR: https://git.openjdk.org/jdk/pull/11041