RFR: 8296545: C2 Blackholes should allow load optimizations [v2]

Wed Nov 9 11:57:45 UTC 2022

> If you look at the generated code for a JMH benchmark like:
> 
> 
> public class ArrayRead {
>     @Param({"1", "100", "10000", "1000000"})
>     int size;
> 
>     int[] is;
> 
>     @Setup
>     public void setup() {
>         is = new int[size];
>         for (int c = 0; c < size; c++) {
>             is[c] = c;
>         }
>     }
> 
>     @Benchmark
>     public void test(Blackhole bh) {
>         for (int i = 0; i < is.length; i++) {
>             bh.consume(is[i]);
>         }
>     }
> }
> 
> 
> ...then you would notice that the loop always re-reads `is`, `is.length`, does the range check, etc. -- all the things we would otherwise expect to be hoisted out of the loop.
> 
> This is because C2 blackholes are modeled as membars that pinch both control and memory slices (like you would expect from an opaque non-inlined call), so every iteration has to re-read the referenced memory contents and recompute everything dependent on those loads. This behavior is not new -- the old, non-compiler blackholes were doing the same thing, accidentally -- but it was drowned in blackhole overheads. Now, these effects are clearly visible.
> 
> We can try to do this a bit better: allow load optimizations to work across the blackholes, leaving only the "prevent dead code elimination" part, as minimally required by blackhole semantics.
> 
> Motivational improvements on the test above:
> 
> 
> Benchmark        (size)  Mode  Cnt        Score   Error     Units
> 
> # Before, full Java blackholes
> ArrayRead.test        1  avgt    9        5.422 ±    0.023  ns/op
> ArrayRead.test      100  avgt    9      460.619 ±    0.421  ns/op
> ArrayRead.test    10000  avgt    9    44697.909 ± 1964.787  ns/op
> ArrayRead.test  1000000  avgt    9  4332723.304 ± 2791.324  ns/op
> 
> # Before, compiler blackholes
> ArrayRead.test        1  avgt    9        1.791 ±    0.007  ns/op
> ArrayRead.test      100  avgt    9      114.103 ±    1.677  ns/op
> ArrayRead.test    10000  avgt    9     8528.544 ±   52.010  ns/op
> ArrayRead.test  1000000  avgt    9  1005139.070 ± 2883.011  ns/op
> 
> # After, compiler blackholes
> ArrayRead.test        1  avgt    9        1.686 ±    0.006  ns/op  ; ~1.1x better
> ArrayRead.test      100  avgt    9       16.249 ±    0.019  ns/op  ; ~7.0x better
> ArrayRead.test    10000  avgt    9     1375.265 ±    2.420  ns/op  ; ~6.2x better
> ArrayRead.test  1000000  avgt    9   136862.574 ± 1057.100  ns/op  ; ~7.3x better
> 
> 
> `-prof perfasm` shows the reason for these improvements clearly:
> 
> Before:
> 
> 
>           ↗  0x00007f0b54498360:   mov    0xc(%r12,%r10,8),%edx    ; range check 1
>    7.97%  │  0x00007f0b54498365:   cmp    %edx,%r11d               
>    1.27%  │  0x00007f0b54498368:   jae    0x00007f0b5449838f
>           │  0x00007f0b5449836a:   shl    $0x3,%r10
>    0.03%  │  0x00007f0b5449836e:   mov    0x10(%r10,%r11,4),%r10d  ; get "is[i]"
>    7.76%  │  0x00007f0b54498373:   mov    0x10(%r9),%r10d          ; restore "is"
>    0.24%  │  0x00007f0b54498377:   mov    0x3c0(%r15),%rdx         ; safepoint poll, part 1
>   17.48%  │  0x00007f0b5449837e:   inc    %r11d                    ; i++
>    0.17%  │  0x00007f0b54498381:   test   %eax,(%rdx)              ; safepoint poll, part 2  
>   53.26%  │  0x00007f0b54498383:   mov    0xc(%r12,%r10,8),%edx    ; loop index check
>    4.84%  │  0x00007f0b54498388:   cmp    %edx,%r11d
>    0.31%  ╰  0x00007f0b5449838b:   jl     0x00007f0b54498360          
> 
> 
> After:
> 
> 
>           ↗  0x00007fa06c49a8b0:   mov    0x2c(%rbp,%r10,4),%r9d   ; stride read
>   19.66%  │  0x00007fa06c49a8b5:   mov    0x28(%rbp,%r10,4),%edx
>    0.14%  │  0x00007fa06c49a8ba:   mov    0x10(%rbp,%r10,4),%ebx
>   22.09%  │  0x00007fa06c49a8bf:   mov    0x14(%rbp,%r10,4),%ebx
>    0.21%  │  0x00007fa06c49a8c4:   mov    0x18(%rbp,%r10,4),%ebx
>   20.19%  │  0x00007fa06c49a8c9:   mov    0x1c(%rbp,%r10,4),%ebx
>    0.04%  │  0x00007fa06c49a8ce:   mov    0x20(%rbp,%r10,4),%ebx
>   24.02%  │  0x00007fa06c49a8d3:   mov    0x24(%rbp,%r10,4),%ebx
>    0.21%  │  0x00007fa06c49a8d8:   add    $0x8,%r10d               ; i += 8
>           │  0x00007fa06c49a8dc:   cmp    %esi,%r10d
>    0.07%  ╰  0x00007fa06c49a8df:   jl     0x00007fa06c49a8b0       
> 
> 
> Additional testing:
>  - [x] Eyeballing JMH Samples `-prof perfasm`
>  - [x] Linux x86_64 fastdebug, `compiler/blackhole`, `compiler/c2/irTests/blackhole`
>  - [x] Linux x86_64 fastdebug, JDK benchmark corpus
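
As a source-level illustration of the hoisting discussed in the description above, the sketch below contrasts the loop as written with a hand-hoisted equivalent. It is only an analogy: the `HoistingSketch` class and its `consume` method are hypothetical stand-ins (not the JMH `Blackhole` API), and the code does not reproduce what C2 actually emits. In particular, the unrolling visible in the "After" listing is not modeled.

```java
// Hypothetical illustration only; not part of the patch and not JMH code.
final class HoistingSketch {
    // Non-final field, like `is` in the ArrayRead benchmark.
    int[] is;

    // Hypothetical sink state, so that consumed values are observable.
    long sum;

    HoistingSketch(int size) {
        is = new int[size];
        for (int c = 0; c < size; c++) {
            is[c] = c;
        }
    }

    // Hypothetical stand-in for Blackhole.consume(int).
    void consume(int v) {
        sum += v;
    }

    // The loop as written in the benchmark. If the sink acts as a full
    // barrier, `is` and `is.length` have to be re-read on every iteration.
    long asWritten() {
        sum = 0;
        for (int i = 0; i < is.length; i++) {
            consume(is[i]);
        }
        return sum;
    }

    // The same loop with the field load and the length load hoisted by hand,
    // roughly what load optimizations could do once the sink no longer
    // blocks them.
    long handHoisted() {
        sum = 0;
        int[] local = is;        // read the field once
        int len = local.length;  // read the length once
        for (int i = 0; i < len; i++) {
            consume(local[i]);
        }
        return sum;
    }

    public static void main(String[] args) {
        HoistingSketch s = new HoistingSketch(100);
        // Both variants consume the same values: prints "4950 4950".
        System.out.println(s.asWritten() + " " + s.handHoisted());
    }
}
```

Both variants consume exactly the same values; the only difference is how often `is` and `is.length` are loaded, which is the cost that the "Before" listing makes visible.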

Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision:

  Do not touch memory at all

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/11041/files
  - new: https://git.openjdk.org/jdk/pull/11041/files/5a91ed9a..1ca2febe

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=11041&range=01
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11041&range=00-01

  Stats: 8 lines in 2 files changed: 0 ins; 5 del; 3 mod
  Patch: https://git.openjdk.org/jdk/pull/11041.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/11041/head:pull/11041

PR: https://git.openjdk.org/jdk/pull/11041