RFR: 8296545: C2 Blackholes should allow load optimizations

Tue Nov 8 22:12:22 UTC 2022

If you look at the generated code for a JMH benchmark like:

public class ArrayRead {
    @Param({"1", "100", "10000", "1000000"})
    int size;

    int[] is;

    @Setup
    public void setup() {
        is = new int[size];
        for (int c = 0; c < size; c++) {
            is[c] = c;
        }
    }

    @Benchmark
    public void test(Blackhole bh) {
        for (int i = 0; i < is.length; i++) {
            bh.consume(is[i]);
        }
    }
}

...then you would notice that every loop iteration re-reads `is` and `is.length`, redoes the range check, etc. -- all the things we would otherwise expect to be hoisted out of the loop.

This is because C2 blackholes are modeled as membars that pinch both control and memory slices (as you would expect from an opaque non-inlined call), so every iteration has to re-read the referenced memory contents and recompute everything that depends on those loads. This behavior is not new -- the old, non-compiler blackholes were doing the same thing, accidentally -- but there it was drowned out by the blackhole overhead. With compiler blackholes, these effects are clearly visible.

We can try to do this a bit better: allow load optimizations to work across the blackholes, leaving only the "prevent dead code elimination" part, as minimally required by blackhole semantics.
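
To make the intended effect concrete, here is a rough hand-written Java analogue of the two loop shapes. It is only an illustration of which loads get repeated, not what C2 literally emits: `HoistSketch` and `sink` are hypothetical names, and `sink` merely stands in for `Blackhole.consume` as "something that observes the value".

```java
public class HoistSketch {
    private final int[] is;
    private long consumed; // hypothetical stand-in for the blackhole's "value is used" effect

    public HoistSketch(int[] is) {
        this.is = is;
    }

    // Hypothetical sink: keeps the value observable so the loop body is not dead code.
    private void sink(int v) {
        consumed += v;
    }

    // Shape of the loop when the blackhole acts as a full barrier:
    // the field `is` and `is.length` are read again on every iteration.
    public long perIterationReloads() {
        consumed = 0;
        for (int i = 0; i < this.is.length; i++) { // re-read `is` and its length
            sink(this.is[i]);                      // re-read `is`, range check, load element
        }
        return consumed;
    }

    // Shape of the loop once loads may move across the blackhole:
    // the array reference and its length are read once, before the loop.
    public long hoistedLoads() {
        consumed = 0;
        int[] local = this.is;
        int len = local.length;
        for (int i = 0; i < len; i++) {
            sink(local[i]);
        }
        return consumed;
    }

    public static void main(String[] args) {
        HoistSketch s = new HoistSketch(new int[] {1, 2, 3, 4});
        System.out.println(s.perIterationReloads());
        System.out.println(s.hoistedLoads());
    }
}
```

Both methods observe the same values in the same order; only the amount of repeated work per iteration differs, which is the property the blackhole has to preserve.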

Motivating improvements on the benchmark above:

Benchmark        (size)  Mode  Cnt        Score   Error     Units

# Before, full Java blackholes
ArrayRead.test        1  avgt    9        5.422 ±    0.023  ns/op
ArrayRead.test      100  avgt    9      460.619 ±    0.421  ns/op
ArrayRead.test    10000  avgt    9    44697.909 ± 1964.787  ns/op
ArrayRead.test  1000000  avgt    9  4332723.304 ± 2791.324  ns/op

# Before, compiler blackholes
ArrayRead.test        1  avgt    9        1.791 ±    0.007  ns/op
ArrayRead.test      100  avgt    9      114.103 ±    1.677  ns/op
ArrayRead.test    10000  avgt    9     8528.544 ±   52.010  ns/op
ArrayRead.test  1000000  avgt    9  1005139.070 ± 2883.011  ns/op

# After, compiler blackholes
ArrayRead.test        1  avgt    9        1.686 ±    0.006  ns/op  ; ~1.1x better
ArrayRead.test      100  avgt    9       16.249 ±    0.019  ns/op  ; ~7.0x better
ArrayRead.test    10000  avgt    9     1375.265 ±    2.420  ns/op  ; ~6.2x better
ArrayRead.test  1000000  avgt    9   136862.574 ± 1057.100  ns/op  ; ~7.3x better
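
For reference, the "Nx better" annotations are the ratio of the "Before, compiler blackholes" score to the "After" score for the same size. A small sketch that redoes this arithmetic and also derives the average cost per array element (the class and method names are made up for this example; the numbers are copied from the tables above):

```java
public class SpeedupSketch {
    // Ratio of the old score to the new score; higher means the new code is faster.
    static double speedup(double beforeNs, double afterNs) {
        return beforeNs / afterNs;
    }

    // Average cost of one loop iteration, in nanoseconds.
    static double nsPerElement(double scoreNs, int size) {
        return scoreNs / size;
    }

    public static void main(String[] args) {
        int[] sizes = {1, 100, 10000, 1000000};
        // Scores in ns/op, compiler blackholes, before and after the change.
        double[] before = {1.791, 114.103, 8528.544, 1005139.070};
        double[] after = {1.686, 16.249, 1375.265, 136862.574};
        for (int k = 0; k < sizes.length; k++) {
            System.out.printf("size=%d speedup=%.1fx ns/element: %.3f -> %.3f%n",
                    sizes[k],
                    speedup(before[k], after[k]),
                    nsPerElement(before[k], sizes[k]),
                    nsPerElement(after[k], sizes[k]));
        }
    }
}
```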

`-prof perfasm` shows the reason for these improvements clearly:

Before:

          ↗  0x00007f0b54498360:   mov    0xc(%r12,%r10,8),%edx    ; range check 1
   7.97%  │  0x00007f0b54498365:   cmp    %edx,%r11d               
   1.27%  │  0x00007f0b54498368:   jae    0x00007f0b5449838f
          │  0x00007f0b5449836a:   shl    $0x3,%r10
   0.03%  │  0x00007f0b5449836e:   mov    0x10(%r10,%r11,4),%r10d  ; get "is[i]"
   7.76%  │  0x00007f0b54498373:   mov    0x10(%r9),%r10d          ; re-read "is"
   0.24%  │  0x00007f0b54498377:   mov    0x3c0(%r15),%rdx         ; safepoint poll, part 1
  17.48%  │  0x00007f0b5449837e:   inc    %r11d                    ; i++
   0.17%  │  0x00007f0b54498381:   test   %eax,(%rdx)              ; safepoint poll, part 2  
  53.26%  │  0x00007f0b54498383:   mov    0xc(%r12,%r10,8),%edx    ; loop index check
   4.84%  │  0x00007f0b54498388:   cmp    %edx,%r11d
   0.31%  ╰  0x00007f0b5449838b:   jl     0x00007f0b54498360          

After:

          ↗  0x00007fa06c49a8b0:   mov    0x2c(%rbp,%r10,4),%r9d   ; stride read
  19.66%  │  0x00007fa06c49a8b5:   mov    0x28(%rbp,%r10,4),%edx
   0.14%  │  0x00007fa06c49a8ba:   mov    0x10(%rbp,%r10,4),%ebx
  22.09%  │  0x00007fa06c49a8bf:   mov    0x14(%rbp,%r10,4),%ebx
   0.21%  │  0x00007fa06c49a8c4:   mov    0x18(%rbp,%r10,4),%ebx
  20.19%  │  0x00007fa06c49a8c9:   mov    0x1c(%rbp,%r10,4),%ebx
   0.04%  │  0x00007fa06c49a8ce:   mov    0x20(%rbp,%r10,4),%ebx
  24.02%  │  0x00007fa06c49a8d3:   mov    0x24(%rbp,%r10,4),%ebx
   0.21%  │  0x00007fa06c49a8d8:   add    $0x8,%r10d               ; i += 8
          │  0x00007fa06c49a8dc:   cmp    %esi,%r10d
   0.07%  ╰  0x00007fa06c49a8df:   jl     0x00007fa06c49a8b0       
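
In source terms, the "After" loop body corresponds roughly to the hand-unrolled sketch below: eight element loads per iteration and a single bound comparison. This is an approximation for illustration only. `UnrollSketch` and `sink` are hypothetical names, the tail loop is my assumption about how leftover elements would be handled, and plain Java source still carries the per-access range checks that the compiled loop above no longer shows.

```java
public class UnrollSketch {
    private long consumed; // hypothetical stand-in for the blackhole's "value is used" effect

    // Hypothetical sink: keeps each loaded value observable.
    private void sink(int v) {
        consumed += v;
    }

    public long consumeAll(int[] is) {
        consumed = 0;
        int len = is.length;  // read once, outside the loop
        int limit = len - 7;  // last index at which a full stride of 8 still fits
        int i = 0;
        // Main loop: one stride of 8 loads, then i += 8 and one compare.
        for (; i < limit; i += 8) {
            sink(is[i]);
            sink(is[i + 1]);
            sink(is[i + 2]);
            sink(is[i + 3]);
            sink(is[i + 4]);
            sink(is[i + 5]);
            sink(is[i + 6]);
            sink(is[i + 7]);
        }
        // Tail loop for the remaining (len % 8) elements.
        for (; i < len; i++) {
            sink(is[i]);
        }
        return consumed;
    }

    public static void main(String[] args) {
        int[] data = new int[20];
        for (int c = 0; c < data.length; c++) {
            data[c] = c;
        }
        System.out.println(new UnrollSketch().consumeAll(data));
    }
}
```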

Additional testing:
 - [x] Eyeballing JMH Samples `-prof perfasm`
 - [x] Linux x86_64 fastdebug, `compiler/blackhole`, `compiler/c2/irTests/blackhole`
 - [x] Linux x86_64 fastdebug, JDK benchmark corpus

-------------

Commit messages:
 - Fix

Changes: https://git.openjdk.org/jdk/pull/11041/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11041&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8296545
  Stats: 128 lines in 3 files changed: 127 ins; 1 del; 0 mod
  Patch: https://git.openjdk.org/jdk/pull/11041.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/11041/head:pull/11041

PR: https://git.openjdk.org/jdk/pull/11041