RFR: 8296545: C2 Blackholes should allow load optimizations [v2]
Vladimir Ivanov
vlivanov at openjdk.org
Wed Nov 9 20:33:29 UTC 2022
On Wed, 9 Nov 2022 11:57:45 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:
>> If you look at generated code for the JMH benchmark like:
>>
>>
>> public class ArrayRead {
>>     @Param({"1", "100", "10000", "1000000"})
>>     int size;
>>
>>     int[] is;
>>
>>     @Setup
>>     public void setup() {
>>         is = new int[size];
>>         for (int c = 0; c < size; c++) {
>>             is[c] = c;
>>         }
>>     }
>>
>>     @Benchmark
>>     public void test(Blackhole bh) {
>>         for (int i = 0; i < is.length; i++) {
>>             bh.consume(is[i]);
>>         }
>>     }
>> }
>>
>>
>> ...then you would notice that the loop always re-reads `is`, `is.length`, does the range check, etc. -- all the things we would otherwise expect to be hoisted out of the loop.
>>
>> This is because C2 blackholes are modeled as membars that pinch both control and memory slices (like you would expect from the opaque non-inlined call), therefore every iteration has to re-read the referenced memory contents and recompute everything dependent on those loads. This behavior is not new -- the old, non-compiler blackholes were doing the same thing, accidentally -- but it was drowned in blackhole overheads. Now, these effects are clearly visible.
>>
>> We can try to do this a bit better: allow load optimizations to work across the blackholes, leaving only the "prevent dead code elimination" part, as minimally required by blackhole semantics.
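In plain Java terms, the transformation this enables is roughly equivalent to hoisting the invariant loads out of the loop by hand. A sketch (not the actual compiler transformation; the class and method names are illustrative):

```java
public class HoistSketch {
    static int[] is = {0, 1, 2, 3};

    // Before: with the blackhole modeled as a full membar, every iteration
    // conceptually re-reads the 'is' field and its length.
    static int sumRereading() {
        int s = 0;
        for (int i = 0; i < is.length; i++) {
            s += is[i];
        }
        return s;
    }

    // After: the loads of 'is' and 'is.length' may float above the loop,
    // as if the code had been written like this by hand.
    static int sumHoisted() {
        int[] a = is;      // field load hoisted out of the loop
        int n = a.length;  // length load hoisted out of the loop
        int s = 0;
        for (int i = 0; i < n; i++) {
            s += a[i];
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(sumRereading() + " " + sumHoisted());
    }
}
```

Both versions compute the same result; the difference is only in how much work is redone per iteration.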
>>
>> Motivational improvements on the test above:
>>
>>
>> Benchmark              (size)  Mode  Cnt        Score      Error  Units
>>
>> # Before, full Java blackholes
>> ArrayRead.test              1  avgt    9        5.422 ±    0.023  ns/op
>> ArrayRead.test            100  avgt    9      460.619 ±    0.421  ns/op
>> ArrayRead.test          10000  avgt    9    44697.909 ± 1964.787  ns/op
>> ArrayRead.test        1000000  avgt    9  4332723.304 ± 2791.324  ns/op
>>
>> # Before, compiler blackholes
>> ArrayRead.test              1  avgt    9        1.791 ±    0.007  ns/op
>> ArrayRead.test            100  avgt    9      114.103 ±    1.677  ns/op
>> ArrayRead.test          10000  avgt    9     8528.544 ±   52.010  ns/op
>> ArrayRead.test        1000000  avgt    9  1005139.070 ± 2883.011  ns/op
>>
>> # After, compiler blackholes
>> ArrayRead.test              1  avgt    9        1.686 ±    0.006  ns/op  ; ~1.1x better
>> ArrayRead.test            100  avgt    9       16.249 ±    0.019  ns/op  ; ~7.0x better
>> ArrayRead.test          10000  avgt    9     1375.265 ±    2.420  ns/op  ; ~6.2x better
>> ArrayRead.test        1000000  avgt    9   136862.574 ± 1057.100  ns/op  ; ~7.3x better
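The quoted improvement factors follow directly from the "compiler blackholes" score columns. A quick arithmetic check (class name is illustrative; the numbers are copied from the tables above):

```java
import java.util.Locale;

public class SpeedupCheck {
    public static void main(String[] args) {
        // Scores (ns/op) for sizes 1, 100, 10000, 1000000.
        double[] before = {1.791, 114.103, 8528.544, 1005139.070};
        double[] after  = {1.686,  16.249, 1375.265,  136862.574};
        for (int i = 0; i < before.length; i++) {
            // Locale.ROOT keeps the decimal point independent of the host locale.
            System.out.printf(Locale.ROOT, "~%.1fx better%n", before[i] / after[i]);
        }
    }
}
```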
>>
>>
>> `-prof perfasm` shows the reason for these improvements clearly:
>>
>> Before:
>>
>>
>> ↗ 0x00007f0b54498360: mov 0xc(%r12,%r10,8),%edx ; range check 1
>> 7.97% │ 0x00007f0b54498365: cmp %edx,%r11d
>> 1.27% │ 0x00007f0b54498368: jae 0x00007f0b5449838f
>> │ 0x00007f0b5449836a: shl $0x3,%r10
>> 0.03% │ 0x00007f0b5449836e: mov 0x10(%r10,%r11,4),%r10d ; get "is[i]"
>> 7.76% │ 0x00007f0b54498373: mov 0x10(%r9),%r10d ; restore "is"
>> 0.24% │ 0x00007f0b54498377: mov 0x3c0(%r15),%rdx ; safepoint poll, part 1
>> 17.48% │ 0x00007f0b5449837e: inc %r11d ; i++
>> 0.17% │ 0x00007f0b54498381: test %eax,(%rdx) ; safepoint poll, part 2
>> 53.26% │ 0x00007f0b54498383: mov 0xc(%r12,%r10,8),%edx ; loop index check
>> 4.84% │ 0x00007f0b54498388: cmp %edx,%r11d
>> 0.31% ╰ 0x00007f0b5449838b: jl 0x00007f0b54498360
>>
>>
>> After:
>>
>>
>>
>> ↗ 0x00007fa06c49a8b0: mov 0x2c(%rbp,%r10,4),%r9d ; stride read
>> 19.66% │ 0x00007fa06c49a8b5: mov 0x28(%rbp,%r10,4),%edx
>> 0.14% │ 0x00007fa06c49a8ba: mov 0x10(%rbp,%r10,4),%ebx
>> 22.09% │ 0x00007fa06c49a8bf: mov 0x14(%rbp,%r10,4),%ebx
>> 0.21% │ 0x00007fa06c49a8c4: mov 0x18(%rbp,%r10,4),%ebx
>> 20.19% │ 0x00007fa06c49a8c9: mov 0x1c(%rbp,%r10,4),%ebx
>> 0.04% │ 0x00007fa06c49a8ce: mov 0x20(%rbp,%r10,4),%ebx
>> 24.02% │ 0x00007fa06c49a8d3: mov 0x24(%rbp,%r10,4),%ebx
>> 0.21% │ 0x00007fa06c49a8d8: add $0x8,%r10d ; i += 8
>> │ 0x00007fa06c49a8dc: cmp %esi,%r10d
>> 0.07% ╰ 0x00007fa06c49a8df: jl 0x00007fa06c49a8b0
>>
>>
>> Additional testing:
>> - [x] Eyeballing JMH Samples `-prof perfasm`
>> - [x] Linux x86_64 fastdebug, `compiler/blackhole`, `compiler/c2/irTests/blackhole`
>> - [x] Linux x86_64 fastdebug, JDK benchmark corpus
>
> Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision:
>
> Do not touch memory at all
src/hotspot/share/opto/library_call.cpp line 7784:
> 7782: // side effects like breaking the optimizations across the blackhole.
> 7783:
> 7784: MemBarNode* mb = MemBarNode::make(C, Op_Blackhole);
One thing to clear up if you decide to keep modeling it as a `MemBar`: pass `AliasIdxTop` as `alias_idx`.
-------------
PR: https://git.openjdk.org/jdk/pull/11041