Performance impact of decommissioning arrayStorageProperties on legacy code.
Sergey Kuksenko
sergey.kuksenko at oracle.com
Tue Jun 9 04:43:20 UTC 2020
To continue:
Performance analysis of aastore for the case of an Object array.
(the text of this email is duplicated at
http://cr.openjdk.java.net/~skuksenko/valhalla/reports/aastore/aastore.txt )
*** Performance impact of decommissioning arrayStorageProperties on
legacy code.
Note: By legacy code I mean Java code written in the pre-Valhalla
world, a.k.a. Java code without inline classes.
Note: Analysis of the performance impact on inline types is in progress.
The comparison was made between jdk-15-valhalla+1-72 and
jdk-15-valhalla+1-66, which covers all related HotSpot modifications.
Below, "baseline" means behavior with Valhalla turned off
(-XX:-EnableValhalla); no difference in baseline behavior was found
between build 66 and build 72. V-66 and v-72 denote the corresponding
Valhalla versions.
For the analysis of the aaload operation see
http://cr.openjdk.java.net/~skuksenko/valhalla/reports/aaload/aaload.txt
1. Benchmark
We use the simplest possible benchmark, writing to an array of Object:

    int size;
    Object[] a;
    Object x;

    @Benchmark
    public void write() {
        Object x = this.x;  // cache the field in a local
        for (int i = 0; i < size; i++) {
            a[i] = x;
        }
    }
We don't need to use JMH's Blackhole here (the side effect is provided
by the write to the array itself). But this has consequences: C2 hoists
the array reference (and the corresponding checks) out of the loop and
fully unrolls the loop. Given our goal of measuring the cost of a
single aastore operation, this has to be avoided.
Note: unrolling and out-of-loop hoisting happened only for ZGC,
ParallelGC and EpsilonGC. For an unknown reason it was not done for G1.
Maybe this needs attention.
Unrolling was suppressed with the -XX:LoopMaxUnroll=1 option.
Out-of-loop hoisting was suppressed by making the array reference
volatile. The cost of a volatile read on x86 doesn't differ from the
cost of an ordinary read.
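For reference, here is a minimal self-contained sketch of the adjusted
benchmark (the class name and the size value are illustrative; the
volatile field and the unroll flag are the two suppressions described
above):

    import org.openjdk.jmh.annotations.*;

    @State(Scope.Benchmark)
    @Fork(jvmArgsAppend = "-XX:LoopMaxUnroll=1")   // suppress unrolling
    public class ArrayWrite {
        @Param("100")          // illustrative size, not stated in the report
        int size;
        volatile Object[] a;   // volatile read suppresses out-of-loop hoisting
        Object x = new Object();

        @Setup
        public void setup() {
            a = new Object[size];
        }

        @Benchmark
        public void write() {
            Object x = this.x;
            for (int i = 0; i < size; i++) {
                a[i] = x;      // volatile load of 'a', then the measured aastore
            }
        }
    }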
The other aspect that should be analyzed is the dependency on the GC
and its write barrier. All existing GCs were used: G1 and ParallelGC
(which have write barriers), and ZGC and EpsilonGC (which have none).
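For completeness, a minimal sketch of how each configuration can be
selected through the JMH runner API (the GC and oops flags are standard
HotSpot options; EpsilonGC additionally needs
-XX:+UnlockExperimentalVMOptions):

    import org.openjdk.jmh.runner.Runner;
    import org.openjdk.jmh.runner.RunnerException;
    import org.openjdk.jmh.runner.options.Options;
    import org.openjdk.jmh.runner.options.OptionsBuilder;

    public class RunWrite {
        public static void main(String[] args) throws RunnerException {
            Options opt = new OptionsBuilder()
                    .include("ArrayWrite.write")   // the benchmark above
                    .jvmArgsAppend("-XX:+UseParallelGC",      // or -XX:+UseG1GC,
                                   "-XX:-UseCompressedOops")  // -XX:+UseZGC, ...
                    .build();
            new Runner(opt).run();
        }
    }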
2. Here are the performance results:

                             | baseline(ns) | v-66(ns) | v-72(ns) | v-66/baseline | v-72/baseline
G1GC (compressedOops)        |     257      |   276    |   257    |     -7.4%     |      ~0
G1GC (uncompressedOops)      |     243      |   266    |   243    |     -9.5%     |      ~0
ParallelGC (compressedOops)  |     225      |   227    |   225    |     -0.9%     |      ~0
ParallelGC (uncompressedOops)|     194      |   222    |   194    |    -14.4%     |      ~0
ZGC (uncompressedOops)       |     198      |   204    |   198    |     -3.0%     |      ~0
EpsilonGC (compressedOops)   |     182      |   192    |   182    |     -5.5%     |      ~0
EpsilonGC (uncompressedOops) |     175      |   175    |   175    |      ~0       |      ~0
Decommissioning arrayStorageProperties has a positive performance
effect on the aastore operation under all conditions. The really nice
fact is that with it aastore has no negative performance effect at all
for legacy code in Valhalla. The reason: the klass ptr is loaded for
every aastore operation anyway and checked to see whether the runtime
type of the array is Object[] (for this benchmark that is the simplest
form of the array store check). In v-66 the arrayStorageProperties bits
also had to be cleared. In v-72 there are no Valhalla checks at all (we
have already checked that it is Object[], so nothing else needs to be
done).
Looking into the generated assembly code, I didn't find any differences
between the baseline code and the v-72 Valhalla code.
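To make the v-66/v-72 difference concrete, here is a hedged pseudo-Java
sketch of the aastore fast path (not actual HotSpot code; klassOf,
clearStorageProperties, OBJECT_ARRAY_KLASS and slowPathStoreCheck are
illustrative stand-ins for VM internals):

    class AastoreSketch {
        // illustrative stubs standing in for VM internals
        static long klassOf(Object[] a)            { return 0L; }
        static long clearStorageProperties(long k) { return k;  }
        static final long OBJECT_ARRAY_KLASS = 0L;
        static void slowPathStoreCheck(Object[] a, int i, Object x) { }

        static void aastore(Object[] a, int i, Object x) {
            if (i < 0 || i >= a.length)        // range check (baseline too)
                throw new ArrayIndexOutOfBoundsException();
            long k = klassOf(a);               // klass ptr is loaded anyway
            // v-66 only: k = clearStorageProperties(k);  // gone in v-72
            if (k != OBJECT_ARRAY_KLASS) {     // simplest array store check
                slowPathStoreCheck(a, i, x);
                return;
            }
            a[i] = x;                          // the store (+ GC write barrier)
        }
    }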
--------
So, across the 4 cases (aastore/aaload x compressedOops/uncompressedOops),
decommissioning arrayStorageProperties gave a performance win in 3 of
them.
On 6/6/20 12:30 AM, Sergey Kuksenko wrote:
> (the text of this email is duplicated at
> http://cr.openjdk.java.net/~skuksenko/valhalla/reports/aaload/aaload.txt )
>
> *** Performance impact of decommissioning arrayStorageProperties on
> legacy code.
>
> Note: By legacy code I mean Java code written in the pre-Valhalla
> world, a.k.a. Java code without inline classes.
>
> Note: Analysis of the performance impact on inline types is in progress.
>
> The comparison was made between jdk-15-valhalla+1-72 and
> jdk-15-valhalla+1-66, which covers all related HotSpot modifications.
>
> Below, "baseline" means behavior with Valhalla turned off
> (-XX:-EnableValhalla); no difference in baseline behavior was found
> between build 66 and build 72.
> V-66 and v-72 denote the corresponding Valhalla versions.
>
> 1. General picture.
>
> About 160 benchmarks were checked. ~30 of them are big or mid-size
> third-party benchmarks (SPEC..., DaCapo, Volano); all the others are
> a subset of our microbenchmark base. Only -XX:+UseCompressedOops was
> checked.
>
> No significant changes were found.
>
> - 16 benchmarks got a speedup (v-66 -> v-72), typically around +5%
> (some up to +10%)
> - 14 benchmarks got a degradation (v-66 -> v-72), typically around -5%
> (some up to -10%)
>
> In the checked benchmark base the majority of benchmarks have the
> same performance as the baseline, but 15 benchmarks are slower than
> the baseline by up to 10% (baseline vs v-72).
>
> On the one hand, given that Valhalla changes typically cause
> benchmark jitter within 10%, we do not consider performance changes
> of less than 10% significant
> (this threshold will be lowered as Valhalla matures). On the other
> hand, 10% of the benchmarks (from the selected set) degrade with
> Valhalla, which means we can't leave it as is
> and should solve it sooner or later; otherwise there is a high
> chance of negative acceptance by the Java community.
>
>
> 2. Detailed "aaload" analysis (other array operations are in progress).
>
> For the analysis the simplest possible benchmark was used:
>
>     int size;
>     Object[] a1;
>
>     @Benchmark
>     public void read(Blackhole bh) {
>         for (int i = 0; i < size; i++) {
>             bh.consume(a1[i]);
>         }
>     }
>
> The array reference is intentionally loaded from the field (a1) on
> each iteration, given that HotSpot is pretty good at hoisting array
> checks out of the loop (at least for a single array).
> Object[] is used for a similar reason: HotSpot doesn't generate
> "array of inline" checks if it can prove that inline types can't be
> used here. In this microbenchmark the if-array-is-flattened check is
> performed on each iteration.
> The array size used is 100 (checking larger arrays didn't show any
> unique behavior for this particular benchmark).
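> To spell out that per-iteration check, here is a hedged pseudo-Java
> sketch of what the compiled aaload conceptually does (not actual
> HotSpot code; klassOf, layoutHelperOf, FLAT_ARRAY_TAG and
> loadFlattened are illustrative stand-ins for VM internals):
>
>     class AaloadSketch {
>         static long klassOf(Object[] a)    { return 0L; }  // stub
>         static int  layoutHelperOf(long k) { return 0;  }  // stub
>         static Object loadFlattened(Object[] a, int i) { return null; } // stub
>         static final int FLAT_ARRAY_TAG = -3;              // illustrative
>
>         static Object aaload(Object[] a, int i) {
>             if (i < 0 || i >= a.length)
>                 throw new ArrayIndexOutOfBoundsException();
>             long k  = klassOf(a);              // load and unpack the klass ptr
>             int  lh = layoutHelperOf(k);       // extra dereference vs. baseline
>             if ((lh >> 29) == FLAT_ARRAY_TAG)  // if-array-is-flattened check
>                 return loadFlattened(a, i);    // never taken for Object[]
>             return a[i];                       // the ordinary load
>         }
>     }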
> * v-66 -> v-72
> The benchmark performance depends on whether compressed or
> uncompressed oops are used. Moreover, the (base) kind of compressed
> oops differs from the (base+shift) kind, and v-72 also depends on
> whether the klass pointer is compressed.
> Here are the results, time in nanoseconds:
>
>                             | baseline | v-66 | v-72 | v-72 + -XX:-UseCompressedClassPointers
> CompressedOops(base)        |   485    | 555  | 645  | 630
> CompressedOops(base+shift)  |   500    | 620  | 700  | 650
> UncompressedOops            |   530    | 655  | 570  |
>
> Decommissioning arrayStorageProperties leads to a +13% speedup in
> the uncompressed oops case and to -16% (base) and -13% (base+shift)
> degradations for compressed oops (v-66 vs v-72).
> Here is how much each Valhalla version is slower than the baseline:
>
>                             | v-66 | v-72
> CompressedOops(base)        | -14% | -33%
> CompressedOops(base+shift)  | -24% | -40%
> UncompressedOops            | -24% |  -8%
>
> In the uncompressed oops case we got a really positive result, but
> compressed oops got a significant slowdown. Please note that all
> times and ratios above refer to the performance of the benchmark, not
> to the performance of the "aaload" operation itself. The JMH code
> around the benchmark has its own cost, and the smaller the examined
> operation, the larger that effect.
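> For example, in the CompressedOops(base) row the v-72 benchmark-level
> slowdown is (645 - 485) / 485 ≈ 33%; with size == 100 that is
> (645 - 485) / 100 = 1.6 ns of extra cost per aaload, on top of
> 485 / 100 ≈ 4.9 ns per baseline iteration (which still includes the
> loop and JMH overhead).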
> The performance degradation in the compressed oops case is caused by
> a set of chained reasons:
> - the tag check is in the Klass -> an additional dereference
> - unpacking the klass pointer -> the same scratch register (r12) is
> used as the base for compressed klass pointers and as the base
> register for compressed oops -> more instructions to manage the base
> address register
> - the tag being checked is not a single bit -> an extra register is
> required -> more register spilling
> Thorough profiling and factoring out the JMH impact have shown that
> "v-66 compressed aaload" is 2x slower than baseline aaload, while
> "v-72 compressed aaload" is 3x slower than baseline.
> "v-66 uncompressed aaload" is 1.5x slower than baseline aaload,
> while "v-72 uncompressed aaload" is 1.3x slower. The key reason is
> the larger number of instructions; there are no cache or memory
> behavior differences between v-66 and v-72.
> Here is the compressed v-72 code with some comments and questions:
>
>   mov    0x10(%r10),%ebp         ; #1  *getfield, load reference to array
>   mov    0xc(%r12,%rbp,8),%r10d  ; #2  load array length (implicit oop
>                                  ;     unpacking via x86 memory addressing)
>   mov    0x8(%rsp),%r11d         ; #3
>   cmp    %r10d,%r11d             ; #4
>   jae    0x00007f51c7a6e766      ; #5  lines #2-#5 - range check
>   mov    0x8(%r12,%rbp,8),%r10d  ; #6  load klass ptr
>   lea    (%r12,%rbp,8),%rdi      ; #7  uncompress array oop to rdi
>   shl    $0x3,%r10               ; #8
>   movabs $0x800000000,%r12       ; #9  0x800000000 - klass ptr base
>   add    %r12,%r10               ; #10 #8-#10 uncompress klass ptr
>   xor    %r12,%r12               ; #11 restore r12 to the oops base (zero here)
>   mov    0x8(%r10),%r8d          ; #12 load layout helper
>   sar    $0x1d,%r8d              ; #13
>   cmp    $0xfffffffd,%r8d        ; #14 #12-#14 check the layout helper tag
>   jne    0x00007f51c7a6e643      ; #15
>   mov    0x10(%rdi,%r11,4),%r11d ; #16 load compressed ref from the array
>   mov    %r11,%rbx               ; #17
>   shl    $0x3,%rbx               ; #18 finally the ref from the array is
>                                  ;     uncompressed into rbx
> * Line #2 (range check) and line #7 do the same job - uncompressing
> the array oop. Why not combine these actions?
> * Lines #8, #10 and #12 unpack the klass ptr and load the layout
> helper. Why not do it the same way as in line #2 (a single
> instruction, with the unpacking folded into the addressing)?
> * Lines #12, #13, #14 check the high byte of the layout helper for
> the value 0xA0 (value type array). 0xA is binary 1010, and the
> highest bit is 1 for all kinds of arrays.
> HotSpot knows statically that we have an array here, so there is no
> need to check that bit.
> Only one bit needs to be checked, and that can be done with a "test"
> instruction -> one register saved -> less register pressure, less
> spilling -> less code.
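> In Java terms the suggestion looks roughly like this (a hedged
> sketch; the tag value and bit position are illustrative, not the real
> layout helper encoding):
>
>     class TagCheckSketch {
>         // multi-bit tag compare, as in #13-#14 above: the shifted tag
>         // value needs its own register before the compare (sar + cmp)
>         static boolean isFlatArrayMultiBit(int layoutHelper) {
>             return (layoutHelper >> 29) == -3;
>         }
>
>         // single-bit check: should compile to a single 'test reg,imm'
>         // instruction and needs no extra register
>         static boolean isFlatArraySingleBit(int layoutHelper) {
>             return (layoutHelper & (1 << 30)) != 0;  // illustrative bit
>         }
>     }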
> ****
> That was the analysis of the hot aaload instruction, where all memory
> is in caches. Cold aaload behavior was also checked, using another
> benchmark with a large number of different arrays that can't fit into
> the CPU cache.
> As expected, a high number of LLC misses was observed. At the same
> time it was shown that decommissioning arrayStorageProperties
> didn't increase cache misses. Walking into the Klass doesn't cause
> cache misses, due to the limited number of Klasses. All extra (extra
> in comparison with the baseline) cache misses happen when the
> markword or klass ptr is read.
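> A minimal sketch of such a cold-array benchmark (the class name,
> array count and array size are illustrative; the point is that the
> total footprint exceeds the LLC):
>
>     import org.openjdk.jmh.annotations.*;
>
>     @State(Scope.Benchmark)
>     public class ColdRead {
>         static final int ARRAYS = 1 << 16;   // illustrative count
>         Object[][] arrays;
>         int next;
>
>         @Setup
>         public void setup() {
>             arrays = new Object[ARRAYS][];
>             for (int i = 0; i < ARRAYS; i++) {
>                 arrays[i] = new Object[100];
>             }
>         }
>
>         @Benchmark
>         public Object read() {
>             Object[] a = arrays[next];        // a different, likely cold
>             next = (next + 1) & (ARRAYS - 1); // array on each call
>             return a[0];                      // the measured aaload
>         }
>     }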
> 3. In general, the performance regressions caused by Valhalla checks
> have 3 reasons:
> - An increased number of instructions. More work (checks) has to be
> done.
> - Complex tags and masks. Having a non-single-bit mask is not an
> issue in itself, but it always occupies a register, causes more and
> more register spilling (like an avalanche), and may wreck the
> performance of a tight, sensitive loop. In particular, that induced
> register spilling is the source of the regression. I will advocate
> for single-bit masks as much as possible. As for the layout helper
> tag: 3 values in 8 bits is more than enough.
> By the way: we don't have CMS anymore, biased locking is going
> away, and the markword has become simpler. Could we find one bit in
> the markword to mark inline type objects?
> - More memory loads and cache misses. Unavoidable. The only way out
> is to make out-of-loop hoisting and check elimination better and
> better.