Performance impact of decommissioning arrayStorageProperties on legacy code.

Wed Jun 10 12:11:53 UTC 2020

Filed:

"8247298: [lworld] Review use of oopDesc & mark word for alternative 
inline type behavior"

https://bugs.openjdk.java.net/browse/JDK-8247298

Seems like we need easy-to-test bits in oopDesc for the following two 
cases...

1) Inline type
2) Flat inline type array

*Anything else ?*

I have some thoughts on how; we can talk about it today. Here's a starter...

1) Single sentinel monitor: it can be placed in low-memory, page-fault 
country, and it is easy to find the code that needs to test for it (there 
is some GC code). That said, the GC folks are working on changes so they 
don't interact with the monitor in mainline, sometime around JDK 16.

2) Biased locking is being deprecated, so steal the 3rd-last bit (the 
"bias lock bit") before someone else does. Valhalla can even ignore 
biased locking from now on.

Both of these solutions work on 32-bit, and not a lot of conditional 
"if arch" and "if feature" code is required: a relatively simple compare 
and a bit test.
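
As a rough illustration of the two tests above, here is a minimal Python sketch that models the mark word as a plain integer. Every constant is an assumption for illustration: `LOCK_MASK`, `MONITOR_TAG` and `BIAS_BIT` reflect one reading of the low mark word bits, and `SENTINEL_MONITOR_ADDR` is a made-up address, not something taken from the lworld code.

```python
# Sketch only: the mark word is modelled as a plain integer.
# All constants below are assumptions for illustration, not HotSpot definitions.
LOCK_MASK = 0b011                # assumed: the two lowest bits hold the lock state
MONITOR_TAG = 0b010              # assumed: lock state meaning "has a monitor"
BIAS_BIT = 0b100                 # assumed: the 3rd-last bit ("bias lock bit")
SENTINEL_MONITOR_ADDR = 0x1000   # hypothetical address of the single sentinel monitor

# Mark value an object would carry if it pointed at the sentinel monitor.
SENTINEL_MARK = SENTINEL_MONITOR_ADDR | MONITOR_TAG


def is_inline_by_sentinel(mark: int) -> bool:
    """Option 1: a single compare against the sentinel monitor's mark value."""
    return mark == SENTINEL_MARK


def is_inline_by_bit(mark: int) -> bool:
    """Option 2: a single bit test on the stolen bias lock bit."""
    return (mark & BIAS_BIT) != 0
```

Either predicate is one compare or one bit test, with no "if arch" or "if feature" branching, which is the property the two options are meant to share.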

Thoughts ?

/Simms

On 2020-06-10 06:52, Sergey Kuksenko wrote:
>
>   Update.
>
>   New analysis was done with a modified benchmark to cover polymorphic 
> array stores. The array store was mixed across arrays of Object, of an 
> interface, of an abstract class and of a concrete class.
>
>   Here are performance results for polymorphic array store:
>
>
>                                | baseline (ns) | v-66 (ns) | v-72 (ns) | v-66/baseline | v-72/baseline | v-72/v-66
> G1GC (compressedOops)          |      380      |    445    |    420    |    -17.1%     |    -10.5%     |    5.6%
> G1GC (uncompressedOops)        |      300      |    400    |    390    |    -33.3%     |    -30.0%     |    2.5%
> ParallelGC (compressedOops)    |      310      |    360    |    350    |    -16.1%     |    -12.9%     |    2.8%
> ParallelGC (uncompressedOops)  |      284      |    330    |    300    |    -16.2%     |     -5.6%     |    9.1%
> ZGC (uncompressedOops)         |      285      |    314    |    310    |    -10.2%     |     -8.8%     |    1.3%
> EpsilonGC (compressedOops)     |      284      |    340    |    320    |    -19.7%     |    -12.7%     |    5.9%
> EpsilonGC (uncompressedOops)   |      277      |    294    |    300    |     -6.1%     |     -8.3%     |   -2.0%
>
>
>
>   New column added - speedup v-72 over v-66.
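
For reference, the percentage columns in the table are consistent with the arithmetic sketched below, where a negative value means slower than the reference and a positive value means faster. This is a reconstruction from the numbers, not the script used for the report, and the function name is made up for illustration.

```python
def percent_columns(baseline_ns: float, v66_ns: float, v72_ns: float):
    """Return (v-66/baseline, v-72/baseline, v-72/v-66) in percent.

    Each value is the relative time saved against the reference,
    so a slowdown comes out negative. Rounded to one decimal place.
    """
    v66_vs_baseline = (baseline_ns - v66_ns) / baseline_ns * 100.0
    v72_vs_baseline = (baseline_ns - v72_ns) / baseline_ns * 100.0
    v72_vs_v66 = (v66_ns - v72_ns) / v66_ns * 100.0
    return tuple(round(x, 1) for x in (v66_vs_baseline, v72_vs_baseline, v72_vs_v66))
```

For example, `percent_columns(380, 445, 420)` gives `(-17.1, -10.5, 5.6)`, matching the G1GC (compressedOops) row.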
>
>   For polymorphic array stores the picture is not as bright, but 
> decommissioning arrayStorageProperties still gives a speedup (except 
> in 1 case).
>   With polymorphic array stores the Klass is always accessed, and 
> clearing the extra bits from the klass ptr has a negative effect. By 
> the way, which field of Klass has offset 0xE8?
>
>   What is interesting is the quite large difference between baseline 
> and both Valhalla versions in the case of G1GC.
>   Comparing the generated code of baseline and v-72, two differences 
> were found:
>
>   1. Different layout of basic blocks (some jumps are inverted, je -> 
> jne).
>      But this shouldn't be the source of the regression: profiling has 
> shown that the number of branches and branch misses is the same for 
> baseline and Valhalla.
>
>   2. Access to the layout helper and a check whether it's an array of 
> values.
>
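>      ...
>      mov    0x8(%r10),%r8d        ; load the layout helper
>      mov    %edx,%r12d
>      sar    $0x1d,%r8d            ; arithmetic shift by 29: keep the top 3 bits
>      cmp    $0xfffffffd,%r8d      ; compare with the array-of-values tag
>      je     0x00007fab202b2d96    ; taken when the tag matches
>      ...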
>
>
> Tobias, what do you think? Does it make sense to play with the layout 
> helper? Nothing prevents us from making 1-bit tags, using test & jump, 
> and checking what we get.
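
To make the comparison concrete, here is a small Python sketch of the two check shapes: the shift-and-compare seen in the generated code, and a single-bit test of the kind suggested above. The value `0xfffffffd` and the shift by `0x1d` come from the listing; `ONE_BIT_TAG` is a hypothetical bit position chosen for illustration, not an existing layout helper encoding.

```python
def to_int32(x: int) -> int:
    """Interpret the low 32 bits of x as a signed 32-bit integer."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


VALUE_ARRAY_TAG = to_int32(0xFFFFFFFD)  # constant from the cmp in the listing (-3)
ONE_BIT_TAG = 1 << 29                   # hypothetical single-bit tag position


def is_value_array_shift_cmp(layout_helper: int) -> bool:
    """Current shape: sar $0x1d, then cmp against 0xfffffffd."""
    # Python's >> on a negative int is an arithmetic shift, like sar.
    return (to_int32(layout_helper) >> 29) == VALUE_ARRAY_TAG


def is_value_array_bit_test(layout_helper: int) -> bool:
    """Suggested shape: one test & jump on a single tag bit."""
    return (layout_helper & ONE_BIT_TAG) != 0
```

With a one-bit tag the shift and the compare against a constant collapse into a single test, which is the experiment being proposed; whether that is measurable in the benchmark is exactly the open question.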
>
>
> On 6/9/20 8:13 AM, Tobias Hartmann wrote:
>> Hi Sergey,
>>
>> thanks again for the nice report! Comments below.
>>
>> On 09.06.20 06:43, Sergey Kuksenko wrote:
>>>    Note: Unrolling and hoisting were done only for ZGC, 
>>> ParallelGC and EpsilonGC. They were not
>>> done for G1 for an unknown reason. Maybe this needs attention.
>> That's unexpected. Is it the same with mainline?
>>
>>>    Decommissioning arrayStorageProperties has a positive performance 
>>> effect on the aastore operation under all
>>> conditions. The really nice fact is that aastore has no negative 
>>> performance effect at all
>>> for legacy code in Valhalla. The klass ptr is loaded for 
>>> every aastore operation and
>>> checked to see whether the runtime type of the array is Object[] (for 
>>> this benchmark it's the simplest form of array
>>> store check). In v-66 the arrayStorageProperties bits have to be 
>>> cleared first.
>>>    In v-72 there are no Valhalla checks at all (we have already checked 
>>> that it's Object[], so we don't need to
>>> do anything else).
>> Right. This is because C2 speculates on the array being monomorphic 
>> (MonomorphicArrayCheck
>> optimization) and we can then omit all inline type specific checks. 
>> Have you checked with a
>> polymorphic array store? In that case you should see flat/null-free 
>> checks and these will have an
>> impact on performance.
>>
>> Thanks,
>> Tobias