RFR: 8221404: C2: Convert RegMask and IndexSet to use uintptr_t

Sun Nov 8 20:47:04 UTC 2020

On Fri, 6 Nov 2020 21:55:47 GMT, Vladimir Kozlov <kvn at openjdk.org> wrote:

>> This patch refactors RegMask and IndexSet to use uintptr_t rather than int for storage, which may shorten some code paths and loops on 64-bit VMs. Making storage unsigned further allows for a few simplifications, e.g. in is_bound_set, where the logic that dealt with sign extension is no longer needed.
>> 
>> To evaluate the performance impact I created the included JMH microbenchmark, which uses the RepeatCompilation command to repeat the compilation of a few methods: one trivial (`trivialMath`), one "regular" (`mixHashCode`), and one largish (`largeMethod`) with a lot of locals. These are designed to put no stress, some stress and quite a bit of stress on register allocation, respectively:
>> 
>> Baseline:
>> Benchmark                                      Mode  Cnt     Score    Error  Units
>> SimpleRepeatCompilation.largeMethod_baseline     ss   10   168.919 ±  2.839  ms/op
>> SimpleRepeatCompilation.largeMethod_repeat       ss   10  8920.305 ± 40.531  ms/op
>> SimpleRepeatCompilation.largeMethod_repeat_c1    ss   10   153.961 ±  2.762  ms/op
>> SimpleRepeatCompilation.largeMethod_repeat_c2    ss   10  8242.061 ± 71.989  ms/op
>> SimpleRepeatCompilation.mixHashCode_baseline     ss   10    69.526 ±  7.098  ms/op
>> SimpleRepeatCompilation.mixHashCode_repeat       ss   10  6733.627 ± 63.689  ms/op
>> SimpleRepeatCompilation.mixHashCode_repeat_c1    ss   10   316.862 ± 29.682  ms/op
>> SimpleRepeatCompilation.mixHashCode_repeat_c2    ss   10  4544.604 ± 57.439  ms/op
>> SimpleRepeatCompilation.trivialMath_baseline     ss   10    21.757 ±  1.553  ms/op
>> SimpleRepeatCompilation.trivialMath_repeat       ss   10   499.214 ± 35.984  ms/op
>> SimpleRepeatCompilation.trivialMath_repeat_c1    ss   10   100.345 ±  2.168  ms/op
>> SimpleRepeatCompilation.trivialMath_repeat_c2    ss   10   398.528 ±  4.718  ms/op
>> 
>> Patched:
>> Benchmark                                      Mode  Cnt     Score    Error  Units
>> SimpleRepeatCompilation.largeMethod_baseline     ss   10   164.355 ±  3.531  ms/op
>> SimpleRepeatCompilation.largeMethod_repeat       ss   10  8516.033 ± 22.408  ms/op
>> SimpleRepeatCompilation.largeMethod_repeat_c1    ss   10   151.181 ± 12.869  ms/op
>> SimpleRepeatCompilation.largeMethod_repeat_c2    ss   10  7857.373 ± 52.826  ms/op
>> SimpleRepeatCompilation.mixHashCode_baseline     ss   10    65.085 ±  5.643  ms/op
>> SimpleRepeatCompilation.mixHashCode_repeat       ss   10  6601.693 ± 57.898  ms/op
>> SimpleRepeatCompilation.mixHashCode_repeat_c1    ss   10   315.845 ± 27.474  ms/op
>> SimpleRepeatCompilation.mixHashCode_repeat_c2    ss   10  4456.847 ± 30.459  ms/op
>> SimpleRepeatCompilation.trivialMath_baseline     ss   10    21.273 ±  2.115  ms/op
>> SimpleRepeatCompilation.trivialMath_repeat       ss   10   506.873 ± 18.994  ms/op
>> SimpleRepeatCompilation.trivialMath_repeat_c1    ss   10   100.184 ±  3.008  ms/op
>> SimpleRepeatCompilation.trivialMath_repeat_c2    ss   10   397.010 ±  4.531  ms/op
>> 
>> This shows that there's no significant change on `trivialMath`, that `mixHashCode` sees a small improvement (~2%), and that `largeMethod` sees a larger improvement (~4-5%) on the C2 and Tiered tests with compiler repetition.
>> 
>> Testing: tier 1-7 on all Oracle platforms, local testing and verification of linux-x86.
>
> Looks good in general.
> You may want to compare RA times from -XX:+LogCompilation to see a clear difference.

Using +CITime to get a breakdown of a sample run of Regalloc times for largeMethod_repeat_c2, baseline:

    C2 Compile Time:        8.731 s
...
       Regalloc:              4.759 s
         Ctor Chaitin:          0.000 s
         Build IFG (virt):      0.190 s
         Build IFG (phys):      1.523 s
         Compute Liveness:      0.235 s
         Regalloc Split:        0.284 s
         Postalloc Copy Rem:    0.283 s
         Merge multidefs:       0.011 s
         Fixup Spills:          0.012 s
         Compact:               0.005 s
         Coalesce 1:            0.127 s
         Coalesce 2:            0.002 s
         Coalesce 3:            0.747 s
         Cache LRG:             0.005 s
         Simplify:              0.375 s
         Select:                0.423 s
         Other:                 0.536 s

Patched:

    C2 Compile Time:        8.317 s
...
       Regalloc:              4.340 s
         Ctor Chaitin:          0.000 s
         Build IFG (virt):      0.162 s
         Build IFG (phys):      1.344 s
         Compute Liveness:      0.237 s
         Regalloc Split:        0.284 s
         Postalloc Copy Rem:    0.279 s
         Merge multidefs:       0.011 s
         Fixup Spills:          0.012 s
         Compact:               0.004 s
         Coalesce 1:            0.121 s
         Coalesce 2:            0.002 s
         Coalesce 3:            0.680 s
         Cache LRG:             0.005 s
         Simplify:              0.345 s
         Select:                0.362 s
         Other:                 0.490 s
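
To put the two breakdowns side by side, the relative reduction per phase can be computed directly from the numbers above. A minimal sketch; the helper names `reduction_pct` and `print_regalloc_reductions` are made up for this mail and are not part of the patch or the benchmark:

```cpp
#include <cstdio>

// Relative reduction in percent when going from `baseline` to `patched`
// seconds. Hypothetical helper, not part of the patch.
static double reduction_pct(double baseline, double patched) {
  return (baseline - patched) / baseline * 100.0;
}

struct Phase {
  const char* name;
  double baseline;  // seconds, from the baseline +CITime output
  double patched;   // seconds, from the patched +CITime output
};

// Prints the reduction for the phases that moved in the sample run.
static void print_regalloc_reductions() {
  const Phase phases[] = {
    {"Regalloc",         4.759, 4.340},
    {"Build IFG (virt)", 0.190, 0.162},
    {"Build IFG (phys)", 1.523, 1.344},
    {"Coalesce 3",       0.747, 0.680},
    {"Simplify",         0.375, 0.345},
    {"Select",           0.423, 0.362},
  };
  for (const Phase& p : phases) {
    std::printf("%-18s %5.1f%%\n", p.name, reduction_pct(p.baseline, p.patched));
  }
}
```

For this single sample run, Regalloc as a whole comes out at a bit under 9% less time with the patch.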

Timings appear stable from run to run, and there is no significant change in other phases.
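
For readers who have not looked at the patch, the intuition for why wider words help can be sketched with a toy bit mask. This is only an illustration under simplifying assumptions: `ToyMask`, `Mask32` and `MaskPtr` are hypothetical names, and the code is not the actual RegMask or IndexSet implementation.

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>

// Toy fixed-size bit mask parameterized on its storage word type.
// Hypothetical illustration only; not the HotSpot RegMask/IndexSet code.
template <typename Word, std::size_t Bits>
struct ToyMask {
  static constexpr std::size_t kWordBits = sizeof(Word) * CHAR_BIT;
  static constexpr std::size_t kWords = (Bits + kWordBits - 1) / kWordBits;

  Word words[kWords] = {};

  // Set bit i. Word is unsigned, so shifting into the top bit is well defined.
  void insert(std::size_t i) {
    words[i / kWordBits] |= Word(1) << (i % kWordBits);
  }

  // True if the two masks share any bit. The loop runs kWords times, so
  // wider words mean fewer iterations for the same number of bits.
  bool overlaps(const ToyMask& other) const {
    for (std::size_t w = 0; w < kWords; w++) {
      if ((words[w] & other.words[w]) != 0) return true;
    }
    return false;
  }
};

// 256 bits stored in 32-bit words versus pointer-sized words.
using Mask32  = ToyMask<std::uint32_t, 256>;
using MaskPtr = ToyMask<std::uintptr_t, 256>;
```

With 32-bit words, `Mask32::kWords` is 8; on a platform where `uintptr_t` is 64 bits wide, `MaskPtr::kWords` is 4, so a whole-mask operation such as `overlaps` visits half as many words.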

-------------

PR: https://git.openjdk.java.net/jdk/pull/1102