RFR: 8270947: AArch64: C1: use zero_words to initialize all objects [v3]

Nick Gasson ngasson at openjdk.java.net
Fri Jul 30 07:31:31 UTC 2021


On Thu, 29 Jul 2021 16:18:58 GMT, Andrew Haley <aph at openjdk.org> wrote:

>> C1 has its own code generators for zeroing words. We should use the same logic for C1 and C2, which should give us better C1 performance and result in less code to maintain.
>> 
>> This is one of those patches that's a great joy to write, because it consists mainly of deletions. The code I've added is mostly adapters to allow the C1 code to use the memory-zeroing logic written originally for C2. This means we have less code, but also that VM configuration options (e.g. `BlockZeroingLowLimit`) work with C1 and C2 in the same way.
>> 
>> Measuring the performance of memory allocation is quite tricky, so I've written a JMH test case that measures the raw allocation rate of the JVM for various object sizes. This is inevitably rather noisy because it combines the effects of both the allocation code and other GC-related pauses. Nonetheless, it's a useful sanity check.
>> 
>> The performance differences between old and new are mostly in the noise, but with large allocations the advantage of `DC ZVA` becomes apparent:
>> 
>> old:
>> 
>> RawAllocationRate.arrayTest_C1       8192  thrpt    5  11220.314 ± 336.878  ops/s
>> RawAllocationRate.arrayTest_C1      16384  thrpt    5  16655.815 ±  88.577  ops/s
>> RawAllocationRate.arrayTest_C1      65536  thrpt    5  28302.661 ± 155.513  ops/s
>> RawAllocationRate.arrayTest_C1     131072  thrpt    5  31434.868 ± 211.768  ops/s
>> 
>> new:
>> 
>> RawAllocationRate.arrayTest_C1       8192  thrpt    5  13677.987 ± 143.048  ops/s
>> RawAllocationRate.arrayTest_C1      16384  thrpt    5  19517.416 ± 155.004  ops/s
>> RawAllocationRate.arrayTest_C1      65536  thrpt    5  37348.536 ± 307.582  ops/s
>> RawAllocationRate.arrayTest_C1     131072  thrpt    5  43414.399 ±  58.317  ops/s
>> 
>> 
>> Full test results, Graviton 2 (i.e. Neoverse N1). Units are megabytes per second,
>> object sizes are in bytes:
>> 
>> 
>> old:
>> 
>> Benchmark                          (size)   Mode  Cnt      Score     Error  Units
>> RawAllocationRate.arrayTest            32  thrpt    5   5092.798 ±  20.879  ops/s
>> RawAllocationRate.arrayTest            64  thrpt    5   9821.608 ±   6.250  ops/s
>> RawAllocationRate.arrayTest           256  thrpt    5  14117.192 ±  72.720  ops/s
>> RawAllocationRate.arrayTest          1024  thrpt    5   9090.514 ±  40.239  ops/s
>> RawAllocationRate.arrayTest          2048  thrpt    5   9842.503 ±  52.744  ops/s
>> RawAllocationRate.arrayTest          4096  thrpt    5   9866.179 ±   6.332  ops/s
>> RawAllocationRate.arrayTest          8192  thrpt    5  12836.968 ±  14.143  ops/s
>> RawAllocationRate.arrayTest         16384  thrpt    5  18970.307 ±  96.903  ops/s
>> RawAllocationRate.arrayTest         65536  thrpt    5  36709.095 ±  38.256  ops/s
>> RawAllocationRate.arrayTest        131072  thrpt    5  43055.263 ±  60.808  ops/s
>> RawAllocationRate.arrayTest_C1         32  thrpt    5   3045.285 ±  23.128  ops/s
>> RawAllocationRate.arrayTest_C1         64  thrpt    5   5774.157 ±  52.472  ops/s
>> RawAllocationRate.arrayTest_C1        256  thrpt    5   4720.713 ±   9.419  ops/s
>> RawAllocationRate.arrayTest_C1       1024  thrpt    5   7457.880 ± 806.208  ops/s
>> RawAllocationRate.arrayTest_C1       2048  thrpt    5   8155.046 ± 194.153  ops/s
>> RawAllocationRate.arrayTest_C1       4096  thrpt    5   8364.379 ± 127.661  ops/s
>> RawAllocationRate.arrayTest_C1       8192  thrpt    5  11220.314 ± 336.878  ops/s
>> RawAllocationRate.arrayTest_C1      16384  thrpt    5  16655.815 ±  88.577  ops/s
>> RawAllocationRate.arrayTest_C1      65536  thrpt    5  28302.661 ± 155.513  ops/s
>> RawAllocationRate.arrayTest_C1     131072  thrpt    5  31434.868 ± 211.768  ops/s
>> RawAllocationRate.instanceTest         32  thrpt    5   6667.433 ±  50.031  ops/s
>> RawAllocationRate.instanceTest         64  thrpt    5  10669.876 ±  72.109  ops/s
>> RawAllocationRate.instanceTest        256  thrpt    5   5483.582 ± 336.743  ops/s
>> RawAllocationRate.instanceTest       1024  thrpt    5   9740.872 ±   6.269  ops/s
>> RawAllocationRate.instanceTest       2048  thrpt    5   9868.685 ±  51.939  ops/s
>> RawAllocationRate.instanceTest       4096  thrpt    5   9881.944 ±  46.306  ops/s
>> RawAllocationRate.instanceTest       8192  thrpt    5  13524.791 ±  69.250  ops/s
>> RawAllocationRate.instanceTest      16384  thrpt    5  19560.774 ± 109.518  ops/s
>> RawAllocationRate.instanceTest      65536  thrpt    5  37510.256 ±  15.586  ops/s
>> RawAllocationRate.instanceTest     131072  thrpt    5  43361.887 ± 181.294  ops/s
>> RawAllocationRate.instanceTest_C1      32  thrpt    5   2851.135 ±  22.891  ops/s
>> RawAllocationRate.instanceTest_C1      64  thrpt    5   5476.183 ±  84.376  ops/s
>> RawAllocationRate.instanceTest_C1     256  thrpt    5   5105.347 ±  35.389  ops/s
>> RawAllocationRate.instanceTest_C1    1024  thrpt    5   7380.805 ±   3.944  ops/s
>> RawAllocationRate.instanceTest_C1    2048  thrpt    5   8963.428 ±  83.857  ops/s
>> RawAllocationRate.instanceTest_C1    4096  thrpt    5   9257.715 ±  52.647  ops/s
>> RawAllocationRate.instanceTest_C1    8192  thrpt    5  11655.359 ±  70.209  ops/s
>> RawAllocationRate.instanceTest_C1   16384  thrpt    5  17084.813 ±  91.150  ops/s
>> RawAllocationRate.instanceTest_C1   65536  thrpt    5  28682.783 ± 176.563  ops/s
>> RawAllocationRate.instanceTest_C1  131072  thrpt    5  31268.318 ± 221.486  ops/s
>> 
>> new:
>> 
>> Benchmark                          (size)   Mode  Cnt      Score     Error  Units
>> RawAllocationRate.arrayTest            32  thrpt    5   5355.477 ±  43.045  ops/s
>> RawAllocationRate.arrayTest            64  thrpt    5   9825.067 ±  55.493  ops/s
>> RawAllocationRate.arrayTest           256  thrpt    5  13984.865 ± 125.125  ops/s
>> RawAllocationRate.arrayTest          1024  thrpt    5   9025.380 ±  48.921  ops/s
>> RawAllocationRate.arrayTest          2048  thrpt    5   9844.463 ±   6.780  ops/s
>> RawAllocationRate.arrayTest          4096  thrpt    5   9866.566 ±  48.659  ops/s
>> RawAllocationRate.arrayTest          8192  thrpt    5  12753.622 ±  67.211  ops/s
>> RawAllocationRate.arrayTest         16384  thrpt    5  18890.419 ±  14.152  ops/s
>> RawAllocationRate.arrayTest         65536  thrpt    5  37322.124 ± 269.352  ops/s
>> RawAllocationRate.arrayTest        131072  thrpt    5  43017.952 ± 204.057  ops/s
>> RawAllocationRate.arrayTest_C1         32  thrpt    5   3102.221 ±  13.811  ops/s
>> RawAllocationRate.arrayTest_C1         64  thrpt    5   5947.419 ±  36.408  ops/s
>> RawAllocationRate.arrayTest_C1        256  thrpt    5   5124.479 ± 548.617  ops/s
>> RawAllocationRate.arrayTest_C1       1024  thrpt    5   9459.376 ± 716.317  ops/s
>> RawAllocationRate.arrayTest_C1       2048  thrpt    5   9840.594 ±  15.922  ops/s
>> RawAllocationRate.arrayTest_C1       4096  thrpt    5   9860.274 ±  56.088  ops/s
>> RawAllocationRate.arrayTest_C1       8192  thrpt    5  13677.987 ± 143.048  ops/s
>> RawAllocationRate.arrayTest_C1      16384  thrpt    5  19517.416 ± 155.004  ops/s
>> RawAllocationRate.arrayTest_C1      65536  thrpt    5  37348.536 ± 307.582  ops/s
>> RawAllocationRate.arrayTest_C1     131072  thrpt    5  43414.399 ±  58.317  ops/s
>> RawAllocationRate.instanceTest         32  thrpt    5   6620.452 ± 137.048  ops/s
>> RawAllocationRate.instanceTest         64  thrpt    5   9850.677 ±   6.417  ops/s
>> RawAllocationRate.instanceTest        256  thrpt    5   5533.512 ± 129.334  ops/s
>> RawAllocationRate.instanceTest       1024  thrpt    5   9829.806 ±   7.555  ops/s
>> RawAllocationRate.instanceTest       2048  thrpt    5   9857.707 ±  51.541  ops/s
>> RawAllocationRate.instanceTest       4096  thrpt    5   9957.300 ±   7.115  ops/s
>> RawAllocationRate.instanceTest       8192  thrpt    5  13662.581 ±  85.225  ops/s
>> RawAllocationRate.instanceTest      16384  thrpt    5  19571.796 ± 120.962  ops/s
>> RawAllocationRate.instanceTest      65536  thrpt    5  37401.527 ±  67.260  ops/s
>> RawAllocationRate.instanceTest     131072  thrpt    5  43327.339 ±  35.077  ops/s
>> RawAllocationRate.instanceTest_C1      32  thrpt    5   2842.031 ±  47.924  ops/s
>> RawAllocationRate.instanceTest_C1      64  thrpt    5   5359.357 ±  53.031  ops/s
>> RawAllocationRate.instanceTest_C1     256  thrpt    5   5081.287 ±  57.737  ops/s
>> RawAllocationRate.instanceTest_C1    1024  thrpt    5   8372.330 ± 267.016  ops/s
>> RawAllocationRate.instanceTest_C1    2048  thrpt    5   9470.224 ± 250.706  ops/s
>> RawAllocationRate.instanceTest_C1    4096  thrpt    5   9843.936 ±  52.825  ops/s
>> RawAllocationRate.instanceTest_C1    8192  thrpt    5  13695.863 ±  80.433  ops/s
>> RawAllocationRate.instanceTest_C1   16384  thrpt    5  19495.110 ± 116.300  ops/s
>> RawAllocationRate.instanceTest_C1   65536  thrpt    5  37448.948 ± 291.917  ops/s
>> RawAllocationRate.instanceTest_C1  131072  thrpt    5  43443.406 ± 267.236  ops/s
>
> Andrew Haley has updated the pull request incrementally with two additional commits since the last revision:
> 
>  - Tidy up register temps in C1 stubs that call initialize_body()
>  - Don't use a trampoline call to zero_blocks in C1 compiles

Looks good to me, and I've tested tier1 with -XX:TieredStopAtLevel=1, although you probably ought to update the copyright year in c1_MacroAssembler_aarch64.hpp.

-------------

Marked as reviewed by ngasson (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/4919

