RFR: 8270947: AArch64: C1: use zero_words to initialize all objects [v3]
Nick Gasson
ngasson at openjdk.java.net
Fri Jul 30 07:31:31 UTC 2021
On Thu, 29 Jul 2021 16:18:58 GMT, Andrew Haley <aph at openjdk.org> wrote:
>> C1 has its own code generators for zeroing words. We should use the same logic for C1 and C2, which should give us better C1 performance and result in less code to maintain.
>>
>> This is one of those patches that's a great joy to write, because it consists mainly of deletions. The code I've added is mostly adapters to allow the C1 code to use the memory-zeroing logic written originally for C2. This means we have less code, but also that VM configuration options (e.g. `BlockZeroingLowLimit`) work with C1 and C2 in the same way.
>>
>> Measuring the performance of memory allocation is quite tricky, so I've written a JMH test case that measures the raw allocation rate of the JVM for various object sizes. This is inevitably rather noisy because it combines the effects of both the allocation code and other GC-related pauses. Nonetheless, it's a useful sanity check.
>>
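For readers without the JMH test to hand, the kind of measurement described above can be approximated with a plain timing loop. This is only a rough sketch and not the test case from the PR: the class name `AllocRateSketch`, the `measure` helper, the batch size of 1000 and the chosen sizes are all hypothetical, and it inherits the same noise from GC pauses.

```java
// Hypothetical stand-alone sketch; NOT the JMH test case from the PR.
class AllocRateSketch {
    // Volatile sink so the JIT cannot prove the allocation dead and remove it.
    static volatile Object sink;

    // Allocates byte arrays with `size` bytes of payload for about `millis`
    // milliseconds and returns the observed rate in MB/s (headers ignored).
    static double measure(int size, long millis) {
        long start = System.nanoTime();
        long deadline = start + millis * 1_000_000L;
        long allocatedBytes = 0;
        while (System.nanoTime() < deadline) {
            // Allocate in batches so the clock read does not dominate.
            for (int i = 0; i < 1000; i++) {
                sink = new byte[size];
            }
            allocatedBytes += 1000L * size;
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        return allocatedBytes / (1024.0 * 1024.0) / seconds;
    }

    public static void main(String[] args) {
        int[] sizes = {32, 256, 4096, 65536};
        for (int size : sizes) {
            measure(size, 200); // warm-up so the loop is compiled
            System.out.printf("%6d bytes: %.1f MB/s%n", size, measure(size, 500));
        }
    }
}
```

Running it with `-XX:TieredStopAtLevel=1` should keep the loop in C1-compiled code, which is the case this patch changes.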
>> The performance differences between old and new are mostly in the noise, but with large allocations the advantage of `DC ZVA` becomes apparent:
>>
>> old:
>>
>> RawAllocationRate.arrayTest_C1 8192 thrpt 5 11220.314 ± 336.878 ops/s
>> RawAllocationRate.arrayTest_C1 16384 thrpt 5 16655.815 ± 88.577 ops/s
>> RawAllocationRate.arrayTest_C1 65536 thrpt 5 28302.661 ± 155.513 ops/s
>> RawAllocationRate.arrayTest_C1 131072 thrpt 5 31434.868 ± 211.768 ops/s
>>
>> new:
>>
>> RawAllocationRate.arrayTest_C1 8192 thrpt 5 13677.987 ± 143.048 ops/s
>> RawAllocationRate.arrayTest_C1 16384 thrpt 5 19517.416 ± 155.004 ops/s
>> RawAllocationRate.arrayTest_C1 65536 thrpt 5 37348.536 ± 307.582 ops/s
>> RawAllocationRate.arrayTest_C1 131072 thrpt 5 43414.399 ± 58.317 ops/s
>>
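As a quick check on the size of that advantage, the relative gain can be worked out directly from the eight `arrayTest_C1` scores quoted above. A small sketch (the class and method names are just for illustration):

```java
// Illustrative arithmetic only; the scores are copied from the quoted tables.
class C1ArraySpeedup {
    // Percentage gain of `after` over `before`.
    static double gainPercent(double before, double after) {
        return (after / before - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        int[] sizes = {8192, 16384, 65536, 131072};
        double[] oldScore = {11220.314, 16655.815, 28302.661, 31434.868};
        double[] newScore = {13677.987, 19517.416, 37348.536, 43414.399};
        for (int i = 0; i < sizes.length; i++) {
            System.out.printf("%6d: %+.1f%%%n",
                    sizes[i], gainPercent(oldScore[i], newScore[i]));
        }
    }
}
```

That works out to a gain of roughly 17% to 38% across these four sizes, with the two largest sizes showing the biggest improvement. The error bars in the tables are small compared with these differences.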
>>
>> Full test results, Graviton 2 (i.e. Neoverse N1). Units are megabytes per second,
>> object sizes are in bytes:
>>
>>
>> old:
>>
>> Benchmark (size) Mode Cnt Score Error Units
>> RawAllocationRate.arrayTest 32 thrpt 5 5092.798 ± 20.879 ops/s
>> RawAllocationRate.arrayTest 64 thrpt 5 9821.608 ± 6.250 ops/s
>> RawAllocationRate.arrayTest 256 thrpt 5 14117.192 ± 72.720 ops/s
>> RawAllocationRate.arrayTest 1024 thrpt 5 9090.514 ± 40.239 ops/s
>> RawAllocationRate.arrayTest 2048 thrpt 5 9842.503 ± 52.744 ops/s
>> RawAllocationRate.arrayTest 4096 thrpt 5 9866.179 ± 6.332 ops/s
>> RawAllocationRate.arrayTest 8192 thrpt 5 12836.968 ± 14.143 ops/s
>> RawAllocationRate.arrayTest 16384 thrpt 5 18970.307 ± 96.903 ops/s
>> RawAllocationRate.arrayTest 65536 thrpt 5 36709.095 ± 38.256 ops/s
>> RawAllocationRate.arrayTest 131072 thrpt 5 43055.263 ± 60.808 ops/s
>> RawAllocationRate.arrayTest_C1 32 thrpt 5 3045.285 ± 23.128 ops/s
>> RawAllocationRate.arrayTest_C1 64 thrpt 5 5774.157 ± 52.472 ops/s
>> RawAllocationRate.arrayTest_C1 256 thrpt 5 4720.713 ± 9.419 ops/s
>> RawAllocationRate.arrayTest_C1 1024 thrpt 5 7457.880 ± 806.208 ops/s
>> RawAllocationRate.arrayTest_C1 2048 thrpt 5 8155.046 ± 194.153 ops/s
>> RawAllocationRate.arrayTest_C1 4096 thrpt 5 8364.379 ± 127.661 ops/s
>> RawAllocationRate.arrayTest_C1 8192 thrpt 5 11220.314 ± 336.878 ops/s
>> RawAllocationRate.arrayTest_C1 16384 thrpt 5 16655.815 ± 88.577 ops/s
>> RawAllocationRate.arrayTest_C1 65536 thrpt 5 28302.661 ± 155.513 ops/s
>> RawAllocationRate.arrayTest_C1 131072 thrpt 5 31434.868 ± 211.768 ops/s
>> RawAllocationRate.instanceTest 32 thrpt 5 6667.433 ± 50.031 ops/s
>> RawAllocationRate.instanceTest 64 thrpt 5 10669.876 ± 72.109 ops/s
>> RawAllocationRate.instanceTest 256 thrpt 5 5483.582 ± 336.743 ops/s
>> RawAllocationRate.instanceTest 1024 thrpt 5 9740.872 ± 6.269 ops/s
>> RawAllocationRate.instanceTest 2048 thrpt 5 9868.685 ± 51.939 ops/s
>> RawAllocationRate.instanceTest 4096 thrpt 5 9881.944 ± 46.306 ops/s
>> RawAllocationRate.instanceTest 8192 thrpt 5 13524.791 ± 69.250 ops/s
>> RawAllocationRate.instanceTest 16384 thrpt 5 19560.774 ± 109.518 ops/s
>> RawAllocationRate.instanceTest 65536 thrpt 5 37510.256 ± 15.586 ops/s
>> RawAllocationRate.instanceTest 131072 thrpt 5 43361.887 ± 181.294 ops/s
>> RawAllocationRate.instanceTest_C1 32 thrpt 5 2851.135 ± 22.891 ops/s
>> RawAllocationRate.instanceTest_C1 64 thrpt 5 5476.183 ± 84.376 ops/s
>> RawAllocationRate.instanceTest_C1 256 thrpt 5 5105.347 ± 35.389 ops/s
>> RawAllocationRate.instanceTest_C1 1024 thrpt 5 7380.805 ± 3.944 ops/s
>> RawAllocationRate.instanceTest_C1 2048 thrpt 5 8963.428 ± 83.857 ops/s
>> RawAllocationRate.instanceTest_C1 4096 thrpt 5 9257.715 ± 52.647 ops/s
>> RawAllocationRate.instanceTest_C1 8192 thrpt 5 11655.359 ± 70.209 ops/s
>> RawAllocationRate.instanceTest_C1 16384 thrpt 5 17084.813 ± 91.150 ops/s
>> RawAllocationRate.instanceTest_C1 65536 thrpt 5 28682.783 ± 176.563 ops/s
>> RawAllocationRate.instanceTest_C1 131072 thrpt 5 31268.318 ± 221.486 ops/s
>>
>> new:
>>
>> Benchmark (size) Mode Cnt Score Error Units
>> RawAllocationRate.arrayTest 32 thrpt 5 5355.477 ± 43.045 ops/s
>> RawAllocationRate.arrayTest 64 thrpt 5 9825.067 ± 55.493 ops/s
>> RawAllocationRate.arrayTest 256 thrpt 5 13984.865 ± 125.125 ops/s
>> RawAllocationRate.arrayTest 1024 thrpt 5 9025.380 ± 48.921 ops/s
>> RawAllocationRate.arrayTest 2048 thrpt 5 9844.463 ± 6.780 ops/s
>> RawAllocationRate.arrayTest 4096 thrpt 5 9866.566 ± 48.659 ops/s
>> RawAllocationRate.arrayTest 8192 thrpt 5 12753.622 ± 67.211 ops/s
>> RawAllocationRate.arrayTest 16384 thrpt 5 18890.419 ± 14.152 ops/s
>> RawAllocationRate.arrayTest 65536 thrpt 5 37322.124 ± 269.352 ops/s
>> RawAllocationRate.arrayTest 131072 thrpt 5 43017.952 ± 204.057 ops/s
>> RawAllocationRate.arrayTest_C1 32 thrpt 5 3102.221 ± 13.811 ops/s
>> RawAllocationRate.arrayTest_C1 64 thrpt 5 5947.419 ± 36.408 ops/s
>> RawAllocationRate.arrayTest_C1 256 thrpt 5 5124.479 ± 548.617 ops/s
>> RawAllocationRate.arrayTest_C1 1024 thrpt 5 9459.376 ± 716.317 ops/s
>> RawAllocationRate.arrayTest_C1 2048 thrpt 5 9840.594 ± 15.922 ops/s
>> RawAllocationRate.arrayTest_C1 4096 thrpt 5 9860.274 ± 56.088 ops/s
>> RawAllocationRate.arrayTest_C1 8192 thrpt 5 13677.987 ± 143.048 ops/s
>> RawAllocationRate.arrayTest_C1 16384 thrpt 5 19517.416 ± 155.004 ops/s
>> RawAllocationRate.arrayTest_C1 65536 thrpt 5 37348.536 ± 307.582 ops/s
>> RawAllocationRate.arrayTest_C1 131072 thrpt 5 43414.399 ± 58.317 ops/s
>> RawAllocationRate.instanceTest 32 thrpt 5 6620.452 ± 137.048 ops/s
>> RawAllocationRate.instanceTest 64 thrpt 5 9850.677 ± 6.417 ops/s
>> RawAllocationRate.instanceTest 256 thrpt 5 5533.512 ± 129.334 ops/s
>> RawAllocationRate.instanceTest 1024 thrpt 5 9829.806 ± 7.555 ops/s
>> RawAllocationRate.instanceTest 2048 thrpt 5 9857.707 ± 51.541 ops/s
>> RawAllocationRate.instanceTest 4096 thrpt 5 9957.300 ± 7.115 ops/s
>> RawAllocationRate.instanceTest 8192 thrpt 5 13662.581 ± 85.225 ops/s
>> RawAllocationRate.instanceTest 16384 thrpt 5 19571.796 ± 120.962 ops/s
>> RawAllocationRate.instanceTest 65536 thrpt 5 37401.527 ± 67.260 ops/s
>> RawAllocationRate.instanceTest 131072 thrpt 5 43327.339 ± 35.077 ops/s
>> RawAllocationRate.instanceTest_C1 32 thrpt 5 2842.031 ± 47.924 ops/s
>> RawAllocationRate.instanceTest_C1 64 thrpt 5 5359.357 ± 53.031 ops/s
>> RawAllocationRate.instanceTest_C1 256 thrpt 5 5081.287 ± 57.737 ops/s
>> RawAllocationRate.instanceTest_C1 1024 thrpt 5 8372.330 ± 267.016 ops/s
>> RawAllocationRate.instanceTest_C1 2048 thrpt 5 9470.224 ± 250.706 ops/s
>> RawAllocationRate.instanceTest_C1 4096 thrpt 5 9843.936 ± 52.825 ops/s
>> RawAllocationRate.instanceTest_C1 8192 thrpt 5 13695.863 ± 80.433 ops/s
>> RawAllocationRate.instanceTest_C1 16384 thrpt 5 19495.110 ± 116.300 ops/s
>> RawAllocationRate.instanceTest_C1 65536 thrpt 5 37448.948 ± 291.917 ops/s
>> RawAllocationRate.instanceTest_C1 131072 thrpt 5 43443.406 ± 267.236 ops/s
>
> Andrew Haley has updated the pull request incrementally with two additional commits since the last revision:
>
> - Tidy up register temps in C1 stubs that call initialize_body()
> - Don't use a trampoline call to zero_blocks in C1 compiles
Looks good to me, and I've tested tier1 with -XX:TieredStopAtLevel=1. You probably ought to update the copyright year in c1_MacroAssembler_aarch64.hpp, though.
-------------
Marked as reviewed by ngasson (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/4919