RFR: 8257772: Vectorizing clear memory operation using AVX-512 masked operations [v4]
Tobias Hartmann
thartmann at openjdk.java.net
Thu Dec 10 07:47:40 UTC 2020
On Tue, 8 Dec 2020 18:19:24 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
>> Newly allocated memory is initialized either with user-provided initialization values for its fields or by setting the memory to zero, as required by Java semantics (system initialization).
>>
>> The C2 compiler creates a ClearArray node to perform system initialization. ClearArray accepts the number of heap words to be initialized; this number can be a constant or a non-constant value. For a constant number of heap words smaller than InitArrayShortSize (default value 64 bytes), the compiler currently generates StoreL nodes, which perform the initialization at a granularity of 8 bytes.
>>
>> This patch vectorizes the initializing stores for constant-sized clears smaller than InitArrayShortSize by emitting a special instruction sequence for the various tail sizes.
>>
>> In addition, the existing initialization under UseXMMForObjInit is extended to use masked operations to optimize the tail initialization sequence. If AVX3Threshold is set to 0, the new initialization sequence uses 64-byte ZMM registers.
>>
>> The following performance numbers were collected using the micro-benchmark included with the patch.
>>
>> Testing : Tier1-Tier3 level tests are clean.
>>
>> System Configuration : Cascadelake, Intel Xeon Platinum 8280L @ 2.7 GHz, 2 socket, 28 cores per socket.
>>
>> ### Baseline:
>> Benchmark                        Mode  Cnt          Score  Error  Units
>> ClearMemory.testClearMemory16K  thrpt    2    1427741.069         ops/s
>> ClearMemory.testClearMemory1K   thrpt    2   47628368.596         ops/s
>> ClearMemory.testClearMemory1M   thrpt    2      27388.979         ops/s
>> ClearMemory.testClearMemory24B  thrpt    2  167681010.419         ops/s
>> ClearMemory.testClearMemory2K   thrpt    2   22043948.290         ops/s
>> ClearMemory.testClearMemory32B  thrpt    2  168599498.817         ops/s
>> ClearMemory.testClearMemory32K  thrpt    2     775985.067         ops/s
>> ClearMemory.testClearMemory40B  thrpt    2  153375273.800         ops/s
>> ClearMemory.testClearMemory48B  thrpt    2  145328531.804         ops/s
>> ClearMemory.testClearMemory4K   thrpt    2    6492257.452         ops/s
>> ClearMemory.testClearMemory56B  thrpt    2  122376321.652         ops/s
>> ClearMemory.testClearMemory8K   thrpt    2    2857444.413         ops/s
>> ClearMemory.testClearMemory8M   thrpt    2       3461.674         ops/s
>> ### With Optimization:
>> Benchmark                        Mode  Cnt          Score  Error  Units
>> ClearMemory.testClearMemory16K  thrpt    2    2529701.368         ops/s
>> ClearMemory.testClearMemory1K   thrpt    2   50276682.550         ops/s
>> ClearMemory.testClearMemory1M   thrpt    2      27458.588         ops/s
>> ClearMemory.testClearMemory24B  thrpt    2  178751174.642         ops/s
>> ClearMemory.testClearMemory2K   thrpt    2   22574802.694         ops/s
>> ClearMemory.testClearMemory32B  thrpt    2  176630844.950         ops/s
>> ClearMemory.testClearMemory32K  thrpt    2    1297627.181         ops/s
>> ClearMemory.testClearMemory40B  thrpt    2  167469550.653         ops/s
>> ClearMemory.testClearMemory48B  thrpt    2  159391163.006         ops/s
>> ClearMemory.testClearMemory4K   thrpt    2    9045158.643         ops/s
>> ClearMemory.testClearMemory56B  thrpt    2  134550172.421         ops/s
>> ClearMemory.testClearMemory8K   thrpt    2    4581450.664         ops/s
>> ClearMemory.testClearMemory8M   thrpt    2       3446.834         ops/s
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
> 8257772: Changing the default value for UseXMMForObjInit and UseFastStosb flags.
Performance numbers look good. No regression and some nice improvements (up to 13%) for some crypto microbenchmarks.
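To put the quoted ClearMemory scores in relative terms, the optimized-to-baseline ratio can be computed per size. The following is a minimal sketch; the class and method names are hypothetical, and only a subset of the sizes from the quoted tables is included. Since each benchmark has Cnt 2 and no error column, the ratios should be read as rough indications only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the patch: computes optimized/baseline
// throughput ratios from the scores quoted in the PR description.
final class ClearMemorySpeedup {
    // Scores in ops/s, copied from the quoted tables: {baseline, optimized}.
    static final Map<String, double[]> SCORES = new LinkedHashMap<>();
    static {
        SCORES.put("24B", new double[] {167681010.419, 178751174.642});
        SCORES.put("56B", new double[] {122376321.652, 134550172.421});
        SCORES.put("4K",  new double[] {6492257.452, 9045158.643});
        SCORES.put("16K", new double[] {1427741.069, 2529701.368});
        SCORES.put("32K", new double[] {775985.067, 1297627.181});
        SCORES.put("8M",  new double[] {3461.674, 3446.834});
    }

    // Ratio above 1.0 means the optimized build had higher throughput.
    static double speedup(String size) {
        double[] s = SCORES.get(size);
        return s[1] / s[0];
    }

    // Prints one line per size, for example "16K: 1.77x".
    static void printAll() {
        for (String size : SCORES.keySet()) {
            System.out.printf("%s: %.2fx%n", size, speedup(size));
        }
    }
}
```

Calling `ClearMemorySpeedup.printAll()` lists the ratios. The 16K case comes out at about 1.77x, while the 8M case is marginally below 1.0, which is within what a two-iteration run could plausibly produce as noise.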
Code looks reasonable to me but assembly snippets are hard to review. I've added some style comments.
@shipilev who implemented [JDK-8146801](https://bugs.openjdk.java.net/browse/JDK-8146801) might also want to have a look.
For stability, this should wait until JDK 17 repos are forked today.
src/hotspot/cpu/x86/macroAssembler_x86.cpp line 4949:
> 4947: BIND(L_loop);
> 4948: if (MaxVectorSize >= 32) {
> 4949: fill64_avx(base,  0, xtmp, use64byteVector);
Extra whitespace after `base, `
test/micro/org/openjdk/bench/vm/compiler/ClearMemory.java line 45:
> 43: public class ClearMemory {
> 44: class Payload8 {
> 45: public long f0;
For Java code, we use four-space indentation.
src/hotspot/cpu/x86/x86_64.ad line 10860:
> 10858: predicate(!((ClearArrayNode*)n)->is_large() && n->in(2)->bottom_type()->is_long()->is_con());
> 10859: match(Set dummy (ClearArray cnt base));
> 10860: effect(TEMP tmp,TEMP zero, KILL cr);
-> `effect(TEMP tmp, TEMP zero, KILL cr);`
src/hotspot/cpu/x86/x86_32.ad line 11553:
> 11551: predicate(!((ClearArrayNode*)n)->is_large() && n->in(2)->bottom_type()->is_int()->is_con());
> 11552: match(Set dummy (ClearArray cnt base));
> 11553: effect(TEMP tmp,TEMP zero, KILL cr);
-> `effect(TEMP tmp, TEMP zero, KILL cr);`
src/hotspot/cpu/x86/macroAssembler_x86_arrayCopy_avx3.cpp line 251:
> 249: }
> 250:
> 251:
Remove extra newline.
src/hotspot/cpu/x86/macroAssembler_x86.hpp line 1854:
> 1852: bool use64byteVector = false);
> 1853:
> 1854:
Remove extra newline.
src/hotspot/cpu/x86/macroAssembler_x86.cpp line 8184:
> 8182: }
> 8183:
> 8184:
One newline between methods is sufficient (same at other places).
src/hotspot/cpu/x86/macroAssembler_x86.cpp line 5028:
> 5026: }
> 5027: break;
> 5028: case 6:
Please order case statements by increasing value.
src/hotspot/cpu/x86/macroAssembler_x86.cpp line 4937:
> 4935: // base - start address, qword aligned.
> 4936: Label L_zero_64_bytes, L_loop, L_sloop, L_tail, L_end;
> 4937: bool use64byteVector = MaxVectorSize == 64 && AVX3Threshold == 0;
The comment for `AVX3Threshold` says:
"Minimum array size in bytes to use AVX512 intrinsics"
"for copy, inflate and fill. When this value is set as zero"
"compare operations can also use AVX512 intrinsics.")
Should we mention clear memory there as well?
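For readers less familiar with the flags involved, here is a rough Java model of the quoted `use64byteVector` condition and of a masked tail store. It is only a sketch under stated assumptions: the class and method names are hypothetical, lanes are modeled as 8-byte `long` elements, and the real instruction sequence emitted by the patch may differ.

```java
// Hypothetical model for illustration only; none of these names come from HotSpot.
final class MaskedClearSketch {
    // Mirrors the quoted condition: 64-byte vectors are used only when
    // MaxVectorSize is 64 and AVX3Threshold is 0.
    static boolean use64ByteVector(int maxVectorSize, int avx3Threshold) {
        return maxVectorSize == 64 && avx3Threshold == 0;
    }

    // Builds a lane mask with the low tailQwords bits set;
    // bit i set means "store to lane i".
    static long tailMask(int tailQwords) {
        return (1L << tailQwords) - 1;
    }

    // Zeroes qwords[0..count) with full-width stores followed by one
    // masked store for the remaining tail (lanes = qwords per vector).
    static void clear(long[] qwords, int count, int lanes) {
        int i = 0;
        for (; i + lanes <= count; i += lanes) {
            for (int lane = 0; lane < lanes; lane++) {
                qwords[i + lane] = 0L;          // models one full vector store
            }
        }
        long mask = tailMask(count - i);        // fewer than 'lanes' qwords remain
        for (int lane = 0; lane < lanes; lane++) {
            if ((mask & (1L << lane)) != 0) {
                qwords[i + lane] = 0L;          // models one masked vector store
            }
        }
    }
}
```

With `lanes` set to 8 the model corresponds to a 64-byte vector, and with 4 to a 32-byte vector. The point of the mask is that the tail needs a single store rather than a scalar loop, and lanes beyond `count` are never touched.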
-------------
Changes requested by thartmann (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/1631