RFR: 8257772: Vectorizing clear memory operation using AVX-512 masked operations [v4]

Thu Dec 10 07:47:40 UTC 2020

On Tue, 8 Dec 2020 18:19:24 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Newly allocated memory is initialized either with user-provided values for its fields or by zeroing it, as required by Java semantics (system initialization).
>> 
>> The C2 compiler creates a ClearArray node to perform system initialization. ClearArray accepts the number of heap words to initialize, which can be a constant or a non-constant value. For a constant number of heap words below InitArrayShortSize (default value 64 bytes), the compiler currently generates StoreL nodes, which perform the initialization at a granularity of 8 bytes.
>> 
>> This patch vectorizes the initializing stores for constant sizes below InitArrayShortSize by emitting a special instruction sequence for the various tail sizes.
>> 
>> In addition, the existing initialization under UseXMMForObjInit is extended to use masked operations to optimize the tail initialization sequence. If AVX3Threshold is set to 0, the new initialization sequence uses 64-byte ZMM registers.
>> 
>> The following performance numbers were collected using the micro-benchmark included with the patch.
>> 
>> Testing : Tier1-Tier3 level tests are clean.
>> 
>> System Configuration : Cascadelake, Intel Xeon Platinum 8280L @ 2.7 GHz, 2 socket, 28 cores per socket.
>> 
>> ### Baseline:
>> Benchmark                        Mode  Cnt          Score   Error  Units
>> ClearMemory.testClearMemory16K  thrpt    2    1427741.069          ops/s
>> ClearMemory.testClearMemory1K   thrpt    2   47628368.596          ops/s
>> ClearMemory.testClearMemory1M   thrpt    2      27388.979          ops/s
>> ClearMemory.testClearMemory24B  thrpt    2  167681010.419          ops/s
>> ClearMemory.testClearMemory2K   thrpt    2   22043948.290          ops/s
>> ClearMemory.testClearMemory32B  thrpt    2  168599498.817          ops/s
>> ClearMemory.testClearMemory32K  thrpt    2     775985.067          ops/s
>> ClearMemory.testClearMemory40B  thrpt    2  153375273.800          ops/s
>> ClearMemory.testClearMemory48B  thrpt    2  145328531.804          ops/s
>> ClearMemory.testClearMemory4K   thrpt    2    6492257.452          ops/s
>> ClearMemory.testClearMemory56B  thrpt    2  122376321.652          ops/s
>> ClearMemory.testClearMemory8K   thrpt    2    2857444.413          ops/s
>> ClearMemory.testClearMemory8M   thrpt    2       3461.674          ops/s
>> ### With Optimization:
>> Benchmark                        Mode  Cnt          Score   Error  Units
>> ClearMemory.testClearMemory16K  thrpt    2    2529701.368          ops/s
>> ClearMemory.testClearMemory1K   thrpt    2   50276682.550          ops/s
>> ClearMemory.testClearMemory1M   thrpt    2      27458.588          ops/s
>> ClearMemory.testClearMemory24B  thrpt    2  178751174.642          ops/s
>> ClearMemory.testClearMemory2K   thrpt    2   22574802.694          ops/s
>> ClearMemory.testClearMemory32B  thrpt    2  176630844.950          ops/s
>> ClearMemory.testClearMemory32K  thrpt    2    1297627.181          ops/s
>> ClearMemory.testClearMemory40B  thrpt    2  167469550.653          ops/s
>> ClearMemory.testClearMemory48B  thrpt    2  159391163.006          ops/s
>> ClearMemory.testClearMemory4K   thrpt    2    9045158.643          ops/s
>> ClearMemory.testClearMemory56B  thrpt    2  134550172.421          ops/s
>> ClearMemory.testClearMemory8K   thrpt    2    4581450.664          ops/s
>> ClearMemory.testClearMemory8M   thrpt    2       3446.834          ops/s
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
> 
>   8257772: Changing the default value for UseXMMForObjInit and UseFastStosb flags.

Performance numbers look good. No regressions, and some nice improvements (up to 13%) for some crypto microbenchmarks.
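
The 13% figure above refers to separate crypto microbenchmarks. For the ClearMemory tables quoted in the PR description, relative throughput can be derived with a small helper like the one below. This is only a sketch: the class and method names are illustrative and not part of the patch, and the scores are copied from the quoted tables for four of the rows.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: relative throughput for a few rows of the ClearMemory tables
// quoted in the PR description. Class and method names are illustrative.
class ClearMemorySpeedup {
    // Ratio of optimized to baseline throughput (ops/s); > 1.0 means faster.
    static double speedup(double baselineOpsPerSec, double optimizedOpsPerSec) {
        return optimizedOpsPerSec / baselineOpsPerSec;
    }

    public static void main(String[] args) {
        // {baseline, optimized} scores copied from the quoted tables.
        Map<String, double[]> rows = new LinkedHashMap<>();
        rows.put("testClearMemory24B", new double[] {167681010.419, 178751174.642});
        rows.put("testClearMemory4K", new double[] {6492257.452, 9045158.643});
        rows.put("testClearMemory16K", new double[] {1427741.069, 2529701.368});
        rows.put("testClearMemory8M", new double[] {3461.674, 3446.834});
        for (Map.Entry<String, double[]> e : rows.entrySet()) {
            double[] s = e.getValue();
            System.out.printf("%-20s %.2fx%n", e.getKey(), speedup(s[0], s[1]));
        }
    }
}
```

Since the quoted runs use Cnt 2 and report no error column, ratios close to 1.0 (such as the 1M and 8M rows) are probably best read as noise rather than as a gain or a loss.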

The code looks reasonable to me, but the assembly snippets are hard to review. I've added some style comments.
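
As a reading aid for the masked tail handling described in the PR description, here is a conceptual model in plain Java. It is not the HotSpot code: a `long[]` stands in for memory, an `int` bit mask stands in for an AVX-512 opmask register, and all names (`MaskedClearModel`, `tailMask`, `maskedStoreZero`, `clear`) are made up for illustration. It assumes 64-byte vectors, i.e. 8 qword lanes.

```java
// Conceptual model only: plain Java arrays stand in for memory and an int
// bit mask stands in for an AVX-512 opmask register. None of these names
// come from the patch.
class MaskedClearModel {
    static final int LANES = 8; // 8 qwords = one 64-byte vector (assumed width)

    // Mask with the low `tailQwords` bits set, one bit per 8-byte lane.
    static int tailMask(int tailQwords) {
        if (tailQwords < 0 || tailQwords > LANES) {
            throw new IllegalArgumentException("tail must be in [0, " + LANES + "]");
        }
        return (1 << tailQwords) - 1;
    }

    // Models a masked vector store of zeros: only lanes whose mask bit is
    // set are written; masked-off lanes are never touched.
    static void maskedStoreZero(long[] mem, int base, int mask) {
        for (int lane = 0; lane < LANES; lane++) {
            if ((mask & (1 << lane)) != 0) {
                mem[base + lane] = 0L;
            }
        }
    }

    // Clears `count` qwords starting at `base`: full vectors first,
    // then a single masked store for the remaining tail.
    static void clear(long[] mem, int base, int count) {
        int i = 0;
        for (; i + LANES <= count; i += LANES) {
            maskedStoreZero(mem, base + i, tailMask(LANES)); // all lanes active
        }
        int tail = count - i;
        if (tail > 0) {
            maskedStoreZero(mem, base + i, tailMask(tail));
        }
    }
}
```

The point of the mask is that the tail needs one store instead of a scalar loop or a chain of narrower stores, and lanes past the end of the region are left untouched.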

@shipilev who implemented [JDK-8146801](https://bugs.openjdk.java.net/browse/JDK-8146801) might also want to have a look.

For stability, this should wait until JDK 17 repos are forked today.

src/hotspot/cpu/x86/macroAssembler_x86.cpp line 4949:

> 4947:   BIND(L_loop);
> 4948:   if (MaxVectorSize >= 32) {
> 4949:     fill64_avx(base,  0, xtmp, use64byteVector);

Extra whitespace after `base,  `

test/micro/org/openjdk/bench/vm/compiler/ClearMemory.java line 45:

> 43: public class ClearMemory {
> 44:   class Payload8 {
> 45:     public long f0;

For Java code, we use four-space indentation.

src/hotspot/cpu/x86/x86_64.ad line 10860:

> 10858:   predicate(!((ClearArrayNode*)n)->is_large() && n->in(2)->bottom_type()->is_long()->is_con());
> 10859:   match(Set dummy (ClearArray cnt base));
> 10860:   effect(TEMP tmp,TEMP zero,  KILL cr);

-> `effect(TEMP tmp, TEMP zero, KILL cr);`

src/hotspot/cpu/x86/x86_32.ad line 11553:

> 11551:   predicate(!((ClearArrayNode*)n)->is_large() && n->in(2)->bottom_type()->is_int()->is_con());
> 11552:   match(Set dummy (ClearArray cnt base));
> 11553:   effect(TEMP tmp,TEMP zero,  KILL cr);

-> `effect(TEMP tmp, TEMP zero, KILL cr);`

src/hotspot/cpu/x86/macroAssembler_x86_arrayCopy_avx3.cpp line 251:

> 249: }
> 250: 
> 251: 

Remove extra newline.

src/hotspot/cpu/x86/macroAssembler_x86.hpp line 1854:

> 1852:                   bool use64byteVector = false);
> 1853: 
> 1854: 

Remove extra newline.

src/hotspot/cpu/x86/macroAssembler_x86.cpp line 8184:

> 8182: }
> 8183: 
> 8184: 

One newline between methods is sufficient (same in other places).

src/hotspot/cpu/x86/macroAssembler_x86.cpp line 5028:

> 5026:         }
> 5027:         break;
> 5028:       case 6:

Please order case statements by increasing value.

src/hotspot/cpu/x86/macroAssembler_x86.cpp line 4937:

> 4935:   // base - start address, qword aligned.
> 4936:   Label L_zero_64_bytes, L_loop, L_sloop, L_tail, L_end;
> 4937:   bool use64byteVector = MaxVectorSize == 64 && AVX3Threshold == 0;

The comment for `AVX3Threshold` says:
  "Minimum array size in bytes to use AVX512 intrinsics"
  "for copy, inflate and fill. When this value is set as zero"
  "compare operations can also use AVX512 intrinsics.")

Should we mention clear memory there as well?

-------------

Changes requested by thartmann (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/1631