RFR: 8257772: Vectorizing clear memory operation using AVX-512 masked operations

Vladimir Kozlov kvn at openjdk.java.net
Fri Dec 4 18:58:14 UTC 2020


On Fri, 4 Dec 2020 18:28:44 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

> Newly allocated memory is initialized either with user-provided initialization values for its fields or by setting the memory to zero, as required by Java semantics (system initialization).
> 
> The C2 compiler creates a ClearArray node to perform system initialization. ClearArray accepts the number of heap words to be initialized; this number can be a constant or a non-constant value. When the number of heap words is a constant and the total size is less than InitArrayShortSize (64 bytes by default), the compiler currently generates StoreL nodes, which perform the initialization at a granularity of 8 bytes.
> 
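
A rough C++ model of the pre-patch behaviour described above, assuming 8-byte heap words and the 64-byte InitArrayShortSize default: one 8-byte store per word, standing in for the chain of StoreL nodes. The function and constant names are hypothetical; this is an illustration, not HotSpot code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Assumed values for a 64-bit x86 VM (hypothetical constant names).
constexpr std::size_t kBytesPerWord = 8;          // one heap word
constexpr std::size_t kInitArrayShortSize = 64;   // assumed default, in bytes

// Hypothetical model of the scalar sequence: one 8-byte zero store per heap
// word, analogous to one StoreL node per word. Returns the number of stores.
std::size_t clear_words_scalar(void* base, std::size_t words) {
  // Only the small constant-size case (< InitArrayShortSize) is modelled.
  assert(words * kBytesPerWord < kInitArrayShortSize);
  unsigned char* p = static_cast<unsigned char*>(base);
  const std::uint64_t zero = 0;
  for (std::size_t i = 0; i < words; ++i) {
    std::memcpy(p + i * kBytesPerWord, &zero, sizeof zero);  // one "StoreL"
  }
  return words;
}
```

For a 5-word clear this issues five separate 8-byte stores; the patch aims to replace such chains with vector stores.
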
> This patch vectorizes these initializing stores for constant sizes below InitArrayShortSize by emitting special instruction sequences for the various tail sizes.
> 
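
One way to picture the masked tail handling is a portable emulation of an AVX-512 masked store of a zeroed register (for example `vmovdqu64` with an opmask register): an 8-bit mask selects which of the eight 8-byte lanes of a 64-byte ZMM register are written, so a tail of 1 to 7 words can be cleared with a single store. The helper names `tail_mask` and `masked_zero_store` are hypothetical, and the sketch does not claim to match the exact instruction sequences emitted by the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// A 64-byte ZMM register holds eight 8-byte (qword) lanes.
constexpr unsigned kQwordLanes = 8;

// Hypothetical helper: opmask with the low `qwords` bits set,
// selecting the first `qwords` lanes.
std::uint8_t tail_mask(unsigned qwords) {
  assert(qwords <= kQwordLanes);
  return static_cast<std::uint8_t>((1u << qwords) - 1u);
}

// Hypothetical scalar emulation of a masked store of an all-zero vector:
// lanes whose mask bit is set are zeroed, all other lanes are left untouched.
void masked_zero_store(void* base, std::uint8_t mask) {
  unsigned char* p = static_cast<unsigned char*>(base);
  const std::uint64_t zero = 0;
  for (unsigned lane = 0; lane < kQwordLanes; ++lane) {
    if (mask & (1u << lane)) {
      std::memcpy(p + lane * sizeof zero, &zero, sizeof zero);
    }
  }
}
```

With `tail_mask(5)` (0x1F), a 5-word clear becomes a single masked store instead of five 8-byte stores.
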
> In addition, the existing implementation for initialization under UseXMMForObjInit is extended to use masked operations to optimize the tail initialization sequence. If AVX3Threshold is set to 0, the new initialization sequence uses 64-byte ZMM registers.
> 
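
For the UseXMMForObjInit path, a simplified planning model splits the word count into full-width vector stores followed by at most one masked tail store. It assumes 64-byte ZMM stores when AVX3Threshold is 0 and 32-byte YMM stores otherwise; the real code may unroll or order its stores differently, and `ClearPlan` and `plan_clear` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of how a clear of `words` heap words is split.
struct ClearPlan {
  std::size_t vector_bytes;  // 64 (ZMM) or 32 (YMM), per the assumption above
  std::size_t full_stores;   // unmasked full-width vector stores
  std::size_t tail_qwords;   // 8-byte lanes covered by one final masked store
};

// Hypothetical planner: ZMM width only when AVX3Threshold == 0 (assumed).
ClearPlan plan_clear(std::size_t words, int avx3_threshold) {
  assert(avx3_threshold >= 0);  // negative thresholds are not meaningful here
  const std::size_t vector_bytes = (avx3_threshold == 0) ? 64 : 32;
  const std::size_t lanes = vector_bytes / 8;  // 8-byte words per vector
  return ClearPlan{vector_bytes, words / lanes, words % lanes};
}
```

For 13 words this gives one ZMM store plus a 5-lane masked store with AVX3Threshold=0, or three YMM stores plus a 1-lane masked store otherwise. In this model the tail always costs one instruction instead of a short sequence of narrower stores.
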
> The following performance numbers were collected using the micro-benchmark included with the patch.
> 
> Testing: Tier1-Tier3 tests are clean.
> 
> System configuration: Cascade Lake, Intel Xeon Platinum 8280L @ 2.7 GHz, 2 sockets, 28 cores per socket.
> 
> Baseline:
> Benchmark                      Mode  Cnt       Score       Error  Units
> ClearMemory.testClearMemory3  thrpt   10  212508.522 ± 14071.493  ops/s
> ClearMemory.testClearMemory4  thrpt   10  189530.643 ± 12882.421  ops/s
> ClearMemory.testClearMemory5  thrpt   10  167878.803 ± 10307.163  ops/s
> ClearMemory.testClearMemory6  thrpt   10  152732.184 ±  8740.128  ops/s
> ClearMemory.testClearMemory7  thrpt   10  132111.536 ±  5493.043  ops/s
> 
> With optimization:
> Benchmark                      Mode  Cnt       Score       Error  Units
> ClearMemory.testClearMemory3  thrpt   10  220378.082 ± 18533.701  ops/s
> ClearMemory.testClearMemory4  thrpt   10  198023.913 ± 15995.780  ops/s
> ClearMemory.testClearMemory5  thrpt   10  183476.886 ± 13488.821  ops/s
> ClearMemory.testClearMemory6  thrpt   10  161710.750 ±  9270.182  ops/s
> ClearMemory.testClearMemory7  thrpt   10  145059.426 ±  8217.484  ops/s
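
The relative change in throughput can be read off the two tables. The small C++ helper below recomputes it from the reported scores and checks whether the ± error intervals of the baseline and optimized runs overlap. By this arithmetic the gains are roughly +3.7% (testClearMemory3), +4.5% (4), +9.3% (5), +5.9% (6) and +9.8% (7), but the error intervals overlap for every benchmark, so the individual gains should be read with some caution. The names `Row`, `percent_gain` and `error_intervals_overlap` are illustrative only.

```cpp
#include <cassert>
#include <cstddef>

// Scores and ± errors (ops/s) as reported in the Baseline and
// With optimization tables.
struct Row {
  const char* name;
  double base, base_err;
  double opt, opt_err;
};

constexpr Row kRows[] = {
    {"testClearMemory3", 212508.522, 14071.493, 220378.082, 18533.701},
    {"testClearMemory4", 189530.643, 12882.421, 198023.913, 15995.780},
    {"testClearMemory5", 167878.803, 10307.163, 183476.886, 13488.821},
    {"testClearMemory6", 152732.184,  8740.128, 161710.750,  9270.182},
    {"testClearMemory7", 132111.536,  5493.043, 145059.426,  8217.484},
};
constexpr std::size_t kNumRows = sizeof kRows / sizeof kRows[0];

// Relative throughput change in percent: (opt / base - 1) * 100.
double percent_gain(const Row& r) {
  assert(r.base > 0.0);
  return (r.opt / r.base - 1.0) * 100.0;
}

// True if [base - base_err, base + base_err] and
// [opt - opt_err, opt + opt_err] intersect.
bool error_intervals_overlap(const Row& r) {
  return r.base + r.base_err >= r.opt - r.opt_err &&
         r.opt + r.opt_err >= r.base - r.base_err;
}
```
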

src/hotspot/cpu/x86/x86_64.ad line 10856:

> 10854: %}
> 10855: 
> 10856: instruct rep_stos_im(immL cnt, rRegP base, regD tmp, rRegI zero, Universe dummy, rFlagsReg cr)

What about x86_32.ad?

-------------

PR: https://git.openjdk.java.net/jdk/pull/1631


More information about the hotspot-compiler-dev mailing list