RFR: 8257772: Vectorizing clear memory operation using AVX-512 masked operations
Vladimir Kozlov
kvn at openjdk.java.net
Fri Dec 4 18:58:14 UTC 2020
On Fri, 4 Dec 2020 18:28:44 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
> Newly allocated memory is initialized either with user-provided values for its fields or by zeroing it, as Java semantics require (system initialization).
>
> The C2 compiler creates a ClearArray node to perform system initialization. ClearArray takes the number of heap words to initialize, which can be a constant or a non-constant value. For a constant number of heap words smaller than InitArrayShortSize (default value 64 bytes), the compiler currently generates StoreL nodes, which initialize memory at a granularity of 8 bytes.
>
> This patch vectorizes the initializing stores for constant sizes smaller than InitArrayShortSize by emitting a special instruction sequence for each tail size.
>
> In addition, the existing initialization path under UseXMMForObjInit is extended to use masked operations to optimize the tail initialization sequence. If AVX3Threshold is set to 0, the new initialization sequence uses 64-byte ZMM registers.
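
The masked-tail idea described above can be modelled in plain Java. This is only a sketch under stated assumptions, not the instruction sequence the patch emits: the class and method names (TailMaskSketch, tailMask, clearWords) are hypothetical, a heap word is assumed to be 8 bytes, and a 64-byte ZMM store is simulated as a loop over 8 lanes with one mask bit per lane.

```java
import java.util.Arrays;

public class TailMaskSketch {
    // Assumption: a heap word is 8 bytes, so a 64-byte ZMM register covers 8 lanes.
    static final int ZMM_LANES = 8;

    // Number of stores that fill every lane of the vector.
    static int fullVectorStores(int wordCount, int lanes) {
        return wordCount / lanes;
    }

    // Mask with one bit set per leftover heap word (bit i enables lane i).
    static int tailMask(int wordCount, int lanes) {
        int tailWords = wordCount % lanes;
        return (1 << tailWords) - 1;
    }

    // Zero the first wordCount words: full-width stores, then one masked store.
    static void clearWords(long[] heap, int wordCount, int lanes) {
        int offset = 0;
        for (int v = 0; v < fullVectorStores(wordCount, lanes); v++) {
            for (int lane = 0; lane < lanes; lane++) {
                heap[offset + lane] = 0L;      // models one unmasked vector store
            }
            offset += lanes;
        }
        int mask = tailMask(wordCount, lanes);
        for (int lane = 0; lane < lanes; lane++) {
            if ((mask & (1 << lane)) != 0) {
                heap[offset + lane] = 0L;      // models one masked vector store
            }
        }
    }

    public static void main(String[] args) {
        long[] heap = new long[16];
        Arrays.fill(heap, -1L);                // pretend the memory holds garbage
        clearWords(heap, 11, ZMM_LANES);       // 11 words = 1 full store + 3-lane tail
        System.out.println(Integer.toBinaryString(tailMask(11, ZMM_LANES)));
        System.out.println(Arrays.toString(heap));
    }
}
```

The point of the mask is that the tail needs a single store whatever its length, instead of a chain of progressively narrower stores; words beyond the requested count are left untouched because their mask bits are clear.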
>
> The following performance numbers were collected with the micro-benchmark included in the patch.
>
> Testing: Tier1-Tier3 tests are clean.
>
> System configuration: Cascade Lake, Intel Xeon Platinum 8280L @ 2.7 GHz, 2 sockets, 28 cores per socket.
>
> Baseline:
> Benchmark                      Mode  Cnt       Score       Error  Units
> ClearMemory.testClearMemory3  thrpt   10  212508.522 ± 14071.493  ops/s
> ClearMemory.testClearMemory4  thrpt   10  189530.643 ± 12882.421  ops/s
> ClearMemory.testClearMemory5  thrpt   10  167878.803 ± 10307.163  ops/s
> ClearMemory.testClearMemory6  thrpt   10  152732.184 ±  8740.128  ops/s
> ClearMemory.testClearMemory7  thrpt   10  132111.536 ±  5493.043  ops/s
>
> With Optimization:
>
> Benchmark                      Mode  Cnt       Score       Error  Units
> ClearMemory.testClearMemory3  thrpt   10  220378.082 ± 18533.701  ops/s
> ClearMemory.testClearMemory4  thrpt   10  198023.913 ± 15995.780  ops/s
> ClearMemory.testClearMemory5  thrpt   10  183476.886 ± 13488.821  ops/s
> ClearMemory.testClearMemory6  thrpt   10  161710.750 ±  9270.182  ops/s
> ClearMemory.testClearMemory7  thrpt   10  145059.426 ±  8217.484  ops/s
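
For reading the two tables side by side, a small Java sketch can compute the relative throughput gain per benchmark and check whether the reported score ± error ranges overlap. The scores and errors are copied from the tables above; the class and method names (ClearMemorySpeedup, gainPercent, rangesOverlap) are hypothetical and not part of the patch or its micro-benchmark.

```java
public class ClearMemorySpeedup {
    static final String[] NAMES = {
        "testClearMemory3", "testClearMemory4", "testClearMemory5",
        "testClearMemory6", "testClearMemory7"
    };
    // Scores and errors in ops/s, copied from the quoted JMH output.
    static final double[] BASE_SCORE = {212508.522, 189530.643, 167878.803, 152732.184, 132111.536};
    static final double[] BASE_ERROR = {14071.493, 12882.421, 10307.163, 8740.128, 5493.043};
    static final double[] OPT_SCORE  = {220378.082, 198023.913, 183476.886, 161710.750, 145059.426};
    static final double[] OPT_ERROR  = {18533.701, 15995.780, 13488.821, 9270.182, 8217.484};

    // Relative throughput gain of the optimized run over the baseline, in percent.
    static double gainPercent(int i) {
        return (OPT_SCORE[i] / BASE_SCORE[i] - 1.0) * 100.0;
    }

    // True when [score - error, score + error] of both runs intersect.
    static boolean rangesOverlap(int i) {
        double baseHigh = BASE_SCORE[i] + BASE_ERROR[i];
        double optLow = OPT_SCORE[i] - OPT_ERROR[i];
        return baseHigh >= optLow;
    }

    public static void main(String[] args) {
        for (int i = 0; i < NAMES.length; i++) {
            System.out.printf("%s: %+.1f%% (ranges overlap: %b)%n",
                    NAMES[i], gainPercent(i), rangesOverlap(i));
        }
    }
}
```

On these numbers the gains come out between roughly 4% and 10%, and the score ± error ranges of baseline and optimized runs overlap for every benchmark, so the improvement is best read as indicative.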
src/hotspot/cpu/x86/x86_64.ad line 10856:
> 10854: %}
> 10855:
> 10856: instruct rep_stos_im(immL cnt, rRegP base, regD tmp, rRegI zero, Universe dummy, rFlagsReg cr)
What about x86_32.ad?
-------------
PR: https://git.openjdk.java.net/jdk/pull/1631