RFR: 8257772: Vectorizing clear memory operation using AVX-512 masked operations
Jatin Bhateja
jbhateja at openjdk.java.net
Fri Dec 4 18:34:24 UTC 2020
Newly allocated memory is initialized either with user-provided initialization values for individual fields or by setting the memory to zero, as required by Java semantics (system initialization).
The C2 compiler creates a ClearArray node to perform system initialization. ClearArray accepts the number of heap words to be initialized; this number can be a constant or a non-constant value. For a constant number of heap words below InitArrayShortSize (default value 64 bytes), the compiler currently generates StoreL nodes, which perform the initialization at a granularity of 8 bytes.
This patch vectorizes the initializing stores for a constant number of heap words below InitArrayShortSize by emitting special instruction sequences for the various tail sizes.
In addition, the existing initialization implementation under UseXMMForObjInit is extended to use masked operations to optimize the tail initialization sequence. If AVX3Threshold is set to 0, the new initialization sequence uses 64-byte ZMM registers.
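To illustrate the idea of a masked tail store, here is a minimal Java sketch. It is not code from the patch, which emits machine instructions from C2. The class and method names are hypothetical, and it assumes 8-byte heap words (a 64-bit JVM) and one opmask bit per 8-byte lane. It only computes how many full-width vector stores and which tail mask would cover a given number of heap words.

```java
public final class ClearPlanSketch {
    // Assumption: 64-bit JVM, so one heap word is 8 bytes.
    static final int HEAP_WORD_BYTES = 8;
    // Width of a ZMM register in bytes.
    static final int ZMM_BYTES = 64;

    /** Number of full-width vector stores needed to cover heapWords words. */
    static int fullVectorStores(int heapWords, int vectorBytes) {
        int lanes = vectorBytes / HEAP_WORD_BYTES; // 8-byte lanes per vector
        return heapWords / lanes;
    }

    /**
     * Mask for the remaining words: one low-order bit set per leftover
     * 8-byte lane. A result of 0 means no masked tail store is needed.
     */
    static long tailMask(int heapWords, int vectorBytes) {
        int lanes = vectorBytes / HEAP_WORD_BYTES;
        int tail = heapWords % lanes;
        return (1L << tail) - 1;
    }

    public static void main(String[] args) {
        // Constant sizes below 64 bytes correspond to 1..7 heap words.
        for (int words = 1; words <= 7; words++) {
            System.out.printf(
                "words=%d  8-byte stores=%d  full ZMM stores=%d  tail mask=0b%s%n",
                words,
                words, // one 8-byte store per heap word in the scalar scheme
                fullVectorStores(words, ZMM_BYTES),
                Long.toBinaryString(tailMask(words, ZMM_BYTES)));
        }
    }
}
```

Under these assumptions, every constant size below 64 bytes needs no full ZMM store and a single masked store, whereas the scalar scheme needs one 8-byte store per heap word.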
The following performance numbers were collected with the micro-benchmark included in the patch.
Testing: Tier1-Tier3 tests are clean.
System Configuration: Cascade Lake, Intel Xeon Platinum 8280L @ 2.7 GHz, 2 sockets, 28 cores per socket.
Baseline:
Benchmark                      Mode  Cnt       Score       Error  Units
ClearMemory.testClearMemory3  thrpt   10  212508.522 ± 14071.493  ops/s
ClearMemory.testClearMemory4  thrpt   10  189530.643 ± 12882.421  ops/s
ClearMemory.testClearMemory5  thrpt   10  167878.803 ± 10307.163  ops/s
ClearMemory.testClearMemory6  thrpt   10  152732.184 ±  8740.128  ops/s
ClearMemory.testClearMemory7  thrpt   10  132111.536 ±  5493.043  ops/s
With Optimization:
Benchmark                      Mode  Cnt       Score       Error  Units
ClearMemory.testClearMemory3  thrpt   10  220378.082 ± 18533.701  ops/s
ClearMemory.testClearMemory4  thrpt   10  198023.913 ± 15995.780  ops/s
ClearMemory.testClearMemory5  thrpt   10  183476.886 ± 13488.821  ops/s
ClearMemory.testClearMemory6  thrpt   10  161710.750 ±  9270.182  ops/s
ClearMemory.testClearMemory7  thrpt   10  145059.426 ±  8217.484  ops/s
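To make the two sets of results easier to compare, the relative throughput change can be computed from the reported scores. The sketch below uses only the numbers listed above. The class and method names are hypothetical, and the calculation ignores the reported error margins.

```java
public final class ClearMemorySpeedup {
    // Benchmark names and scores (ops/s) copied from the results above.
    static final String[] NAMES = {
        "testClearMemory3", "testClearMemory4", "testClearMemory5",
        "testClearMemory6", "testClearMemory7"
    };
    static final double[] BASELINE = {
        212508.522, 189530.643, 167878.803, 152732.184, 132111.536
    };
    static final double[] OPTIMIZED = {
        220378.082, 198023.913, 183476.886, 161710.750, 145059.426
    };

    /** Relative throughput gain of optimized over baseline, in percent. */
    static double relativeGainPercent(double baseline, double optimized) {
        return (optimized / baseline - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        for (int i = 0; i < NAMES.length; i++) {
            System.out.printf("%s: %+.1f%%%n",
                NAMES[i], relativeGainPercent(BASELINE[i], OPTIMIZED[i]));
        }
    }
}
```

For these scores the gains come out at roughly 4% to 10%.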
-------------
Commit messages:
- 8257772: Vectorizing clear memory operation using AVX-512 masked operations
Changes: https://git.openjdk.java.net/jdk/pull/1631/files
Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=1631&range=00
Issue: https://bugs.openjdk.java.net/browse/JDK-8257772
Stats: 349 lines in 7 files changed: 321 ins; 3 del; 25 mod
Patch: https://git.openjdk.java.net/jdk/pull/1631.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/1631/head:pull/1631
PR: https://git.openjdk.java.net/jdk/pull/1631