RFR: 8257772: Vectorizing clear memory operation using AVX-512 masked operations [v2]

Tobias Hartmann thartmann at openjdk.java.net
Mon Dec 7 08:43:14 UTC 2020


On Sat, 5 Dec 2020 18:07:28 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Newly allocated memory is initialized either with user-provided values for its fields or by setting it to zero, as Java semantics require (system initialization).
>> 
>> The C2 compiler creates a ClearArray node to perform system initialization. ClearArray takes the number of heap words to initialize, which can be a constant or a non-constant value. For a constant number of heap words below InitArrayShortSize (default value 64 bytes), the compiler currently generates StoreL nodes, which initialize memory at a granularity of 8 bytes.
>> 
>> This patch vectorizes the initializing stores for constant heap-word counts below InitArrayShortSize by emitting a special instruction sequence for the various tail sizes.
>> 
>> In addition, the existing initialization path under UseXMMForObjInit is extended to use masked operations to optimize the tail initialization sequence. If AVX3Threshold is set to 0, the new initialization sequence uses 64-byte ZMM registers.
>> 
>> The following performance numbers were collected with the micro-benchmark included in the patch.
>> 
>> Testing: Tier1-Tier3 tests are clean.
>> 
>> System configuration: Cascade Lake, Intel Xeon Platinum 8280L @ 2.7 GHz, 2 sockets, 28 cores per socket.
>> 
>> Baseline:
>> Benchmark                      Mode  Cnt       Score       Error  Units
>> ClearMemory.testClearMemory3  thrpt   10  212508.522 ± 14071.493  ops/s
>> ClearMemory.testClearMemory4  thrpt   10  189530.643 ± 12882.421  ops/s
>> ClearMemory.testClearMemory5  thrpt   10  167878.803 ± 10307.163  ops/s
>> ClearMemory.testClearMemory6  thrpt   10  152732.184 ±  8740.128  ops/s
>> ClearMemory.testClearMemory7  thrpt   10  132111.536 ±  5493.043  ops/s
>> 
>> With Optimization:
>> 
>> Benchmark                      Mode  Cnt       Score       Error  Units
>> ClearMemory.testClearMemory3  thrpt   10  220378.082 ± 18533.701  ops/s
>> ClearMemory.testClearMemory4  thrpt   10  198023.913 ± 15995.780  ops/s
>> ClearMemory.testClearMemory5  thrpt   10  183476.886 ± 13488.821  ops/s
>> ClearMemory.testClearMemory6  thrpt   10  161710.750 ±  9270.182  ops/s
>> ClearMemory.testClearMemory7  thrpt   10  145059.426 ±  8217.484  ops/s
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
> 
>   8257772: Changes for 32 bit build
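
For readers following the quoted description, the masked tail initialization can be modeled in plain Java. This is only a sketch, not HotSpot code: `MaskedClearModel`, `tailMask`, `maskedZeroStore` and `clear` are hypothetical names, and a `long[]` stands in for heap words, with one mask bit per 8-byte lane of a 64-byte store.

```java
/** Plain Java model of a masked tail clear. Illustration only, not HotSpot code. */
final class MaskedClearModel {
    // Number of 8-byte lanes covered by one 64-byte (ZMM-sized) store.
    static final int LANES = 8;

    /** Returns a mask with the low tailWords bits set, one bit per 8-byte lane. */
    static int tailMask(int tailWords) {
        if (tailWords < 0 || tailWords > LANES) {
            throw new IllegalArgumentException("tailWords out of range: " + tailWords);
        }
        return (1 << tailWords) - 1;
    }

    /** Models one masked 64-byte store of zeros: only lanes whose mask bit is set are written. */
    static void maskedZeroStore(long[] mem, int base, int mask) {
        for (int lane = 0; lane < LANES; lane++) {
            if ((mask & (1 << lane)) != 0) {
                mem[base + lane] = 0L;
            }
        }
    }

    /** Clears words heap words starting at base: full 64-byte stores first, then one masked tail store. */
    static void clear(long[] mem, int base, int words) {
        int done = 0;
        for (; done + LANES <= words; done += LANES) {
            maskedZeroStore(mem, base + done, tailMask(LANES));
        }
        int tail = words - done;
        if (tail > 0) {
            // Lanes past the tail are masked off, so memory beyond the request is untouched.
            maskedZeroStore(mem, base + done, tailMask(tail));
        }
    }
}
```

For example, clearing 11 words in this model issues one full 64-byte store followed by one store with mask `0b111`, so nothing past the eleventh word is written.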

I submitted some quick testing for this, and tests in `compiler/c2/cr6340864/` fail with:
#  Internal Error (workspace/open/src/hotspot/cpu/x86/macroAssembler_x86.cpp:8178), pid=27510, tid=27529
#  assert(MaxVectorSize >= 32) failed: vector length should be >= 32

Current CompileTask:
C2:    259   28    b        java.lang.StringCoding::encodeASCII (158 bytes)

Stack: [0x00007f2d144f8000,0x00007f2d145f9000],  sp=0x00007f2d145f3750,  free space=1005k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x13a326c]  MacroAssembler::fill64_avx(RegisterImpl*, int, XMMRegisterImpl*, bool)+0x11c
V  [libjvm.so+0x13a3415]  MacroAssembler::xmm_clear_mem(RegisterImpl*, RegisterImpl*, RegisterImpl*, XMMRegisterImpl*)+0x195
V  [libjvm.so+0x13a458b]  MacroAssembler::clear_mem(RegisterImpl*, RegisterImpl*, RegisterImpl*, XMMRegisterImpl*, bool)+0x19b
V  [libjvm.so+0x395487]  rep_stosNode::emit(CodeBuffer&, PhaseRegAlloc*) const+0x167
V  [libjvm.so+0x15b79da]  PhaseOutput::scratch_emit_size(Node const*)+0x3fa
V  [libjvm.so+0x15ae88c]  PhaseOutput::shorten_branches(unsigned int*)+0x2ac
V  [libjvm.so+0x15c045a]  PhaseOutput::Output()+0xcda
V  [libjvm.so+0xa0a798]  Compile::Code_Gen()+0x438
V  [libjvm.so+0xa13fe7]  Compile::Compile(ciEnv*, ciMethod*, int, bool, bool, bool, bool, DirectiveSet*)+0x1917
V  [libjvm.so+0x8466ac]  C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x1dc
V  [libjvm.so+0xa24498]  CompileBroker::invoke_compiler_on_method(CompileTask*)+0xe08
V  [libjvm.so+0xa24fe8]  CompileBroker::compiler_thread_loop()+0x5a8
V  [libjvm.so+0x18ae756]  JavaThread::thread_main_inner()+0x256
V  [libjvm.so+0x18b50e0]  Thread::call_run()+0x100
V  [libjvm.so+0x1598346]  thread_native_entry(Thread*)+0x116

Tests are executed with `-XX:CompileThreshold=100 -XX:-TieredCompilation`.
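
To try the flags above on a small standalone case, something like the following could be used. It is a hypothetical exercise class, not one of the failing tests, and it is not claimed to reproduce the assert: whether C2 emits the clearing code here, and which path it takes, depends on the CPU features and VM flags in effect.

```java
// Hypothetical exercise class; not part of the patch or of the failing tests.
// Possible invocation, reusing the flags quoted above:
//   java -XX:CompileThreshold=100 -XX:-TieredCompilation ClearArrayExercise.java
final class ClearArrayExercise {
    // Publishing the array through a static field makes it escape,
    // so the allocation cannot simply be optimized away.
    static long[] sink;

    // Allocates a long[] whose length is not a compile-time constant.
    // Java semantics require every element to read as zero after allocation.
    // The length must be at least 1 because element 0 is written below.
    static long allocateAndSum(int length, long first) {
        long[] a = new long[length];
        sink = a;
        a[0] = first;
        long sum = 0;
        for (long v : a) {
            sum += v;
        }
        return sum;
    }

    // Calls allocateAndSum often enough for the JIT to consider compiling it.
    static long run(int iterations, int length) {
        long total = 0;
        for (int i = 0; i < iterations; i++) {
            total += allocateAndSum(length, i);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(run(100_000, 5));
    }
}
```

Because all elements other than the first must still be zero, the returned sum doubles as a cheap correctness check on the initialization.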

-------------

Changes requested by thartmann (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/1631


More information about the hotspot-dev mailing list