RFR: 8257772: Vectorizing clear memory operation using AVX-512 masked operations [v2]
Tobias Hartmann
thartmann at openjdk.java.net
Mon Dec 7 08:43:14 UTC 2020
On Sat, 5 Dec 2020 18:07:28 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
>> Newly allocated memory is initialized either with user-provided values for its fields or by zeroing it, as Java semantics require (system initialization).
>>
>> The C2 compiler creates a ClearArray node to perform system initialization. ClearArray accepts the number of heap words to initialize, which can be a constant or a non-constant value. For a constant number of heap words below InitArrayShortSize (default value 64 bytes), the compiler currently generates StoreL nodes, which initialize memory at a granularity of 8 bytes.
>>
>> This patch vectorizes the initializing stores for constant sizes below InitArrayShortSize by emitting a special instruction sequence for each tail size.
>>
>> In addition, the existing initialization under UseXMMForObjInit is extended to use masked operations to optimize the tail initialization sequence. If AVX3Threshold is set to 0, the new initialization sequence uses 64-byte ZMM registers.
>>
>> The following performance numbers were collected using the micro-benchmark included with the patch.
>>
>> Testing: Tier1-Tier3 tests are clean.
>>
>> System configuration: Cascade Lake, Intel Xeon Platinum 8280L @ 2.7 GHz, 2 sockets, 28 cores per socket.
>>
>> Baseline:
>> Benchmark                      Mode  Cnt       Score       Error  Units
>> ClearMemory.testClearMemory3  thrpt   10  212508.522 ± 14071.493  ops/s
>> ClearMemory.testClearMemory4  thrpt   10  189530.643 ± 12882.421  ops/s
>> ClearMemory.testClearMemory5  thrpt   10  167878.803 ± 10307.163  ops/s
>> ClearMemory.testClearMemory6  thrpt   10  152732.184 ±  8740.128  ops/s
>> ClearMemory.testClearMemory7  thrpt   10  132111.536 ±  5493.043  ops/s
>>
>> With Optimization:
>> Benchmark                      Mode  Cnt       Score       Error  Units
>> ClearMemory.testClearMemory3  thrpt   10  220378.082 ± 18533.701  ops/s
>> ClearMemory.testClearMemory4  thrpt   10  198023.913 ± 15995.780  ops/s
>> ClearMemory.testClearMemory5  thrpt   10  183476.886 ± 13488.821  ops/s
>> ClearMemory.testClearMemory6  thrpt   10  161710.750 ±  9270.182  ops/s
>> ClearMemory.testClearMemory7  thrpt   10  145059.426 ±  8217.484  ops/s
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
> 8257772: Changes for 32 bit build
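
As a rough illustration of the masked tail initialization described in the quoted patch summary, the idea can be modeled in plain C++. This is only a scalar sketch under stated assumptions: the helper names (`tail_mask`, `masked_zero_store`, `clear_qwords`) are hypothetical and do not appear in the patch, which emits AVX-512 instructions through the macro assembler rather than running a loop like this.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: builds an opmask-style bit mask that selects the low
// `tail_qwords` lanes of a 64-byte vector (8 lanes of 8 bytes each).
inline std::uint8_t tail_mask(unsigned tail_qwords) {
    assert(tail_qwords <= 8);
    return static_cast<std::uint8_t>((1u << tail_qwords) - 1u);
}

// Scalar model of a masked 64-byte store of zeros: lane i is written only
// when bit i of the mask is set, so memory past the tail stays untouched.
inline void masked_zero_store(std::uint64_t* dst, std::uint8_t mask) {
    for (unsigned lane = 0; lane < 8; ++lane) {
        if (mask & (1u << lane)) {
            dst[lane] = 0;
        }
    }
}

// Clears `qwords` 8-byte words: full 8-lane stores for the bulk, then a
// single masked store for the remaining tail (assumed structure).
inline void clear_qwords(std::uint64_t* base, std::size_t qwords) {
    std::size_t i = 0;
    for (; i + 8 <= qwords; i += 8) {
        masked_zero_store(base + i, 0xFF);
    }
    unsigned tail = static_cast<unsigned>(qwords - i);
    if (tail != 0) {
        masked_zero_store(base + i, tail_mask(tail));
    }
}
```

The point of the mask is that the tail needs one store instead of a sequence of narrower stores, while words beyond the requested length are left unmodified.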
I submitted some quick testing for this, and tests in `compiler/c2/cr6340864/` fail with:
# Internal Error (workspace/open/src/hotspot/cpu/x86/macroAssembler_x86.cpp:8178), pid=27510, tid=27529
# assert(MaxVectorSize >= 32) failed: vector length should be >= 32
Current CompileTask:
C2: 259 28 b java.lang.StringCoding::encodeASCII (158 bytes)
Stack: [0x00007f2d144f8000,0x00007f2d145f9000], sp=0x00007f2d145f3750, free space=1005k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x13a326c] MacroAssembler::fill64_avx(RegisterImpl*, int, XMMRegisterImpl*, bool)+0x11c
V [libjvm.so+0x13a3415] MacroAssembler::xmm_clear_mem(RegisterImpl*, RegisterImpl*, RegisterImpl*, XMMRegisterImpl*)+0x195
V [libjvm.so+0x13a458b] MacroAssembler::clear_mem(RegisterImpl*, RegisterImpl*, RegisterImpl*, XMMRegisterImpl*, bool)+0x19b
V [libjvm.so+0x395487] rep_stosNode::emit(CodeBuffer&, PhaseRegAlloc*) const+0x167
V [libjvm.so+0x15b79da] PhaseOutput::scratch_emit_size(Node const*)+0x3fa
V [libjvm.so+0x15ae88c] PhaseOutput::shorten_branches(unsigned int*)+0x2ac
V [libjvm.so+0x15c045a] PhaseOutput::Output()+0xcda
V [libjvm.so+0xa0a798] Compile::Code_Gen()+0x438
V [libjvm.so+0xa13fe7] Compile::Compile(ciEnv*, ciMethod*, int, bool, bool, bool, bool, DirectiveSet*)+0x1917
V [libjvm.so+0x8466ac] C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x1dc
V [libjvm.so+0xa24498] CompileBroker::invoke_compiler_on_method(CompileTask*)+0xe08
V [libjvm.so+0xa24fe8] CompileBroker::compiler_thread_loop()+0x5a8
V [libjvm.so+0x18ae756] JavaThread::thread_main_inner()+0x256
V [libjvm.so+0x18b50e0] Thread::call_run()+0x100
V [libjvm.so+0x1598346] thread_native_entry(Thread*)+0x116
Tests are executed with `-XX:CompileThreshold=100 -XX:-TieredCompilation`.
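
The assertion suggests that the 64-byte fill path is reached in a configuration where `MaxVectorSize` is below 32 (the flag can be lowered with `-XX:MaxVectorSize`). One possible guard, shown here only as a sketch, is to pick the store width from the permitted vector size instead of asserting. `FillPlan` and `plan_fill64` are hypothetical names, and the fallback policy is an assumption, not necessarily how the patch should be or was fixed.

```cpp
#include <cassert>

// Hypothetical description of how a 64-byte zero fill is split into stores.
struct FillPlan {
    int store_bytes;  // width of each store in bytes
    int store_count;  // number of stores needed to cover 64 bytes
};

// Assumed policy, not taken from the patch: use the widest store that
// max_vector_size permits rather than asserting max_vector_size >= 32.
inline FillPlan plan_fill64(int max_vector_size, bool use_64byte_vector) {
    assert(max_vector_size >= 0);
    int width;
    if (use_64byte_vector && max_vector_size >= 64) {
        width = 64;  // one ZMM-sized store
    } else if (max_vector_size >= 32) {
        width = 32;  // two YMM-sized stores
    } else if (max_vector_size >= 16) {
        width = 16;  // four XMM-sized stores
    } else {
        width = 8;   // scalar 8-byte stores
    }
    return FillPlan{width, 64 / width};
}
```

An alternative with the same effect would be to check the vector size in the caller and skip the vectorized path entirely when it is too small.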
-------------
Changes requested by thartmann (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/1631