RFC: C2 Object Initialization - Using XMM/YMM registers

Rohit Arul Raj rohitarulraj at gmail.com
Thu Apr 12 02:25:36 UTC 2018


>> When I use XMM0 as a temporary register, the micro-benchmark crashes.
>> Saving and Restoring the XMM0 register before and after use works
>> fine.
>>
>> Looking at the "hotspot/src/cpu/x86/vm/x86.ad" file, XMM0, like the
>> other XMM registers, is defined as a Save-On-Call register, and
>> under the Linux ABI no XMM register is preserved across function calls,
>> though XMM0-XMM7 might hold parameters. So I assumed using XMM0 without
>> saving/restoring should be fine.
>>
>> Is it incorrect to use XMM* registers without saving/restoring them?
>> Using XMM10 register as temporary register works fine without having
>> to save and restore it.

Any comments/suggestions on the usage of XMM* registers?

Thanks,
Rohit

On Thu, Apr 5, 2018 at 11:38 PM, Vladimir Kozlov
<vladimir.kozlov at oracle.com> wrote:
> Good suggestion, Rohit
>
> I created a new RFE. Please add your suggestion and performance data there:
>
> https://bugs.openjdk.java.net/browse/JDK-8201193
>
> Thanks,
> Vladimir
>
>
> On 4/5/18 12:19 AM, Rohit Arul Raj wrote:
>>
>> Hi All,
>>
>> I was going through the C2 object initialization (zeroing) code based
>> on the below bug entry:
>> https://bugs.openjdk.java.net/browse/JDK-8146801
>>
>> Right now, for longer lengths we use "rep stos" instructions on x86. I
>> was experimenting with using XMM/YMM registers (on AMD EPYC processor)
>> and found that they do improve performance for certain lengths:
>>
>> For lengths > 64 bytes up to 512 bytes: improvement is in the range of
>> 8% to 44%.
>> For lengths > 512 bytes: some lengths show a slight improvement in the
>> range of 2% to 7%; others are almost the same as the "rep stos" numbers.
>>
>> I have attached the complete performance data (data.txt) for reference.
>> Can we add this as a user option similar to UseXMMForArrayCopy?
>>
>> I have used the same test case as in
>> (http://cr.openjdk.java.net/~shade/8146801/benchmarks.jar) with
>> additional sizes.
>>
>> Initial Patch:
>> I haven't added the check for 32-bit mode as I need some help with the
>> code (description given below the patch).
>> The code is similar to the one used in array copy stubs
>> (copy_bytes_forward).
>>
>> diff --git a/src/hotspot/cpu/x86/globals_x86.hpp b/src/hotspot/cpu/x86/globals_x86.hpp
>> --- a/src/hotspot/cpu/x86/globals_x86.hpp
>> +++ b/src/hotspot/cpu/x86/globals_x86.hpp
>> @@ -150,6 +150,9 @@
>>    product(bool, UseUnalignedLoadStores, false,                             \
>>            "Use SSE2 MOVDQU instruction for Arraycopy")                     \
>>                                                                             \
>> +  product(bool, UseXMMForObjInit, false,                                   \
>> +          "Use XMM/YMM MOVDQU instruction for Object Initialization")      \
>> +                                                                           \
>>    product(bool, UseFastStosb, false,                                       \
>>            "Use fast-string operation for zeroing: rep stosb")              \
>>                                                                             \
>> diff --git a/src/hotspot/cpu/x86/macroAssembler_x86.cpp b/src/hotspot/cpu/x86/macroAssembler_x86.cpp
>> --- a/src/hotspot/cpu/x86/macroAssembler_x86.cpp
>> +++ b/src/hotspot/cpu/x86/macroAssembler_x86.cpp
>> @@ -7106,6 +7106,56 @@
>>     if (UseFastStosb) {
>>       shlptr(cnt, 3); // convert to number of bytes
>>       rep_stosb();
>> +  } else if (UseXMMForObjInit && UseUnalignedLoadStores) {
>> +    Label L_loop, L_sloop, L_check, L_tail, L_end;
>> +    push(base);
>> +    if (UseAVX >= 2)
>> +      vpxor(xmm10, xmm10, xmm10, AVX_256bit);
>> +    else
>> +      vpxor(xmm10, xmm10, xmm10, AVX_128bit);
>> +
>> +    jmp(L_check);
>> +
>> +    BIND(L_loop);
>> +    if (UseAVX >= 2) {
>> +      vmovdqu(Address(base,  0), xmm10);
>> +      vmovdqu(Address(base, 32), xmm10);
>> +    } else {
>> +      movdqu(Address(base,  0), xmm10);
>> +      movdqu(Address(base, 16), xmm10);
>> +      movdqu(Address(base, 32), xmm10);
>> +      movdqu(Address(base, 48), xmm10);
>> +    }
>> +    addptr(base, 64);
>> +
>> +    BIND(L_check);
>> +    subptr(cnt, 8);
>> +    jccb(Assembler::greaterEqual, L_loop);
>> +    addptr(cnt, 4);
>> +    jccb(Assembler::less, L_tail);
>> +    // Zero the trailing 32 bytes
>> +    if (UseAVX >= 2) {
>> +      vmovdqu(Address(base, 0), xmm10);
>> +    } else {
>> +      movdqu(Address(base,  0), xmm10);
>> +      movdqu(Address(base, 16), xmm10);
>> +    }
>> +    addptr(base, 32);
>> +    subptr(cnt, 4);
>> +
>> +    BIND(L_tail);
>> +    addptr(cnt, 4);
>> +    jccb(Assembler::lessEqual, L_end);
>> +    decrement(cnt);
>> +
>> +    BIND(L_sloop);
>> +    movptr(Address(base, 0), tmp);
>> +    addptr(base, 8);
>> +    decrement(cnt);
>> +    jccb(Assembler::greaterEqual, L_sloop);
>> +
>> +    BIND(L_end);
>> +    pop(base);
>>     } else {
>>       NOT_LP64(shlptr(cnt, 1);) // convert to number of 32-bit words for 32-bit VM
>>       rep_stos();
>>
>>
>> When I use XMM0 as a temporary register, the micro-benchmark crashes.
>> Saving and Restoring the XMM0 register before and after use works
>> fine.
>>
>> Looking at the "hotspot/src/cpu/x86/vm/x86.ad" file, XMM0, like the
>> other XMM registers, is defined as a Save-On-Call register, and
>> under the Linux ABI no XMM register is preserved across function calls,
>> though XMM0-XMM7 might hold parameters. So I assumed using XMM0 without
>> saving/restoring should be fine.
>>
>> Is it incorrect to use XMM* registers without saving/restoring them?
>> Using XMM10 register as temporary register works fine without having
>> to save and restore it.
>>
>> Please let me know your comments.
>>
>> Regards,
>> Rohit
>>
>
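
For readers following the control flow of the quoted patch, here is a minimal portable C++ sketch of the same three-stage zeroing scheme: 64-byte blocks, then at most one 32-byte block, then single 8-byte words. It is not HotSpot code. The names clear_words_sketch and StoreCounts are hypothetical, std::memset stands in for the vmovdqu/movdqu/movptr stores, and cnt is assumed to be a non-negative count of 8-byte words, as in the 64-bit path of the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical bookkeeping type: how many stores of each width were issued.
struct StoreCounts {
    std::size_t blocks64;  // 64-byte main-loop iterations
    std::size_t blocks32;  // 32-byte trailing block (0 or 1)
    std::size_t words8;    // 8-byte tail stores (0 to 3)
};

// Zero `cnt` 8-byte words starting at `base`, mirroring the structure of the
// quoted patch (L_loop / trailing 32 bytes / L_sloop). Illustrative only.
StoreCounts clear_words_sketch(unsigned char* base, std::ptrdiff_t cnt) {
    assert(cnt >= 0);  // assumption: the caller passes a non-negative word count
    StoreCounts n{0, 0, 0};

    // Main loop: 8 words (64 bytes) per iteration, like L_loop in the patch.
    while (cnt >= 8) {
        std::memset(base, 0, 64);
        base += 64;
        cnt -= 8;
        ++n.blocks64;
    }

    // At most one trailing 32-byte block (4 words).
    if (cnt >= 4) {
        std::memset(base, 0, 32);
        base += 32;
        cnt -= 4;
        ++n.blocks32;
    }

    // Remaining 0 to 3 words, one 8-byte store each, like L_sloop.
    while (cnt > 0) {
        std::memset(base, 0, 8);
        base += 8;
        --cnt;
        ++n.words8;
    }
    return n;
}
```

For example, a request for 13 words (104 bytes) would take one 64-byte iteration, one 32-byte block and one 8-byte store. The sketch says nothing about register allocation, which is the open question in this thread.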

