RFC: C2 Object Initialization - Using XMM/YMM registers
Vladimir Kozlov
vladimir.kozlov at oracle.com
Mon Apr 23 19:03:32 UTC 2018
Sorry for the delay.
In general you can't use arbitrary registers without letting the JIT
compilers know that you use them. It will definitely cause problems.
You need to pass the register as an additional XMMRegister argument and
describe it as TEMP in the .ad files.
See byte_array_inflate() as an example.
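Concretely, the matcher rule that emits the zeroing code would grow an extra XMM temporary operand declared with a TEMP effect, so the register allocator knows the instruction clobbers it. The sketch below follows the instruct/effect conventions used in x86.ad; the exact operand classes and rule names are illustrative and must match what is actually in the current .ad files:

```
// Sketch only: a ClearArray-style rule passing a scratch XMM register.
// The TEMP effect tells C2's register allocator that `tmp` is
// clobbered by this instruction, so no live value is left in it.
instruct rep_stos(rcx_RegL cnt, rdi_RegP base, regD tmp, rax_RegI zero,
                  Universe dummy, rFlagsReg cr)
%{
  match(Set dummy (ClearArray cnt base));
  effect(USE_KILL cnt, USE_KILL base, TEMP tmp, KILL zero, KILL cr);
  ins_encode %{
    // The macro assembler routine then receives the temp explicitly
    // instead of hard-coding a register such as xmm10.
    __ clear_mem($base$$Register, $cnt$$Register, $zero$$Register,
                 $tmp$$XMMRegister);
  %}
%}
```

With this shape, the macro assembler routine never picks a register on its own; whatever XMM register the allocator assigns to `tmp` is safe to overwrite.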
On 4/11/18 7:25 PM, Rohit Arul Raj wrote:
>>> When I use XMM0 as a temporary register, the micro-benchmark crashes.
>>> Saving and Restoring the XMM0 register before and after use works
>>> fine.
>>>
>>> Looking at the "hotspot/src/cpu/x86/vm/x86.ad" file, XMM0 as with
>>> other XMM registers has been mentioned as Save-On-Call registers and
>>> on Linux ABI, no register is preserved across function calls though
>>> XMM0-XMM7 might hold parameters. So I assumed using XMM0 without
>>> saving/restoring should be fine.
>>>
>>> Is it incorrect to use XMM* registers without saving/restoring them?
>>> Using XMM10 register as temporary register works fine without having
>>> to save and restore it.
>
> Any comments/suggestions on the usage of XMM* registers?
>
> Thanks,
> Rohit
>
> On Thu, Apr 5, 2018 at 11:38 PM, Vladimir Kozlov
> <vladimir.kozlov at oracle.com> wrote:
>> Good suggestion, Rohit
>>
>> I created new RFE. Please add you suggestion and performance data there:
>>
>> https://bugs.openjdk.java.net/browse/JDK-8201193
>>
>> Thanks,
>> Vladimir
>>
>>
>> On 4/5/18 12:19 AM, Rohit Arul Raj wrote:
>>>
>>> Hi All,
>>>
>>> I was going through the C2 object initialization (zeroing) code based
>>> on the below bug entry:
>>> https://bugs.openjdk.java.net/browse/JDK-8146801
>>>
>>> Right now, for longer lengths we use "rep stos" instructions on x86. I
>>> was experimenting with using XMM/YMM registers (on AMD EPYC processor)
>>> and found that they do improve performance for certain lengths:
>>>
>>> For lengths > 64 bytes and up to 512 bytes: the improvement is in the
>>> range of 8% to 44%.
>>> For lengths > 512 bytes: some lengths show a slight improvement in the
>>> range of 2% to 7%; others are almost the same as the "rep stos" numbers.
>>>
>>> I have attached the complete performance data (data.txt) for reference.
>>> Can we add this as a user option similar to UseXMMForArrayCopy?
>>>
>>> I have used the same test case as in
>>> (http://cr.openjdk.java.net/~shade/8146801/benchmarks.jar) with
>>> additional sizes.
>>>
>>> Initial Patch:
>>> I haven't added the check for 32-bit mode as I need some help with the
>>> code (description given below the patch).
>>> The code is similar to the one used in array copy stubs
>>> (copy_bytes_forward).
>>>
>>> diff --git a/src/hotspot/cpu/x86/globals_x86.hpp b/src/hotspot/cpu/x86/globals_x86.hpp
>>> --- a/src/hotspot/cpu/x86/globals_x86.hpp
>>> +++ b/src/hotspot/cpu/x86/globals_x86.hpp
>>> @@ -150,6 +150,9 @@
>>>    product(bool, UseUnalignedLoadStores, false,                        \
>>>            "Use SSE2 MOVDQU instruction for Arraycopy")                \
>>>                                                                        \
>>> +  product(bool, UseXMMForObjInit, false,                              \
>>> +          "Use XMM/YMM MOVDQU instruction for Object Initialization") \
>>> +                                                                      \
>>>    product(bool, UseFastStosb, false,                                  \
>>>            "Use fast-string operation for zeroing: rep stosb")         \
>>>                                                                        \
>>> diff --git a/src/hotspot/cpu/x86/macroAssembler_x86.cpp b/src/hotspot/cpu/x86/macroAssembler_x86.cpp
>>> --- a/src/hotspot/cpu/x86/macroAssembler_x86.cpp
>>> +++ b/src/hotspot/cpu/x86/macroAssembler_x86.cpp
>>> @@ -7106,6 +7106,56 @@
>>>    if (UseFastStosb) {
>>>      shlptr(cnt, 3); // convert to number of bytes
>>>      rep_stosb();
>>> +  } else if (UseXMMForObjInit && UseUnalignedLoadStores) {
>>> +    Label L_loop, L_sloop, L_check, L_tail, L_end;
>>> +    push(base);
>>> +    if (UseAVX >= 2)
>>> +      vpxor(xmm10, xmm10, xmm10, AVX_256bit);
>>> +    else
>>> +      vpxor(xmm10, xmm10, xmm10, AVX_128bit);
>>> +
>>> +    jmp(L_check);
>>> +
>>> +    BIND(L_loop);
>>> +    if (UseAVX >= 2) {
>>> +      vmovdqu(Address(base,  0), xmm10);
>>> +      vmovdqu(Address(base, 32), xmm10);
>>> +    } else {
>>> +      movdqu(Address(base,  0), xmm10);
>>> +      movdqu(Address(base, 16), xmm10);
>>> +      movdqu(Address(base, 32), xmm10);
>>> +      movdqu(Address(base, 48), xmm10);
>>> +    }
>>> +    addptr(base, 64);
>>> +
>>> +    BIND(L_check);
>>> +    subptr(cnt, 8);
>>> +    jccb(Assembler::greaterEqual, L_loop);
>>> +    addptr(cnt, 4);
>>> +    jccb(Assembler::less, L_tail);
>>> +    // Zero trailing 32 bytes
>>> +    if (UseAVX >= 2) {
>>> +      vmovdqu(Address(base, 0), xmm10);
>>> +    } else {
>>> +      movdqu(Address(base,  0), xmm10);
>>> +      movdqu(Address(base, 16), xmm10);
>>> +    }
>>> +    addptr(base, 32);
>>> +    subptr(cnt, 4);
>>> +
>>> +    BIND(L_tail);
>>> +    addptr(cnt, 4);
>>> +    jccb(Assembler::lessEqual, L_end);
>>> +    decrement(cnt);
>>> +
>>> +    BIND(L_sloop);
>>> +    movptr(Address(base, 0), tmp);
>>> +    addptr(base, 8);
>>> +    decrement(cnt);
>>> +    jccb(Assembler::greaterEqual, L_sloop);
>>> +
>>> +    BIND(L_end);
>>> +    pop(base);
>>>    } else {
>>>      NOT_LP64(shlptr(cnt, 1);) // convert to number of 32-bit words for 32-bit VM
>>>      rep_stos();
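The chunking logic in the patch (64-byte main loop, 32-byte trailer, 8-byte scalar tail, with `cnt` counted in 8-byte words) can be modeled in plain C++ with SSE2 intrinsics. This is only an illustrative sketch of the code the macro assembler would emit, not HotSpot code; the function name and counting convention are mine:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>
#include <cstdint>

// Model of the patch's zeroing strategy. `cnt` counts 8-byte words,
// matching the patch (the real code emits these stores via the
// HotSpot macro assembler rather than intrinsics).
void xmm_clear(void* base, std::size_t cnt) {
  char* p = static_cast<char*>(base);
  const __m128i zero = _mm_setzero_si128();

  // Main loop: four unaligned 16-byte stores = 64 bytes per pass.
  while (cnt >= 8) {
    _mm_storeu_si128(reinterpret_cast<__m128i*>(p +  0), zero);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(p + 16), zero);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(p + 32), zero);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(p + 48), zero);
    p   += 64;
    cnt -= 8;
  }
  // Trailing 32-byte chunk, if at least 4 words remain.
  if (cnt >= 4) {
    _mm_storeu_si128(reinterpret_cast<__m128i*>(p +  0), zero);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(p + 16), zero);
    p   += 32;
    cnt -= 4;
  }
  // Scalar tail: one 8-byte store per remaining word.
  while (cnt > 0) {
    *reinterpret_cast<std::uint64_t*>(p) = 0;
    p += 8;
    --cnt;
  }
}
```

The same three-tier structure (wide vector stores, one narrower vector trailer, scalar tail) is what the `L_loop`/`L_check`/`L_tail` labels in the patch implement.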
>>>
>>>
>>> When I use XMM0 as a temporary register, the micro-benchmark crashes.
>>> Saving and Restoring the XMM0 register before and after use works
>>> fine.
>>>
>>> Looking at the "hotspot/src/cpu/x86/vm/x86.ad" file, XMM0 as with
>>> other XMM registers has been mentioned as Save-On-Call registers and
>>> on Linux ABI, no register is preserved across function calls though
>>> XMM0-XMM7 might hold parameters. So I assumed using XMM0 without
>>> saving/restoring should be fine.
>>>
>>> Is it incorrect to use XMM* registers without saving/restoring them?
>>> Using XMM10 register as temporary register works fine without having
>>> to save and restore it.
>>>
>>> Please let me know your comments.
>>>
>>> Regards,
>>> Rohit
>>>
>>
More information about the hotspot-dev mailing list