RFC: C2 Object Initialization - Using XMM/YMM registers
Vladimir Kozlov
vladimir.kozlov at oracle.com
Fri Apr 6 22:48:50 UTC 2018
The OpenJDK rules say that you need to contribute at least 2 significant
changes and sign the OCA (Oracle Contributor Agreement) before applying
for Author status. Fortunately, AMD has already signed up as a company,
which covers all its employees.
http://openjdk.java.net/projects/#project-author
As far as I know, you have contributed only one change (8187219) so far; you
need one more.
But you should be able to see the contents of the RFE, right?
I added the text and data from your e-mail.
Thanks,
Vladimir
On 4/5/18 10:04 PM, Rohit Arul Raj wrote:
> On Thu, Apr 5, 2018 at 11:38 PM, Vladimir Kozlov
> <vladimir.kozlov at oracle.com> wrote:
>> Good suggestion, Rohit
>>
>> I created a new RFE. Please add your suggestion and performance data there:
>>
>> https://bugs.openjdk.java.net/browse/JDK-8201193
>>
>
> Thanks Vladimir.
> I don't have an account/access to the JDK bug database yet. Is there any
> other way around it?
> Can I send in a request for the Author role?
>
> Regards,
> Rohit
>
>>
>> On 4/5/18 12:19 AM, Rohit Arul Raj wrote:
>>>
>>> Hi All,
>>>
>>> I was going through the C2 object initialization (zeroing) code based
>>> on the bug entry below:
>>> https://bugs.openjdk.java.net/browse/JDK-8146801
>>>
>>> Right now, for longer lengths we use "rep stos" instructions on x86. I
>>> was experimenting with using XMM/YMM registers (on an AMD EPYC processor)
>>> and found that they do improve performance for certain lengths:
>>>
>>> For lengths > 64 bytes up to 512 bytes: the improvement is in the range
>>> of 8% to 44%.
>>> For lengths > 512 bytes: some lengths show a slight improvement in the
>>> range of 2% to 7%, while others are almost the same as the "rep stos"
>>> numbers.
>>>
>>> I have attached the complete performance data (data.txt) for reference.
>>> Can we add this as a user option similar to UseXMMForArrayCopy?
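>>>
>>> With the patch applied, the new path would presumably be enabled together
>>> with unaligned stores, for example:
>>>     java -XX:+UseUnalignedLoadStores -XX:+UseXMMForObjInit <benchmark>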
>>>
>>> I have used the same test case as in
>>> (http://cr.openjdk.java.net/~shade/8146801/benchmarks.jar) with
>>> additional sizes.
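>>>
>>> As a rough standalone illustration (not part of the patch, and not the
>>> harness used for the numbers above), the two zeroing strategies being
>>> compared look roughly like this in plain C++; names and the tail handling
>>> are illustrative only:
>>>
>>>     // Illustrative sketch: "rep stos" zeroing vs. unaligned 32-byte YMM
>>>     // stores. Assumes GCC/Clang on x86-64, compiled with -mavx; the
>>>     // patch also has a 16-byte XMM path for pre-AVX2 hardware.
>>>     #include <immintrin.h>
>>>     #include <cstddef>
>>>     #include <cstdint>
>>>     #include <cstring>
>>>
>>>     // Zero 'bytes' bytes at 'p' with rep stosq (bytes: multiple of 8).
>>>     static void zero_rep_stos(void* p, size_t bytes) {
>>>       size_t qwords = bytes / 8;
>>>       uint64_t zero = 0;
>>>       asm volatile("rep stosq"
>>>                    : "+D"(p), "+c"(qwords)
>>>                    : "a"(zero)
>>>                    : "memory");
>>>     }
>>>
>>>     // Zero 'bytes' bytes with 32-byte YMM stores, then a byte-wise tail.
>>>     static void zero_ymm(void* p, size_t bytes) {
>>>       uint8_t* dst = static_cast<uint8_t*>(p);
>>>       __m256i z = _mm256_setzero_si256();
>>>       size_t i = 0;
>>>       for (; i + 32 <= bytes; i += 32)
>>>         _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst + i), z);
>>>       if (i < bytes)
>>>         std::memset(dst + i, 0, bytes - i);  // remaining tail
>>>     }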
>>>
>>> Initial Patch:
>>> I haven't added the check for 32-bit mode as I need some help with the
>>> code (description given below the patch).
>>> The code is similar to the one used in array copy stubs
>>> (copy_bytes_forward).
>>>
>>> diff --git a/src/hotspot/cpu/x86/globals_x86.hpp
>>> b/src/hotspot/cpu/x86/globals_x86.hpp
>>> --- a/src/hotspot/cpu/x86/globals_x86.hpp
>>> +++ b/src/hotspot/cpu/x86/globals_x86.hpp
>>> @@ -150,6 +150,9 @@
>>>   product(bool, UseUnalignedLoadStores, false,                          \
>>>           "Use SSE2 MOVDQU instruction for Arraycopy")                  \
>>>                                                                         \
>>> + product(bool, UseXMMForObjInit, false,                                \
>>> +         "Use XMM/YMM MOVDQU instruction for Object Initialization")   \
>>> +                                                                       \
>>>   product(bool, UseFastStosb, false,                                    \
>>>           "Use fast-string operation for zeroing: rep stosb")           \
>>>                                                                         \
>>> diff --git a/src/hotspot/cpu/x86/macroAssembler_x86.cpp
>>> b/src/hotspot/cpu/x86/macroAssembler_x86.cpp
>>> --- a/src/hotspot/cpu/x86/macroAssembler_x86.cpp
>>> +++ b/src/hotspot/cpu/x86/macroAssembler_x86.cpp
>>> @@ -7106,6 +7106,56 @@
>>>    if (UseFastStosb) {
>>>      shlptr(cnt, 3); // convert to number of bytes
>>>      rep_stosb();
>>> +  } else if (UseXMMForObjInit && UseUnalignedLoadStores) {
>>> +    Label L_loop, L_sloop, L_check, L_tail, L_end;
>>> +    push(base);
>>> +    if (UseAVX >= 2)
>>> +      vpxor(xmm10, xmm10, xmm10, AVX_256bit);
>>> +    else
>>> +      vpxor(xmm10, xmm10, xmm10, AVX_128bit);
>>> +
>>> +    jmp(L_check);
>>> +
>>> +    BIND(L_loop);
>>> +    if (UseAVX >= 2) {
>>> +      vmovdqu(Address(base, 0), xmm10);
>>> +      vmovdqu(Address(base, 32), xmm10);
>>> +    } else {
>>> +      movdqu(Address(base, 0), xmm10);
>>> +      movdqu(Address(base, 16), xmm10);
>>> +      movdqu(Address(base, 32), xmm10);
>>> +      movdqu(Address(base, 48), xmm10);
>>> +    }
>>> +    addptr(base, 64);
>>> +
>>> +    BIND(L_check);
>>> +    subptr(cnt, 8);
>>> +    jccb(Assembler::greaterEqual, L_loop);
>>> +    addptr(cnt, 4);
>>> +    jccb(Assembler::less, L_tail);
>>> +    // Copy trailing 32 bytes
>>> +    if (UseAVX >= 2) {
>>> +      vmovdqu(Address(base, 0), xmm10);
>>> +    } else {
>>> +      movdqu(Address(base, 0), xmm10);
>>> +      movdqu(Address(base, 16), xmm10);
>>> +    }
>>> +    addptr(base, 32);
>>> +    subptr(cnt, 4);
>>> +
>>> +    BIND(L_tail);
>>> +    addptr(cnt, 4);
>>> +    jccb(Assembler::lessEqual, L_end);
>>> +    decrement(cnt);
>>> +
>>> +    BIND(L_sloop);
>>> +    movptr(Address(base, 0), tmp);
>>> +    addptr(base, 8);
>>> +    decrement(cnt);
>>> +    jccb(Assembler::greaterEqual, L_sloop);
>>> +
>>> +    BIND(L_end);
>>> +    pop(base);
>>>    } else {
>>>      NOT_LP64(shlptr(cnt, 1);) // convert to number of 32-bit words for 32-bit VM
>>>      rep_stos();
>>>
>>>
>>> When I use XMM0 as a temporary register, the micro-benchmark crashes.
>>> Saving and restoring the XMM0 register before and after use works
>>> fine.
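>>>
>>> A sketch of that save/restore, reusing the existing MacroAssembler
>>> helpers, would look roughly like this (the stack offset is illustrative,
>>> and a full YMM save would need 32 bytes and vmovdqu):
>>>
>>>     subptr(rsp, 16);                      // reserve 16 bytes of stack
>>>     movdqu(Address(rsp, 0), xmm0);        // save the caller's xmm0
>>>     vpxor(xmm0, xmm0, xmm0, AVX_128bit);  // xmm0 := 0, use it as the temp
>>>     // ... XMM/YMM store loop as in the patch, with xmm0 instead of xmm10 ...
>>>     movdqu(xmm0, Address(rsp, 0));        // restore xmm0
>>>     addptr(rsp, 16);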
>>>
>>> Looking at the "hotspot/src/cpu/x86/vm/x86.ad" file, XMM0, like the
>>> other XMM registers, is listed as a Save-On-Call register, and on the
>>> Linux ABI no register is preserved across function calls, though
>>> XMM0-XMM7 may hold parameters. So I assumed that using XMM0 without
>>> saving/restoring it should be fine.
>>>
>>> Is it incorrect to use XMM* registers without saving/restoring them?
>>> Using the XMM10 register as a temporary register works fine without
>>> having to save and restore it.
>>>
>>> Please let me know your comments.
>>>
>>> Regards,
>>> Rohit
>>>
>>