RFC: C2 Object Initialization - Using XMM/YMM registers
Rohit Arul Raj
rohitarulraj at gmail.com
Fri Apr 6 23:40:59 UTC 2018
On Sat, Apr 7, 2018 at 4:18 AM, Vladimir Kozlov
<vladimir.kozlov at oracle.com> wrote:
> The OpenJDK rules say that you need to contribute at least 2 significant
> changes and sign the OCA (Oracle Contributor Agreement) before applying for
> Author status. Fortunately, AMD has already signed as a company, which covers
> all its employees.
>
> http://openjdk.java.net/projects/#project-author
>
> As far as I know, you have contributed only one change (8187219) so far.
> You need one more.
>
> But you should be able to see the contents of the RFE. Right?
Thanks Vladimir, I can see the contents of the RFE.
Regards,
Rohit
> On 4/5/18 10:04 PM, Rohit Arul Raj wrote:
>>
>> On Thu, Apr 5, 2018 at 11:38 PM, Vladimir Kozlov
>> <vladimir.kozlov at oracle.com> wrote:
>>>
>>> Good suggestion, Rohit
>>>
>>> I created a new RFE. Please add your suggestion and performance data there:
>>>
>>> https://bugs.openjdk.java.net/browse/JDK-8201193
>>>
>>
>> Thanks Vladimir.
>> I don't have an account for, or access to, the JDK bug database yet. Is
>> there any other way around it?
>> Can I send in a request for Author role?
>>
>> Regards,
>> Rohit
>>
>>>
>>> On 4/5/18 12:19 AM, Rohit Arul Raj wrote:
>>>>
>>>>
>>>> Hi All,
>>>>
>>>> I was going through the C2 object initialization (zeroing) code based
>>>> on the below bug entry:
>>>> https://bugs.openjdk.java.net/browse/JDK-8146801
>>>>
>>>> Right now, for longer lengths we use "rep stos" instructions on x86. I
>>>> was experimenting with using XMM/YMM registers (on an AMD EPYC processor)
>>>> and found that they do improve performance for certain lengths:
>>>>
>>>> For lengths > 64 bytes and up to 512 bytes: improvement is in the
>>>> range of 8% to 44%.
>>>> For lengths > 512 bytes: some lengths show a slight improvement in the
>>>> range of 2% to 7%; others are almost the same as the "rep stos" numbers.
>>>>
>>>> I have attached the complete performance data (data.txt) for reference.
>>>> Can we add this as a user option similar to UseXMMForArrayCopy?
>>>>
>>>> I have used the same test case as in
>>>> (http://cr.openjdk.java.net/~shade/8146801/benchmarks.jar) with
>>>> additional sizes.
>>>>
>>>> Initial Patch:
>>>> I haven't added the check for 32-bit mode as I need some help with the
>>>> code (description given below the patch).
>>>> The code is similar to the one used in array copy stubs
>>>> (copy_bytes_forward).
>>>>
>>>> diff --git a/src/hotspot/cpu/x86/globals_x86.hpp
>>>> b/src/hotspot/cpu/x86/globals_x86.hpp
>>>> --- a/src/hotspot/cpu/x86/globals_x86.hpp
>>>> +++ b/src/hotspot/cpu/x86/globals_x86.hpp
>>>> @@ -150,6 +150,9 @@
>>>> product(bool, UseUnalignedLoadStores, false,
>>>> \
>>>> "Use SSE2 MOVDQU instruction for Arraycopy")
>>>> \
>>>>
>>>> \
>>>> + product(bool, UseXMMForObjInit, false,
>>>> \
>>>> + "Use XMM/YMM MOVDQU instruction for Object Initialization")
>>>> \
>>>> +
>>>> \
>>>> product(bool, UseFastStosb, false,
>>>> \
>>>> "Use fast-string operation for zeroing: rep stosb")
>>>> \
>>>>
>>>> \
>>>> diff --git a/src/hotspot/cpu/x86/macroAssembler_x86.cpp
>>>> b/src/hotspot/cpu/x86/macroAssembler_x86.cpp
>>>> --- a/src/hotspot/cpu/x86/macroAssembler_x86.cpp
>>>> +++ b/src/hotspot/cpu/x86/macroAssembler_x86.cpp
>>>> @@ -7106,6 +7106,56 @@
>>>> if (UseFastStosb) {
>>>> shlptr(cnt, 3); // convert to number of bytes
>>>> rep_stosb();
>>>> + } else if (UseXMMForObjInit && UseUnalignedLoadStores) {
>>>> + Label L_loop, L_sloop, L_check, L_tail, L_end;
>>>> + push(base);
>>>> + if (UseAVX >= 2)
>>>> + vpxor(xmm10, xmm10, xmm10, AVX_256bit);
>>>> + else
>>>> + vpxor(xmm10, xmm10, xmm10, AVX_128bit);
>>>> +
>>>> + jmp(L_check);
>>>> +
>>>> + BIND(L_loop);
>>>> + if (UseAVX >= 2) {
>>>> + vmovdqu(Address(base, 0), xmm10);
>>>> + vmovdqu(Address(base, 32), xmm10);
>>>> + } else {
>>>> + movdqu(Address(base, 0), xmm10);
>>>> + movdqu(Address(base, 16), xmm10);
>>>> + movdqu(Address(base, 32), xmm10);
>>>> + movdqu(Address(base, 48), xmm10);
>>>> + }
>>>> + addptr(base, 64);
>>>> +
>>>> + BIND(L_check);
>>>> + subptr(cnt, 8);
>>>> + jccb(Assembler::greaterEqual, L_loop);
>>>> + addptr(cnt, 4);
>>>> + jccb(Assembler::less, L_tail);
>>>> + // Zero trailing 32 bytes
>>>> + if (UseAVX >= 2) {
>>>> + vmovdqu(Address(base, 0), xmm10);
>>>> + } else {
>>>> + movdqu(Address(base, 0), xmm10);
>>>> + movdqu(Address(base, 16), xmm10);
>>>> + }
>>>> + addptr(base, 32);
>>>> + subptr(cnt, 4);
>>>> +
>>>> + BIND(L_tail);
>>>> + addptr(cnt, 4);
>>>> + jccb(Assembler::lessEqual, L_end);
>>>> + decrement(cnt);
>>>> +
>>>> + BIND(L_sloop);
>>>> + movptr(Address(base, 0), tmp);
>>>> + addptr(base, 8);
>>>> + decrement(cnt);
>>>> + jccb(Assembler::greaterEqual, L_sloop);
>>>> +
>>>> + BIND(L_end);
>>>> + pop(base);
>>>> } else {
>>>> NOT_LP64(shlptr(cnt, 1);) // convert to number of 32-bit words
>>>> for 32-bit VM
>>>> rep_stos();
>>>>
>>>>
>>>> When I use XMM0 as a temporary register, the micro-benchmark crashes.
>>>> Saving and restoring the XMM0 register before and after use works
>>>> fine.
>>>>
>>>> In the "hotspot/src/cpu/x86/vm/x86.ad" file, XMM0, like the other XMM
>>>> registers, is listed as a Save-On-Call register, and under the Linux
>>>> ABI no XMM register is preserved across function calls, though
>>>> XMM0-XMM7 might hold parameters. So I assumed that using XMM0 without
>>>> saving/restoring it should be fine.
>>>>
>>>> Is it incorrect to use XMM* registers without saving/restoring them?
>>>> Using XMM10 as the temporary register works fine without having
>>>> to save and restore it.
>>>>
>>>> Please let me know your comments.
>>>>
>>>> Regards,
>>>> Rohit
>>>>
>>>
>
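
For readers following the counter arithmetic in the quoted patch, here is a
rough model of its control flow in plain C++. It is only a sketch under some
assumptions: the name zero_words_chunked is hypothetical, cnt is taken to be
a count of 8-byte words (as on a 64-bit VM), and std::memset stands in for
the XMM/YMM stores, so it says nothing about performance. It only shows how
the 64-byte loop, the single 32-byte step, and the word-by-word tail divide
the work.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical model of the patch's zeroing path, not HotSpot code.
// Zeroes `cnt` 8-byte words starting at `base`.
void zero_words_chunked(std::uint64_t* base, std::ptrdiff_t cnt) {
    assert(cnt >= 0);
    unsigned char* p = reinterpret_cast<unsigned char*>(base);

    // L_check / L_loop: 64 bytes (8 words) per iteration.
    cnt -= 8;
    while (cnt >= 0) {
        std::memset(p, 0, 64);   // stands in for 2 YMM or 4 XMM stores
        p += 64;
        cnt -= 8;
    }
    // Here cnt is in [-8, -1]; the remaining word count is cnt + 8.

    // One optional 32-byte (4-word) step.
    cnt += 4;
    if (cnt >= 0) {
        std::memset(p, 0, 32);   // stands in for 1 YMM or 2 XMM stores
        p += 32;
        cnt -= 4;
    }

    // L_tail / L_sloop: at most 3 words left, stored one at a time.
    cnt += 4;                    // now cnt is in [0, 3]
    while (cnt > 0) {
        std::memset(p, 0, 8);    // stands in for the 8-byte movptr store
        p += 8;
        --cnt;
    }
}
```

For example, a count of 13 words takes one 64-byte iteration (8 words), the
32-byte step (4 words), and one single-word store.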