RFC: C2 Object Initialization - Using XMM/YMM registers
Vladimir Kozlov
vladimir.kozlov at oracle.com
Fri Apr 6 22:48:50 UTC 2018
The OpenJDK rules say that you need to contribute at least 2 significant
changes and sign the OCA (Oracle Contributor Agreement) before applying
for Author status. Fortunately, AMD has already signed up as a company,
which covers all its employees.
http://openjdk.java.net/projects/#project-author
As far as I know, you have contributed only one change (8187219) so far; you
need one more.
But you should be able to see the contents of the RFE, right?
I added the text and data from your e-mail.
Thanks,
Vladimir
On 4/5/18 10:04 PM, Rohit Arul Raj wrote:
> On Thu, Apr 5, 2018 at 11:38 PM, Vladimir Kozlov
> <vladimir.kozlov at oracle.com> wrote:
>> Good suggestion, Rohit
>>
>> I created a new RFE. Please add your suggestion and performance data there:
>>
>> https://bugs.openjdk.java.net/browse/JDK-8201193
>>
>
> Thanks Vladimir.
> I don't have an account/access to the JDK bug database yet. Is there any
> other way around it?
> Can I send in a request for the Author role?
>
> Regards,
> Rohit
>
>>
>> On 4/5/18 12:19 AM, Rohit Arul Raj wrote:
>>>
>>> Hi All,
>>>
>>> I was going through the C2 object initialization (zeroing) code based
>>> on the bug entry below:
>>> https://bugs.openjdk.java.net/browse/JDK-8146801
>>>
>>> Right now, for longer lengths we use "rep stos" instructions on x86. I
>>> was experimenting with using XMM/YMM registers (on an AMD EPYC processor)
>>> and found that they do improve performance for certain lengths:
>>>
>>> For lengths > 64 bytes up to 512 bytes: the improvement is in the range
>>> of 8% to 44%.
>>> For lengths > 512 bytes: some lengths show a slight improvement in the
>>> range of 2% to 7%, while others are almost the same as the "rep stos"
>>> numbers.
>>>
>>> I have attached the complete performance data (data.txt) for reference.
>>> Can we add this as a user option similar to UseXMMForArrayCopy?
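>>>
>>> With the patch applied, the new path would presumably be enabled together
>>> with unaligned stores, for example:
>>>     java -XX:+UseUnalignedLoadStores -XX:+UseXMMForObjInit <benchmark>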
>>>
>>> I have used the same test case as in
>>> (http://cr.openjdk.java.net/~shade/8146801/benchmarks.jar) with
>>> additional sizes.
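>>>
>>> As a rough standalone illustration (not part of the patch, and not the
>>> harness used for the numbers above), the two zeroing strategies being
>>> compared look roughly like this in plain C++; names and the tail handling
>>> are illustrative only:
>>>
>>>     // Illustrative sketch: "rep stos" zeroing vs. unaligned 32-byte YMM
>>>     // stores. Assumes GCC/Clang on x86-64, compiled with -mavx; the
>>>     // patch also has a 16-byte XMM path for pre-AVX2 hardware.
>>>     #include <immintrin.h>
>>>     #include <cstddef>
>>>     #include <cstdint>
>>>     #include <cstring>
>>>
>>>     // Zero 'bytes' bytes at 'p' with rep stosq (bytes: multiple of 8).
>>>     static void zero_rep_stos(void* p, size_t bytes) {
>>>       size_t qwords = bytes / 8;
>>>       uint64_t zero = 0;
>>>       asm volatile("rep stosq"
>>>                    : "+D"(p), "+c"(qwords)
>>>                    : "a"(zero)
>>>                    : "memory");
>>>     }
>>>
>>>     // Zero 'bytes' bytes with 32-byte YMM stores, then a byte-wise tail.
>>>     static void zero_ymm(void* p, size_t bytes) {
>>>       uint8_t* dst = static_cast<uint8_t*>(p);
>>>       __m256i z = _mm256_setzero_si256();
>>>       size_t i = 0;
>>>       for (; i + 32 <= bytes; i += 32)
>>>         _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst + i), z);
>>>       if (i < bytes)
>>>         std::memset(dst + i, 0, bytes - i);  // remaining tail
>>>     }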
>>>
>>> Initial Patch:
>>> I haven't added the check for 32-bit mode as I need some help with the
>>> code (description given below the patch).
>>> The code is similar to the one used in array copy stubs
>>> (copy_bytes_forward).
>>>
>>> diff --git a/src/hotspot/cpu/x86/globals_x86.hpp
>>> b/src/hotspot/cpu/x86/globals_x86.hpp
>>> --- a/src/hotspot/cpu/x86/globals_x86.hpp
>>> +++ b/src/hotspot/cpu/x86/globals_x86.hpp
>>> @@ -150,6 +150,9 @@
>>>   product(bool, UseUnalignedLoadStores, false,                          \
>>>           "Use SSE2 MOVDQU instruction for Arraycopy")                  \
>>>                                                                         \
>>> + product(bool, UseXMMForObjInit, false,                                \
>>> +         "Use XMM/YMM MOVDQU instruction for Object Initialization")   \
>>> +                                                                       \
>>>   product(bool, UseFastStosb, false,                                    \
>>>           "Use fast-string operation for zeroing: rep stosb")           \
>>>                                                                         \
>>> diff --git a/src/hotspot/cpu/x86/macroAssembler_x86.cpp
>>> b/src/hotspot/cpu/x86/macroAssembler_x86.cpp
>>> --- a/src/hotspot/cpu/x86/macroAssembler_x86.cpp
>>> +++ b/src/hotspot/cpu/x86/macroAssembler_x86.cpp
>>> @@ -7106,6 +7106,56 @@
>>>    if (UseFastStosb) {
>>>      shlptr(cnt, 3); // convert to number of bytes
>>>      rep_stosb();
>>> +  } else if (UseXMMForObjInit && UseUnalignedLoadStores) {
>>> +    Label L_loop, L_sloop, L_check, L_tail, L_end;
>>> +    push(base);
>>> +    if (UseAVX >= 2)
>>> +      vpxor(xmm10, xmm10, xmm10, AVX_256bit);
>>> +    else
>>> +      vpxor(xmm10, xmm10, xmm10, AVX_128bit);
>>> +
>>> +    jmp(L_check);
>>> +
>>> +    BIND(L_loop);
>>> +    if (UseAVX >= 2) {
>>> +      vmovdqu(Address(base, 0), xmm10);
>>> +      vmovdqu(Address(base, 32), xmm10);
>>> +    } else {
>>> +      movdqu(Address(base, 0), xmm10);
>>> +      movdqu(Address(base, 16), xmm10);
>>> +      movdqu(Address(base, 32), xmm10);
>>> +      movdqu(Address(base, 48), xmm10);
>>> +    }
>>> +    addptr(base, 64);
>>> +
>>> +    BIND(L_check);
>>> +    subptr(cnt, 8);
>>> +    jccb(Assembler::greaterEqual, L_loop);
>>> +    addptr(cnt, 4);
>>> +    jccb(Assembler::less, L_tail);
>>> +    // Copy trailing 32 bytes
>>> +    if (UseAVX >= 2) {
>>> +      vmovdqu(Address(base, 0), xmm10);
>>> +    } else {
>>> +      movdqu(Address(base, 0), xmm10);
>>> +      movdqu(Address(base, 16), xmm10);
>>> +    }
>>> +    addptr(base, 32);
>>> +    subptr(cnt, 4);
>>> +
>>> +    BIND(L_tail);
>>> +    addptr(cnt, 4);
>>> +    jccb(Assembler::lessEqual, L_end);
>>> +    decrement(cnt);
>>> +
>>> +    BIND(L_sloop);
>>> +    movptr(Address(base, 0), tmp);
>>> +    addptr(base, 8);
>>> +    decrement(cnt);
>>> +    jccb(Assembler::greaterEqual, L_sloop);
>>> +
>>> +    BIND(L_end);
>>> +    pop(base);
>>>    } else {
>>>      NOT_LP64(shlptr(cnt, 1);) // convert to number of 32-bit words for 32-bit VM
>>>      rep_stos();
>>>
>>>
>>> When I use XMM0 as a temporary register, the micro-benchmark crashes.
>>> Saving and restoring the XMM0 register before and after use works
>>> fine.
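>>>
>>> A sketch of that save/restore, reusing the existing MacroAssembler
>>> helpers, would look roughly like this (the stack offset is illustrative,
>>> and a full YMM save would need 32 bytes and vmovdqu):
>>>
>>>     subptr(rsp, 16);                      // reserve 16 bytes of stack
>>>     movdqu(Address(rsp, 0), xmm0);        // save the caller's xmm0
>>>     vpxor(xmm0, xmm0, xmm0, AVX_128bit);  // xmm0 := 0, use it as the temp
>>>     // ... XMM/YMM store loop as in the patch, with xmm0 instead of xmm10 ...
>>>     movdqu(xmm0, Address(rsp, 0));        // restore xmm0
>>>     addptr(rsp, 16);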
>>>
>>> Looking at the "hotspot/src/cpu/x86/vm/x86.ad" file, XMM0, like the
>>> other XMM registers, is listed as a Save-On-Call register, and on the
>>> Linux ABI no register is preserved across function calls, though
>>> XMM0-XMM7 may hold parameters. So I assumed that using XMM0 without
>>> saving/restoring it should be fine.
>>>
>>> Is it incorrect to use XMM* registers without saving/restoring them?
>>> Using the XMM10 register as a temporary register works fine without
>>> having to save and restore it.
>>>
>>> Please let me know your comments.
>>>
>>> Regards,
>>> Rohit
>>>
>>