RFC: C2 Object Initialization - Using XMM/YMM registers

Vladimir Kozlov vladimir.kozlov at oracle.com
Thu Apr 5 18:08:57 UTC 2018


Good suggestion, Rohit.

I created a new RFE. Please add your suggestion and performance data there:

https://bugs.openjdk.java.net/browse/JDK-8201193

Thanks,
Vladimir

On 4/5/18 12:19 AM, Rohit Arul Raj wrote:
> Hi All,
> 
> I was going through the C2 object initialization (zeroing) code based
> on the below bug entry:
> https://bugs.openjdk.java.net/browse/JDK-8146801
> 
> Right now, for longer lengths we use "rep stos" instructions on x86. I
> was experimenting with using XMM/YMM registers (on AMD EPYC processor)
> and found that they do improve performance for certain lengths:
> 
> For lengths > 64 bytes up to 512 bytes: the improvement is in the range of 8% to 44%.
> For lengths > 512 bytes: some lengths show a slight improvement in the range of
> 2% to 7%, others are almost the same as the "rep stos" numbers.
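> 
> As a rough standalone illustration of the idea (my own sketch, not the patch
> below and not what the JIT emits), zeroing a block with unaligned 32-byte YMM
> stores plus a scalar tail looks like this in C++ intrinsics (compile with -mavx):
> 
>   #include <immintrin.h>   // _mm256_setzero_si256, _mm256_storeu_si256
>   #include <cstddef>
> 
>   // Zero 'len' bytes at 'p' with unaligned 32-byte YMM stores,
>   // finishing any remainder with plain byte stores.
>   static void zero_ymm(char* p, size_t len) {
>     __m256i zero = _mm256_setzero_si256();
>     size_t i = 0;
>     for (; i + 32 <= len; i += 32) {
>       _mm256_storeu_si256(reinterpret_cast<__m256i*>(p + i), zero);
>     }
>     for (; i < len; ++i) {
>       p[i] = 0;
>     }
>   }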
> 
> I have attached the complete performance data (data.txt) for reference.
> Can we add this as a user option similar to UseXMMForArrayCopy?
> 
> I have used the same test case as in
> (http://cr.openjdk.java.net/~shade/8146801/benchmarks.jar) with
> additional sizes.
> 
> Initial Patch:
> I haven't added the check for 32-bit mode as I need some help with the
> code (description given below the patch).
> The code is similar to the one used in array copy stubs (copy_bytes_forward).
> 
> diff --git a/src/hotspot/cpu/x86/globals_x86.hpp b/src/hotspot/cpu/x86/globals_x86.hpp
> --- a/src/hotspot/cpu/x86/globals_x86.hpp
> +++ b/src/hotspot/cpu/x86/globals_x86.hpp
> @@ -150,6 +150,9 @@
>     product(bool, UseUnalignedLoadStores, false,                              \
>             "Use SSE2 MOVDQU instruction for Arraycopy")                      \
>                                                                               \
> +  product(bool, UseXMMForObjInit, false,                                    \
> +          "Use XMM/YMM MOVDQU instruction for Object Initialization")       \
> +                                                                            \
>     product(bool, UseFastStosb, false,                                        \
>             "Use fast-string operation for zeroing: rep stosb")               \
>                                                                               \
> diff --git a/src/hotspot/cpu/x86/macroAssembler_x86.cpp b/src/hotspot/cpu/x86/macroAssembler_x86.cpp
> --- a/src/hotspot/cpu/x86/macroAssembler_x86.cpp
> +++ b/src/hotspot/cpu/x86/macroAssembler_x86.cpp
> @@ -7106,6 +7106,56 @@
>     if (UseFastStosb) {
>       shlptr(cnt, 3); // convert to number of bytes
>       rep_stosb();
> +  } else if (UseXMMForObjInit && UseUnalignedLoadStores) {
> +    Label L_loop, L_sloop, L_check, L_tail, L_end;
> +    push(base);
> +    if (UseAVX >= 2)
> +      vpxor(xmm10, xmm10, xmm10, AVX_256bit);
> +    else
> +      vpxor(xmm10, xmm10, xmm10, AVX_128bit);
> +
> +    jmp(L_check);
> +
> +    BIND(L_loop);
> +    if (UseAVX >= 2) {
> +      vmovdqu(Address(base,  0), xmm10);
> +      vmovdqu(Address(base, 32), xmm10);
> +    } else {
> +      movdqu(Address(base,  0), xmm10);
> +      movdqu(Address(base, 16), xmm10);
> +      movdqu(Address(base, 32), xmm10);
> +      movdqu(Address(base, 48), xmm10);
> +    }
> +    addptr(base, 64);
> +
> +    BIND(L_check);
> +    subptr(cnt, 8);
> +    jccb(Assembler::greaterEqual, L_loop);
> +    addptr(cnt, 4);
> +    jccb(Assembler::less, L_tail);
> +    // Copy trailing 32 bytes
> +    if (UseAVX >= 2) {
> +      vmovdqu(Address(base, 0), xmm10);
> +    } else {
> +      movdqu(Address(base,  0), xmm10);
> +      movdqu(Address(base, 16), xmm10);
> +    }
> +    addptr(base, 32);
> +    subptr(cnt, 4);
> +
> +    BIND(L_tail);
> +    addptr(cnt, 4);
> +    jccb(Assembler::lessEqual, L_end);
> +    decrement(cnt);
> +
> +    BIND(L_sloop);
> +    movptr(Address(base, 0), tmp);
> +    addptr(base, 8);
> +    decrement(cnt);
> +    jccb(Assembler::greaterEqual, L_sloop);
> +
> +    BIND(L_end);
> +    pop(base);
>     } else {
>       NOT_LP64(shlptr(cnt, 1);) // convert to number of 32-bit words for 32-bit VM
>       rep_stos();
> 
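> If the patch goes in as above, the new path would be selected with the new
> flag together with the existing UseUnalignedLoadStores flag, e.g.
> 
>   java -XX:+UseUnalignedLoadStores -XX:+UseXMMForObjInit -jar benchmarks.jar
> 
> (note that UseFastStosb is checked first in the branch above, so it still
> takes precedence whenever it is enabled).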
> 
> When I use XMM0 as a temporary register, the micro-benchmark crashes.
> Saving and restoring the XMM0 register before and after use works fine.
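> 
> For reference, such a save/restore is essentially a stack spill around the
> use, e.g. (sketch of the approach only, 128-bit case shown; a full 256-bit
> save would need 32 bytes and vmovdqu):
> 
>   subptr(rsp, 16);
>   movdqu(Address(rsp, 0), xmm0);   // save caller's xmm0
>   pxor(xmm0, xmm0);                // xmm0 = 0, used as the store source
>   // ... XMM/YMM store loop as in the patch, with xmm0 instead of xmm10 ...
>   movdqu(xmm0, Address(rsp, 0));   // restore xmm0
>   addptr(rsp, 16);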
> 
> Looking at the "hotspot/src/cpu/x86/vm/x86.ad" file, XMM0, like the other
> XMM registers, is listed as a Save-On-Call register, and in the Linux ABI
> no XMM register is preserved across function calls, though XMM0-XMM7 may
> hold parameters. So I assumed that using XMM0 without saving/restoring it
> should be fine.
> 
> Is it incorrect to use XMM* registers without saving/restoring them?
> Using the XMM10 register as a temporary register works fine without having
> to save and restore it.
> 
> Please let me know your comments.
> 
> Regards,
> Rohit
> 

