RFC: C2 Object Initialization - Using XMM/YMM registers

Rohit Arul Raj rohitarulraj at gmail.com
Thu Apr 5 07:19:44 UTC 2018


Hi All,

I was going through the C2 object initialization (zeroing) code based
on the below bug entry:
https://bugs.openjdk.java.net/browse/JDK-8146801

Right now, for longer lengths we use the "rep stos" instruction on x86. I
experimented with using XMM/YMM registers instead (on an AMD EPYC processor)
and found that they improve performance for certain lengths:

For lengths > 64 bytes and <= 512 bytes: improvement in the range of 8% to 44%.
For lengths > 512 bytes: some lengths show a slight improvement in the
range of 2% to 7%; others are almost the same as the "rep stos" numbers.

I have attached the complete performance data (data.txt) for reference.
Can we add this as a user option similar to UseXMMForArrayCopy?

I have used the same test case as in
(http://cr.openjdk.java.net/~shade/8146801/benchmarks.jar) with
additional sizes.

Initial Patch:
I haven't added the check for 32-bit mode as I need some help with the
code (description given below the patch).
The code is similar to the one used in array copy stubs (copy_bytes_forward).

diff --git a/src/hotspot/cpu/x86/globals_x86.hpp b/src/hotspot/cpu/x86/globals_x86.hpp
--- a/src/hotspot/cpu/x86/globals_x86.hpp
+++ b/src/hotspot/cpu/x86/globals_x86.hpp
@@ -150,6 +150,9 @@
   product(bool, UseUnalignedLoadStores, false,                              \
           "Use SSE2 MOVDQU instruction for Arraycopy")                      \
                                                                             \
+  product(bool, UseXMMForObjInit, false,                                    \
+          "Use XMM/YMM MOVDQU instruction for Object Initialization")       \
+                                                                            \
   product(bool, UseFastStosb, false,                                        \
           "Use fast-string operation for zeroing: rep stosb")               \
                                                                             \
diff --git a/src/hotspot/cpu/x86/macroAssembler_x86.cpp b/src/hotspot/cpu/x86/macroAssembler_x86.cpp
--- a/src/hotspot/cpu/x86/macroAssembler_x86.cpp
+++ b/src/hotspot/cpu/x86/macroAssembler_x86.cpp
@@ -7106,6 +7106,56 @@
   if (UseFastStosb) {
     shlptr(cnt, 3); // convert to number of bytes
     rep_stosb();
+  } else if (UseXMMForObjInit && UseUnalignedLoadStores) {
+    Label L_loop, L_sloop, L_check, L_tail, L_end;
+    push(base);
+    if (UseAVX >= 2)
+      vpxor(xmm10, xmm10, xmm10, AVX_256bit);
+    else
+      vpxor(xmm10, xmm10, xmm10, AVX_128bit);
+
+    jmp(L_check);
+
+    BIND(L_loop);
+    if (UseAVX >= 2) {
+      vmovdqu(Address(base,  0), xmm10);
+      vmovdqu(Address(base, 32), xmm10);
+    } else {
+      movdqu(Address(base,  0), xmm10);
+      movdqu(Address(base, 16), xmm10);
+      movdqu(Address(base, 32), xmm10);
+      movdqu(Address(base, 48), xmm10);
+    }
+    addptr(base, 64);
+
+    BIND(L_check);
+    subptr(cnt, 8);
+    jccb(Assembler::greaterEqual, L_loop);
+    addptr(cnt, 4);
+    jccb(Assembler::less, L_tail);
+    // Zero trailing 32 bytes
+    if (UseAVX >= 2) {
+      vmovdqu(Address(base, 0), xmm10);
+    } else {
+      movdqu(Address(base,  0), xmm10);
+      movdqu(Address(base, 16), xmm10);
+    }
+    addptr(base, 32);
+    subptr(cnt, 4);
+
+    BIND(L_tail);
+    addptr(cnt, 4);
+    jccb(Assembler::lessEqual, L_end);
+    decrement(cnt);
+
+    BIND(L_sloop);
+    movptr(Address(base, 0), tmp);
+    addptr(base, 8);
+    decrement(cnt);
+    jccb(Assembler::greaterEqual, L_sloop);
+
+    BIND(L_end);
+    pop(base);
   } else {
     NOT_LP64(shlptr(cnt, 1);) // convert to number of 32-bit words for 32-bit VM
     rep_stos();
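
For readers following the control flow of the new branch, here is a minimal
scalar sketch of what the labels L_loop, L_check, L_tail and L_sloop appear
to implement on a 64-bit VM, where cnt is a count of 8-byte words. The
function name zero_words_model is hypothetical and not part of the patch,
and memset stands in for the vector and scalar stores.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical scalar model of the patched branch; not HotSpot code.
// 'cnt' is the number of 8-byte words to zero (64-bit VM assumption).
void zero_words_model(uint64_t* base, std::ptrdiff_t cnt) {
  unsigned char* p = reinterpret_cast<unsigned char*>(base);

  // L_loop / L_check: 64 bytes (8 words) per iteration. In the patch this
  // is two 32-byte YMM stores or four 16-byte XMM stores.
  while (cnt >= 8) {
    std::memset(p, 0, 64);
    p += 64;
    cnt -= 8;
  }

  // At most one 32-byte step when 4 or more words remain.
  if (cnt >= 4) {
    std::memset(p, 0, 32);
    p += 32;
    cnt -= 4;
  }

  // L_tail / L_sloop: the remaining 0 to 3 words, one 8-byte store each.
  while (cnt > 0) {
    std::memset(p, 0, 8);
    p += 8;
    --cnt;
  }
}
```

The sketch only models which bytes get written; it says nothing about the
register-usage question raised below.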


When I use XMM0 as the temporary register, the micro-benchmark crashes.
Saving and restoring XMM0 before and after use works fine.

Looking at the "hotspot/src/cpu/x86/vm/x86.ad" file, XMM0, like the
other XMM registers, is listed as a Save-On-Call register, and under the
Linux ABI no XMM register is preserved across function calls, though
XMM0-XMM7 may hold parameters. So I assumed that using XMM0 without
saving/restoring it should be fine.

Is it incorrect to use XMM* registers without saving/restoring them?
Using XMM10 as the temporary register works fine without having
to save and restore it.

Please let me know your comments.

Regards,
Rohit
-------------- next part --------------
 -----------------------------------------------------------------------------------------------------------------
|S.No |  Array |        |        JDK11 trunk code ns/op     |         JDK11 trunk - ymm 64b loop ns/op            |
|     |  Size  | Total  |                                   |                                                     |
|     |        | Size   |-----------------------------------|-----------------------------------------------------|
|     |        |        | Const  |variance|Variable|variance| Const  |variance|Variable|variance|%dif Con|%dif var|
|-----|--------|--------|-----------------------------------|-----------------------------------------------------|
|  1  |   0    |   0    |  8.59  |  0.00  |  8.98  |  0.01  |  8.59  |  0.00  |  8.98  |  0.01  | 0.01%  | -0.03% |
|  2  |   1    |   8    |  8.98  |  0.00  |  9.42  |  0.02  |  8.98  |  0.01  |  9.43  |  0.02  | 0.01%  | -0.10% |
|  3  |   2    |   8    |  8.98  |  0.00  |  9.43  |  0.01  |  8.98  |  0.00  |  9.43  |  0.02  | 0.01%  | -0.05% |
|  4  |   4    |   16   |  9.38  |  0.00  |  9.76  |  0.02  |  9.38  |  0.00  |  9.75  |  0.01  | 0.02%  | 0.05%  |
|  5  |   8    |   32   | 10.29  |  0.03  | 10.63  |  0.00  | 10.27  |  0.00  | 10.64  |  0.01  | 0.18%  | -0.09% |
|  6  |   16   |   64   | 12.10  |  0.02  | 12.57  |  0.02  | 12.09  |  0.01  | 12.55  |  0.01  | 0.08%  | 0.18%  |
|  7  |   24   |   96   | 15.21  |  0.47  | 20.66  |  0.59  | 12.71  |  0.20  | 12.78  |  0.04  | 16.45% | 38.15% |<==
|  8  |   32   |  128   | 16.83  |  0.01  | 23.40  |  0.59  | 15.37  |  0.06  | 15.55  |  0.06  | 8.69%  | 33.54% |
|  9  |   40   |  160   | 18.99  |  0.02  | 24.53  |  0.69  | 17.32  |  0.05  | 17.57  |  0.04  | 8.80%  | 28.37% |
| 10  |   56   |  224   | 27.28  |  0.26  | 31.04  |  0.21  | 21.85  |  0.14  | 22.77  |  0.04  | 19.88% | 26.65% |
| 11  |   64   |  256   | 31.02  |  0.13  | 51.65  |  0.59  | 24.73  |  0.14  | 29.22  |  0.16  | 20.27% | 43.42% |
| 12  |   96   |  384   | 59.82  |  0.10  | 64.09  |  0.12  | 50.46  |  0.11  | 53.13  |  0.24  | 15.64% | 17.09% |
| 13  |  128   |  512   | 69.83  |  0.59  | 71.77  |  0.61  | 63.34  |  0.13  | 64.45  |  0.62  | 9.29%  | 10.20% |<==
| 14  |  136   |  544   | 74.07  |  1.01  | 74.98  |  0.32  | 68.93  |  0.27  | 69.37  |  0.21  | 6.94%  | 7.48%  |
| 15  |  256   |  1 KB  | 121.87 |  0.29  | 122.32 |  0.21  | 117.21 |  0.50  | 119.35 |  0.24  | 3.83%  | 2.43%  |
| 16  |  512   |  2 KB  | 219.58 |  1.11  | 223.36 |  0.47  | 216.32 |  5.19  | 220.73 |  0.24  | 1.49%  | 1.18%  |
| 17  |  808   |  3 KB  | 323.24 |  0.64  | 342.78 |  2.15  | 319.93 |  0.81  | 331.41 |  0.46  | 1.02%  | 3.32%  |
| 18  |  1024  |  4 KB  | 421.69 |  0.85  | 451.06 |  1.22  | 398.47 |  0.60  | 430.78 |  0.75  | 5.51%  | 4.50%  |
| 19  |  2048  |  8 KB  | 857.81 |  0.77  | 865.11 |  0.95  | 810.42 |  0.50  | 847.25 |  0.52  | 5.53%  | 2.06%  |
| 20  |  4096  | 16 KB  |1612.11 |  5.13  |1613.38 |  2.29  |1583.33 |  7.34  |1598.82 |  1.90  | 1.79%  | 0.90%  |
| 21  |  8192  | 32 KB  |3100.39 |  3.48  |3094.99 |  4.22  |3067.74 |  2.86  |3069.04 | 16.91  | 1.05%  | 0.84%  |
| 22  | 16384  | 64 KB  |6059.48 | 10.50  |6073.39 |  8.60  |5978.82 |  4.18  |5971.72 |  5.98  | 1.33%  | 1.67%  |
| 23  | 32768  | 128 KB |12109.75| 29.16  |12178.25| 34.66  |11880.35|  8.51  |11861.76| 12.38  | 1.89%  | 2.60%  |
| 24  | 65536  | 256 KB |24303.89| 26.84  |24404.13| 37.95  |23606.47| 15.52  |23624.83| 39.52  | 2.87%  | 3.19%  |
| 25  | 131072 | 512 KB |49467.66| 95.65  |49216.41| 42.92  |48873.85| 70.28  |48599.38| 195.03 | 1.20%  | 1.25%  |
| 26  | 262144 |  1 MB  |102971.2|3149.34 |102631.9|3168.90 |100962.7|3610.16 |100691.8|3528.52 | 1.95%  | 1.89%  |
| 27  | 524288 |  2 MB  |223155.5| 286.52 |224287.9| 324.00 |223133.0| 283.44 |222802.6| 517.78 | 0.01%  | 0.66%  |
| 28  |1048576 |  4 MB  |447718.2| 221.75 |447240.2| 430.55 |445605.1| 617.14 |440841.5| 323.20 | 0.47%  | 1.43%  |
| 29  |2097152 |  8 MB  |891545.5| 968.99 |890070.5| 502.85 |888538.5| 775.27 |880552.1|2235.50 | 0.34%  | 1.07%  |
 -----------------------------------------------------------------------------------------------------------------

