RFC: C2 Object Initialization - Using XMM/YMM registers
Rohit Arul Raj
rohitarulraj at gmail.com
Thu Apr 5 07:19:44 UTC 2018
Hi All,
I was going through the C2 object initialization (zeroing) code based
on the below bug entry:
https://bugs.openjdk.java.net/browse/JDK-8146801
Right now, for longer lengths we use "rep stos" instructions on x86. I
was experimenting with using XMM/YMM registers (on an AMD EPYC
processor) and found that they improve performance for certain lengths:
For lengths above 64 bytes and up to 512 bytes: improvement ranges from
8% to 44%.
For lengths above 512 bytes: some lengths show a slight improvement of
2% to 7%; others are almost the same as the "rep stos" numbers.
I have attached the complete performance data (data.txt) for reference.
Can we add this as a user option similar to UseXMMForArrayCopy?
I have used the same test case as in
(http://cr.openjdk.java.net/~shade/8146801/benchmarks.jar) with
additional sizes.
Initial Patch:
I haven't added the check for 32-bit mode as I need some help with the
code (description given below the patch).
The code is similar to the one used in array copy stubs (copy_bytes_forward).
diff --git a/src/hotspot/cpu/x86/globals_x86.hpp b/src/hotspot/cpu/x86/globals_x86.hpp
--- a/src/hotspot/cpu/x86/globals_x86.hpp
+++ b/src/hotspot/cpu/x86/globals_x86.hpp
@@ -150,6 +150,9 @@
product(bool, UseUnalignedLoadStores, false, \
"Use SSE2 MOVDQU instruction for Arraycopy") \
\
+ product(bool, UseXMMForObjInit, false, \
+ "Use XMM/YMM MOVDQU instruction for Object Initialization") \
+ \
product(bool, UseFastStosb, false, \
"Use fast-string operation for zeroing: rep stosb") \
\
diff --git a/src/hotspot/cpu/x86/macroAssembler_x86.cpp b/src/hotspot/cpu/x86/macroAssembler_x86.cpp
--- a/src/hotspot/cpu/x86/macroAssembler_x86.cpp
+++ b/src/hotspot/cpu/x86/macroAssembler_x86.cpp
@@ -7106,6 +7106,56 @@
if (UseFastStosb) {
shlptr(cnt, 3); // convert to number of bytes
rep_stosb();
+ } else if (UseXMMForObjInit && UseUnalignedLoadStores) {
+ Label L_loop, L_sloop, L_check, L_tail, L_end;
+ push(base);
+ if (UseAVX >= 2)
+ vpxor(xmm10, xmm10, xmm10, AVX_256bit);
+ else
+ vpxor(xmm10, xmm10, xmm10, AVX_128bit);
+
+ jmp(L_check);
+
+ BIND(L_loop);
+ if (UseAVX >= 2) {
+ vmovdqu(Address(base, 0), xmm10);
+ vmovdqu(Address(base, 32), xmm10);
+ } else {
+ movdqu(Address(base, 0), xmm10);
+ movdqu(Address(base, 16), xmm10);
+ movdqu(Address(base, 32), xmm10);
+ movdqu(Address(base, 48), xmm10);
+ }
+ addptr(base, 64);
+
+ BIND(L_check);
+ subptr(cnt, 8);
+ jccb(Assembler::greaterEqual, L_loop);
+ addptr(cnt, 4);
+ jccb(Assembler::less, L_tail);
+ // Zero trailing 32 bytes
+ if (UseAVX >= 2) {
+ vmovdqu(Address(base, 0), xmm10);
+ } else {
+ movdqu(Address(base, 0), xmm10);
+ movdqu(Address(base, 16), xmm10);
+ }
+ addptr(base, 32);
+ subptr(cnt, 4);
+
+ BIND(L_tail);
+ addptr(cnt, 4);
+ jccb(Assembler::lessEqual, L_end);
+ decrement(cnt);
+
+ BIND(L_sloop);
+ movptr(Address(base, 0), tmp);
+ addptr(base, 8);
+ decrement(cnt);
+ jccb(Assembler::greaterEqual, L_sloop);
+
+ BIND(L_end);
+ pop(base);
} else {
NOT_LP64(shlptr(cnt, 1);) // convert to number of 32-bit words for 32-bit VM
rep_stos();
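
To make the control flow of the new branch easier to follow, here is a
minimal scalar model of it in plain C++. It is only a sketch under stated
assumptions: cnt is taken to be a count of 8-byte words (as on the 64-bit
path), std::memset stands in for the vmovdqu/movdqu/movptr stores, and the
names zero_words_model and check_zeroing are hypothetical helpers, not
HotSpot code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Scalar model of the patched branch. Returns the number of bytes zeroed.
// memset(p, 0, N) stands in for the N-byte vector or word stores.
static std::size_t zero_words_model(unsigned char* base, std::int64_t cnt) {
    unsigned char* p = base;

    // L_check / L_loop: zero 64 bytes (8 words) per iteration.
    while ((cnt -= 8) >= 0) {
        std::memset(p, 0, 64);
        p += 64;
    }
    // Here cnt is in [-8, -1]; adding 4 tells us whether >= 4 words remain.
    cnt += 4;
    if (cnt >= 0) {
        std::memset(p, 0, 32);  // trailing 32 bytes
        p += 32;
        cnt -= 4;
    }
    // L_tail: cnt is in [-4, -1]; adding 4 yields the 0..3 remaining words.
    cnt += 4;
    if (cnt > 0) {
        --cnt;
        do {                    // L_sloop: one 8-byte word per iteration
            std::memset(p, 0, 8);
            p += 8;
            --cnt;
        } while (cnt >= 0);
    }
    return static_cast<std::size_t>(p - base);
}

// Checks that exactly `words` * 8 bytes are zeroed and nothing beyond them.
static bool check_zeroing(std::int64_t words) {
    const std::size_t bytes = static_cast<std::size_t>(words) * 8;
    const std::size_t guard = 64;
    std::vector<unsigned char> buf(bytes + guard, 0xAB);
    if (zero_words_model(buf.data(), words) != bytes) return false;
    for (std::size_t i = 0; i < bytes; ++i)
        if (buf[i] != 0) return false;
    for (std::size_t i = bytes; i < bytes + guard; ++i)
        if (buf[i] != 0xAB) return false;
    return true;
}
```

For example, a count of 13 words goes through one 64-byte iteration, one
32-byte store and one 8-byte store, for 104 bytes in total.
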
When I use XMM0 as the temporary register, the micro-benchmark crashes.
Saving XMM0 before use and restoring it afterwards works fine.
In the "hotspot/src/cpu/x86/vm/x86.ad" file, XMM0, like the other XMM
registers, is listed as a Save-On-Call register, and under the Linux
ABI no XMM register is preserved across function calls, though
XMM0-XMM7 might hold parameters. So I assumed that using XMM0 without
saving/restoring it should be fine.
Is it incorrect to use XMM* registers without saving/restoring them?
Using XMM10 as the temporary register works fine without having
to save and restore it.
Please let me know your comments.
Regards,
Rohit
-------------- next part --------------
Baseline = JDK11 trunk code; Patched = JDK11 trunk - ymm 64b loop. All timings and variances are in ns/op.
-----------------------------------------------------------------------------------------------------------------
|S.No | Array Size | Total Size | Baseline Const | variance | Baseline Variable | variance | Patched Const | variance | Patched Variable | variance | %diff Const | %diff Variable |
|-----|------------|------------|----------------|----------|-------------------|----------|---------------|----------|------------------|----------|-------------|----------------|
| 1 | 0 | 0 | 8.59 | 0.00 | 8.98 | 0.01 | 8.59 | 0.00 | 8.98 | 0.01 | 0.01% | -0.03% |
| 2 | 1 | 8 | 8.98 | 0.00 | 9.42 | 0.02 | 8.98 | 0.01 | 9.43 | 0.02 | 0.01% | -0.10% |
| 3 | 2 | 8 | 8.98 | 0.00 | 9.43 | 0.01 | 8.98 | 0.00 | 9.43 | 0.02 | 0.01% | -0.05% |
| 4 | 4 | 16 | 9.38 | 0.00 | 9.76 | 0.02 | 9.38 | 0.00 | 9.75 | 0.01 | 0.02% | 0.05% |
| 5 | 8 | 32 | 10.29 | 0.03 | 10.63 | 0.00 | 10.27 | 0.00 | 10.64 | 0.01 | 0.18% | -0.09% |
| 6 | 16 | 64 | 12.10 | 0.02 | 12.57 | 0.02 | 12.09 | 0.01 | 12.55 | 0.01 | 0.08% | 0.18% |
| 7 | 24 | 96 | 15.21 | 0.47 | 20.66 | 0.59 | 12.71 | 0.20 | 12.78 | 0.04 | 16.45% | 38.15% |<==
| 8 | 32 | 128 | 16.83 | 0.01 | 23.40 | 0.59 | 15.37 | 0.06 | 15.55 | 0.06 | 8.69% | 33.54% |
| 9 | 40 | 160 | 18.99 | 0.02 | 24.53 | 0.69 | 17.32 | 0.05 | 17.57 | 0.04 | 8.80% | 28.37% |
| 10 | 56 | 224 | 27.28 | 0.26 | 31.04 | 0.21 | 21.85 | 0.14 | 22.77 | 0.04 | 19.88% | 26.65% |
| 11 | 64 | 256 | 31.02 | 0.13 | 51.65 | 0.59 | 24.73 | 0.14 | 29.22 | 0.16 | 20.27% | 43.42% |
| 12 | 96 | 384 | 59.82 | 0.10 | 64.09 | 0.12 | 50.46 | 0.11 | 53.13 | 0.24 | 15.64% | 17.09% |
| 13 | 128 | 512 | 69.83 | 0.59 | 71.77 | 0.61 | 63.34 | 0.13 | 64.45 | 0.62 | 9.29% | 10.20% |<==
| 14 | 136 | 544 | 74.07 | 1.01 | 74.98 | 0.32 | 68.93 | 0.27 | 69.37 | 0.21 | 6.94% | 7.48% |
| 15 | 256 | 1 KB | 121.87 | 0.29 | 122.32 | 0.21 | 117.21 | 0.50 | 119.35 | 0.24 | 3.83% | 2.43% |
| 16 | 512 | 2 KB | 219.58 | 1.11 | 223.36 | 0.47 | 216.32 | 5.19 | 220.73 | 0.24 | 1.49% | 1.18% |
| 17 | 808 | 3 KB | 323.24 | 0.64 | 342.78 | 2.15 | 319.93 | 0.81 | 331.41 | 0.46 | 1.02% | 3.32% |
| 18 | 1024 | 4 KB | 421.69 | 0.85 | 451.06 | 1.22 | 398.47 | 0.60 | 430.78 | 0.75 | 5.51% | 4.50% |
| 19 | 2048 | 8 KB | 857.81 | 0.77 | 865.11 | 0.95 | 810.42 | 0.50 | 847.25 | 0.52 | 5.53% | 2.06% |
| 20 | 4096 | 16 KB |1612.11 | 5.13 |1613.38 | 2.29 |1583.33 | 7.34 |1598.82 | 1.90 | 1.79% | 0.90% |
| 21 | 8192 | 32 KB |3100.39 | 3.48 |3094.99 | 4.22 |3067.74 | 2.86 |3069.04 | 16.91 | 1.05% | 0.84% |
| 22 | 16384 | 64 KB |6059.48 | 10.50 |6073.39 | 8.60 |5978.82 | 4.18 |5971.72 | 5.98 | 1.33% | 1.67% |
| 23 | 32768 | 128 KB |12109.75| 29.16 |12178.25| 34.66 |11880.35| 8.51 |11861.76| 12.38 | 1.89% | 2.60% |
| 24 | 65536 | 256 KB |24303.89| 26.84 |24404.13| 37.95 |23606.47| 15.52 |23624.83| 39.52 | 2.87% | 3.19% |
| 25 | 131072 | 512 KB |49467.66| 95.65 |49216.41| 42.92 |48873.85| 70.28 |48599.38| 195.03 | 1.20% | 1.25% |
| 26 | 262144 | 1 MB |102971.2|3149.34 |102631.9|3168.90 |100962.7|3610.16 |100691.8|3528.52 | 1.95% | 1.89% |
| 27 | 524288 | 2 MB |223155.5| 286.52 |224287.9| 324.00 |223133.0| 283.44 |222802.6| 517.78 | 0.01% | 0.66% |
| 28 |1048576 | 4 MB |447718.2| 221.75 |447240.2| 430.55 |445605.1| 617.14 |440841.5| 323.20 | 0.47% | 1.43% |
| 29 |2097152 | 8 MB |891545.5| 968.99 |890070.5| 502.85 |888538.5| 775.27 |880552.1|2235.50 | 0.34% | 1.07% |
-----------------------------------------------------------------------------------------------------------------
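
The two %dif columns appear consistent with the usual relative-improvement
formula, (baseline - patched) / baseline * 100. The table does not state
this, so treat it as an assumption. A minimal C++ sketch follows, with
percent_improvement as a hypothetical helper name. Because the table prints
rounded timings, recomputed values can differ from the printed ones in the
last digit.

```cpp
#include <cassert>
#include <cmath>

// Relative improvement of the patched timing over the baseline, in percent.
// Positive means the patched code is faster. Assumed formula, see lead-in.
static double percent_improvement(double baseline_ns, double patched_ns) {
    return (baseline_ns - patched_ns) / baseline_ns * 100.0;
}
```

With the Const timings of row 7 (15.21 ns/op baseline, 12.71 ns/op patched),
this gives roughly 16.4%, in line with the 16.45% printed in the table.
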