RFR: 8151502: aarch64: optimize pd_disjoint_words and pd_conjoint_words
Hi,

Please review the following webrev

  http://cr.openjdk.java.net/~enevill/8151502/webrev/

This optimizes Copy::pd_disjoint_words and Copy::pd_conjoint_words using inline assembler. These routines are heavily used in GC and the aim is to improve the overall performance of GC.

Tested in JMH using the following GCStress program.

  http://cr.openjdk.java.net/~enevill/8151502/JMHSample_97_GCStress.java

JMH jar file:

  http://cr.openjdk.java.net/~enevill/8151502/benchmarks.jar

The following are the results I get

Original:

  /home/ed/images/jdk9-orig/bin/java -jar target/benchmarks.jar -i 5 -wi 5 -f 5

  Result "gcstress":
    24636979.087 ±(99.9%) 267838.773 us/op [Average]
    (min, avg, max) = (24102797.710, 24636979.087, 25372022.370), stdev = 357557.099
    CI (99.9%): [24369140.314, 24904817.860] (assumes normal distribution)

  # Run complete. Total time: 00:20:55

  Benchmark                       Mode  Cnt         Score        Error  Units
  JMHSample_97_GCStress.gcstress  avgt   25  24636979.087 ± 267838.773  us/op

---------------------------------------------------------------------------

Optimized:

  /home/ed/images/jdk9-test/bin/java -jar target/benchmarks.jar -i 5 -wi 5 -f 5

  Result "gcstress":
    20164420.762 ±(99.9%) 280305.425 us/op [Average]
    (min, avg, max) = (19738992.960, 20164420.762, 21137460.090), stdev = 374199.723
    CI (99.9%): [19884115.337, 20444726.188] (assumes normal distribution)

  # Run complete. Total time: 00:17:06

  Benchmark                       Mode  Cnt         Score        Error  Units
  JMHSample_97_GCStress.gcstress  avgt   25  20164420.762 ± 280305.425  us/op

This shows approx 22% performance improvement on this benchmark.

I have also included a small bug fix to Array copy when using -XX:+UseSIMDForMemoryOps. I had fixed this previously, but somehow it fell out.

All the best,
Ed
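As a quick check on the figure quoted above, the two average scores can be compared directly. The following is only an illustrative sketch: the function and constant names are hypothetical, and the numbers are copied from the JMH output in the message.

```cpp
#include <cassert>

// Average scores copied from the JMH output above (us/op, lower is better).
constexpr double kOriginalUsPerOp  = 24636979.087;
constexpr double kOptimizedUsPerOp = 20164420.762;

// How many times more work fits into the same amount of time.
inline double speedup_ratio(double before, double after) {
  assert(before > 0.0 && after > 0.0);
  return before / after;
}

// Fraction of the time per operation that was saved.
inline double time_saved_fraction(double before, double after) {
  assert(before > 0.0 && after > 0.0);
  return (before - after) / before;
}
```

With these inputs the ratio comes out at roughly 1.22, which matches the "approx 22%" quoted, while the time per operation drops by roughly 18%. Both numbers describe the same measurement from different directions.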
On 03/09/2016 12:17 PM, Edward Nevill wrote:
http://cr.openjdk.java.net/~enevill/8151502/JMHSample_97_GCStress.java
JMH jar file: http://cr.openjdk.java.net/~enevill/8151502/benchmarks.jar
The following are the results I get
Not bad, but not quite perfect. But I guess you knew I'd say that. :-)

The switch on count < threshold should be done in C, with multiple inline asm blocks. That way, GCC can do value range propagation for small copies.

Also, GCC can do things like if (__builtin_constant_p(cnt)). There are some cases where cnt is a constant. We should be careful that we don't slow down such cases. GCC does this:

   0x0000007fb774850c <+0>:     adrp    x2, 0x7fb7db1000
   0x0000007fb7748510 <+4>:     add     x0, x2, #0xe20
   0x0000007fb7748514 <+8>:     ldr     x5, [x0,#56]
   0x0000007fb7748518 <+12>:    ldr     x4, [x0,#48]
   0x0000007fb774851c <+16>:    ldr     x3, [x0,#40]
   0x0000007fb7748520 <+20>:    ldr     x1, [x0,#32]
   0x0000007fb7748524 <+24>:    str     x5, [x0,#24]
   0x0000007fb7748528 <+28>:    str     x4, [x0,#16]
   0x0000007fb774852c <+32>:    str     x3, [x0,#8]
   0x0000007fb7748530 <+36>:    str     x1, [x2,#3616]

for this:

  HeapWord blah[4];
  HeapWord blah2[4];

  void bletch() {
    Copy::disjoint_words(blah, blah2, sizeof blah / sizeof blah[0]);
  }

GCC also has __builtin_expect(bool). We should use that to emit the large copy and the backwards copy out of line.

Finally, GCC knows that copying from one object to the other copies the contents. It can do copy propagation.

I think that this change can be done with no performance regressions, either in code size or speed, for any range of arguments.

Andrew.
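The structure suggested above might look roughly like the sketch below. It is not the code from the webrev: the names Word, kSmallCopyWords, copy_words_large and copy_disjoint_words are hypothetical, the threshold value is an assumption, and plain loops stand in for the inline assembler so that the example builds on any GCC or Clang target.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for HotSpot's HeapWord: one machine word (assumption).
typedef std::uintptr_t Word;

// Hypothetical threshold between "small" and "large" copies.
static const std::size_t kSmallCopyWords = 8;

// Large copies live out of line. In a real port this is where the
// hand-written assembler would go; a plain loop stands in for it here.
__attribute__((noinline))
void copy_words_large(const Word* from, Word* to, std::size_t count) {
  assert(from != 0 && to != 0);
  for (std::size_t i = 0; i < count; ++i) to[i] = from[i];
}

// Disjoint word copy with the dispatch written in C++ rather than asm.
static inline void copy_disjoint_words(const Word* from, Word* to,
                                       std::size_t count) {
  // When the count is known at compile time (after inlining, with
  // optimization on) and small, leave everything to the compiler so
  // it can unroll into straight loads and stores.
  if (__builtin_constant_p(count) && count < kSmallCopyWords) {
    for (std::size_t i = 0; i < count; ++i) to[i] = from[i];
    return;
  }
  // Mark the large copy as the unlikely case so the call is laid out
  // away from the hot path.
  if (__builtin_expect(count >= kSmallCopyWords, 0)) {
    copy_words_large(from, to, count);
    return;
  }
  // Small run-time count: the compiler knows count < kSmallCopyWords.
  for (std::size_t i = 0; i < count; ++i) to[i] = from[i];
}
```

A call such as copy_disjoint_words(blah, blah2, 4) with a literal count takes the first branch once the function is inlined at -O2; without optimization __builtin_constant_p evaluates to 0 and the run-time paths are used instead, with the same result.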
On Wed, 2016-03-09 at 12:57 +0000, Andrew Haley wrote:
Not bad, but not quite perfect. But I guess you knew I'd say that. :-)
The switch on count < threshold should be done in C, with multiple inline asm blocks. That way, GCC can do value range propagation for small copies.
Hmm. I did try using switch on my first stab at this but gave up when I got the following output for this simple test program (this is with stock gcc 5.2).

--- CUT HERE ---
unsigned long test_switch(unsigned long /*aka size_t*/ i)
{
  switch (i) {
  case 0: return 1;
  case 1: return i+2;
  case 2: return i;
  case 3: return i-10;
  case 4: return 11;
  case 5: return 16;
  default: return -1;
  }
}
--- CUT HERE ---

generates

----------------
test_switch:
        cmp     x0, 5
        bhi     .L2
        cmp     w0, 5                   <<<<< REALLY
        bls     .L12
.L2:
        mov     x0, -1
        ret
        .p2align 3
.L12:
        adrp    x1, .L4
        add     x1, x1, :lo12:.L4
        ldrb    w0, [x1,w0,uxtw]        <<<<< REALLY
        adr     x1, .Lrtx4
        add     x0, x1, w0, sxtb #2
        br      x0
----------------

I've filed a bug report.

I could maybe do a gcc goto table, but would that fool value range propagation?
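For reference, a "gcc goto table" version of the same test program could be sketched as below. It relies on GCC's labels-as-values extension (also accepted by Clang, but not ISO C++). The function name test_goto_table is hypothetical, and the sketch says nothing about whether value range propagation sees through the indirect jump, which is the open question above.

```cpp
#include <cassert>

// Same mapping as test_switch, dispatched through a table of label
// addresses instead of a switch statement.
unsigned long test_goto_table(unsigned long i) {
  // &&label yields the address of a label (GCC extension).
  static void* const targets[] = {
    &&case0, &&case1, &&case2, &&case3, &&case4, &&case5
  };

  // Equivalent of the switch's default case (-1 as unsigned long).
  if (i > 5) return static_cast<unsigned long>(-1);
  assert(i <= 5);  // the table index is in range from here on

  goto *targets[i];

case0: return 1;
case1: return i + 2;
case2: return i;
case3: return i - 10;  // wraps around, exactly as in test_switch
case4: return 11;
case5: return 16;
}
```

The explicit range check before the indirect jump plays the role of the bhi in the generated code; unlike the switch, nothing stops a caller of a computed goto from indexing past the table if that check is left out.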
Also, GCC can do things like if (__builtin_constant_p(cnt)). There are some cases where cnt is a constant. We should be careful that we don't slow down such cases.
Agreed.
void bletch() { Copy::disjoint_words(blah, blah2, sizeof blah / sizeof blah[0]); }
Finally, GCC has __builtin_expect(bool). We should use that to emit the large copy and the backwards copy out of line.
I did initially do the large copy out of line. My concern was that the register allocator wouldn't handle the two paths and would treat x0..x18 as corrupted on both paths, whereas the inline version 'only' uses 11 registers. All the best, Ed.
On 03/09/2016 02:07 PM, Edward Nevill wrote:
On Wed, 2016-03-09 at 12:57 +0000, Andrew Haley wrote:
Not bad, but not quite perfect. But I guess you knew I'd say that. :-)
The switch on count < threshold should be done in C, with multiple inline asm blocks. That way, GCC can do value range propagation for small copies.
Hmm. I did try using switch on my first stab at this but gave up when I got the following output for this simple test program (this is with stock gcc 5.2).
I was just thinking of if (cnt < 8) small() else large();
void bletch() { Copy::disjoint_words(blah, blah2, sizeof blah / sizeof blah[0]); }
Finally, GCC has __builtin_expect(bool). We should use that to emit the large copy and the backwards copy out of line.
I did initially do the large copy out of line. My concern was that the register allocator wouldn't handle the two paths and would treat x0..x18 as corrupted on both paths, whereas the inline version 'only' uses 11 registers.
You can do the pushing and popping yourself. Andrew.
On Wed, 2016-03-09 at 14:11 +0000, Andrew Haley wrote:
On 03/09/2016 02:07 PM, Edward Nevill wrote:
On Wed, 2016-03-09 at 12:57 +0000, Andrew Haley wrote:
Not bad, but not quite perfect. But I guess you knew I'd say that. :-)
I'll settle for good enough!

What about this one?

  http://cr.openjdk.java.net/~enevill/8151502/webrev.3

So this gets the following from GCStress

  Benchmark                       Mode  Cnt         Score        Error  Units
  JMHSample_97_GCStress.gcstress  avgt   25  20171328.764 ± 284468.532  us/op

Whereas previously the best was

  Benchmark                       Mode  Cnt         Score        Error  Units
  JMHSample_97_GCStress.gcstress  avgt   25  20164420.762 ± 280305.425  us/op

I.e. no significant difference. But it does inline less code and handles the case where count is a constant.

All the best,
Ed.
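One rough way to read "no significant difference" from the numbers above is to check whether the reported score ± error ranges overlap. This is only a heuristic sketch, not a proper statistical test. The Score struct and the names below are hypothetical; the values are copied from the JMH tables in this thread.

```cpp
#include <cassert>

// One row of JMH output: the "Score" and "Error" columns, in us/op.
struct Score {
  double mean;
  double error;
};

// True when [mean - error, mean + error] of a and b share any point.
inline bool intervals_overlap(Score a, Score b) {
  assert(a.error >= 0.0 && b.error >= 0.0);
  return a.mean - a.error <= b.mean + b.error &&
         b.mean - b.error <= a.mean + a.error;
}

// Values copied from the results quoted in the thread.
const Score kOriginal = {24636979.087, 267838.773};
const Score kPrevBest = {20164420.762, 280305.425};
const Score kWebrev3  = {20171328.764, 284468.532};
```

Here intervals_overlap(kWebrev3, kPrevBest) is true, since the means differ by far less than either error, while intervals_overlap(kOriginal, kPrevBest) is false, which is consistent with the improvement reported at the start of the thread.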
Participants (2): Andrew Haley, Edward Nevill