[aarch64-port-dev ] RFR: 8151502: aarch64: optimize pd_disjoint_words and pd_conjoint_words
Edward Nevill
edward.nevill at gmail.com
Wed Mar 9 14:07:12 UTC 2016
On Wed, 2016-03-09 at 12:57 +0000, Andrew Haley wrote:
> On 03/09/2016 12:17 PM, Edward Nevill wrote:
> > http://cr.openjdk.java.net/~enevill/8151502/JMHSample_97_GCStress.java
> >
> > JMH jar file: http://cr.openjdk.java.net/~enevill/8151502/benchmarks.jar
> >
> > The following are the results I get
>
> Not bad, but not quite perfect. But I guess you knew I'd say that.
> :-)
>
> The switch on count < threshold should be done in C, with multiple
> inline asm blocks. That way, GCC can do value range propagation for
> small copies.
Hmm. I did try a switch in my first stab at this, but gave up when I got the following output for this simple test program (stock gcc 5.2).
--- CUT HERE ---
unsigned long test_switch(unsigned long /*aka size_t*/ i)
{
    switch (i) {
    case 0: return 1;
    case 1: return i+2;
    case 2: return i;
    case 3: return i-10;
    case 4: return 11;
    case 5: return 16;
    default: return -1;
    }
}
--- CUT HERE ---
generates
----------------
test_switch:
        cmp     x0, 5
        bhi     .L2
        cmp     w0, 5                   <<<<< REALLY
        bls     .L12
.L2:
        mov     x0, -1
        ret
        .p2align 3
.L12:
        adrp    x1, .L4
        add     x1, x1, :lo12:.L4
        ldrb    w0, [x1,w0,uxtw]        <<<<< REALLY
        adr     x1, .Lrtx4
        add     x0, x1, w0, sxtb #2
        br      x0
----------------
I've filed a bug report.
I could maybe do a gcc goto table, but would that fool value range propagation?
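For reference, a minimal sketch of what a goto-table version of the test program above might look like, using the GNU C "labels as values" extension (so GCC or Clang only). The name test_goto and the table layout are illustrative assumptions, not code from the patch, and the sketch does not settle whether value range propagation survives the indirect jump.

```c
/* Sketch only: computed-goto equivalent of test_switch.
 * Relies on the GNU C labels-as-values extension (&&label, goto *ptr). */
unsigned long test_goto(unsigned long i)
{
    /* One label address per case, indexed directly by i for i in 0..5. */
    static void *const table[] = { &&c0, &&c1, &&c2, &&c3, &&c4, &&c5 };

    if (i > 5)
        return (unsigned long)-1;   /* corresponds to the default: case */

    goto *table[i];

c0: return 1;
c1: return i + 2;
c2: return i;
c3: return i - 10;
c4: return 11;
c5: return 16;
}
```

The explicit range check replaces the bounds test the compiler generates for the switch, so the table lookup itself is a single indexed load.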
>
> Also, GCC can do things like if (__builtin_constant_p(cnt)). There
> are some cases where cnt is a constant. We should be careful that we
> don't slow down such cases.
Agreed.
> void bletch() {
> Copy::disjoint_words(blah, blah2, sizeof blah / sizeof blah[0]);
> }
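A rough sketch of how the constant-count case could be kept cheap, assuming a hypothetical threshold of 4 words and plain C stores standing in for the inline asm blocks. The names copy_words_general and copy_words_sketch are made up for illustration. __builtin_constant_p only evaluates to 1 once the call has been inlined and cnt has folded to a constant, hence the always_inline attribute.

```c
#include <stddef.h>

/* Hypothetical stand-in for the general path taken when count is not
 * known at compile time. */
static void copy_words_general(const unsigned long *from, unsigned long *to,
                               size_t count)
{
    for (size_t i = 0; i < count; i++)
        to[i] = from[i];
}

static inline __attribute__((always_inline))
void copy_words_sketch(const unsigned long *from, unsigned long *to,
                       size_t count)
{
    /* If count folds to a small constant, the switch collapses to
     * straight-line stores and no runtime dispatch is emitted. */
    if (__builtin_constant_p(count) && count <= 4) {
        switch (count) {
        case 4: to[3] = from[3]; /* fall through */
        case 3: to[2] = from[2]; /* fall through */
        case 2: to[1] = from[1]; /* fall through */
        case 1: to[0] = from[0]; /* fall through */
        case 0: break;
        }
        return;
    }
    copy_words_general(from, to, count);
}
```

Both paths copy the same words, so the result does not depend on whether the compiler manages to fold count.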
>
> Finally, GCC has __builtin_expect(bool). We should use that to emit
> the large copy and the backwards copy out of line.
I did initially do the large copy out of line. My concern was that the register allocator wouldn't handle the two paths and would treat x0..x18 as corrupted on both paths, whereas the inline version 'only' uses 11 registers.
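To make the trade-off concrete, here is a minimal sketch of the out-of-line arrangement, with memcpy and a plain loop standing in for the real copy code. SMALL_COPY_MAX, copy_words_large and copy_words_expect are hypothetical names, and the threshold value is an assumption. Note that __builtin_expect takes the expression and its expected value.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical threshold, in words, between the inline and large paths. */
enum { SMALL_COPY_MAX = 8 };

/* Hypothetical large-copy routine. noinline forces a real call, so the
 * compiler treats the caller-saved registers as clobbered at the call site. */
__attribute__((noinline))
static void copy_words_large(const unsigned long *from, unsigned long *to,
                             size_t count)
{
    memcpy(to, from, count * sizeof *from);
}

static inline void copy_words_expect(const unsigned long *from,
                                     unsigned long *to, size_t count)
{
    /* Mark the large copy as unlikely so the small copy stays on the
     * fall-through path. */
    if (__builtin_expect(count > SMALL_COPY_MAX, 0)) {
        copy_words_large(from, to, count);
        return;
    }
    for (size_t i = 0; i < count; i++)
        to[i] = from[i];
}
```

Whether the call costs more than the inline version in practice depends on how many values are live across it in each caller, which is the register allocation concern described above.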
All the best,
Ed.