[aarch64-port-dev ] RFR: 8151502: aarch64: optimize pd_disjoint_words and pd_conjoint_words

Edward Nevill edward.nevill at gmail.com
Wed Mar 9 14:07:12 UTC 2016


On Wed, 2016-03-09 at 12:57 +0000, Andrew Haley wrote:
> On 03/09/2016 12:17 PM, Edward Nevill wrote:
> > http://cr.openjdk.java.net/~enevill/8151502/JMHSample_97_GCStress.java
> > 
> > JMH jar file: http://cr.openjdk.java.net/~enevill/8151502/benchmarks.jar
> > 
> > The following are the results I get
> 
> Not bad, but not quite perfect.  But I guess you knew I'd say that.
> :-)
> 
> The switch on count < threshold should be done in C, with multiple
> inline asm blocks.  That way, GCC can do value range propagation for
> small copies.

Hmm. I did try a switch in my first stab at this, but gave up when I got the following output for this simple test program (this is with stock gcc 5.2).

--- CUT HERE ---

unsigned long test_switch(unsigned long /*aka size_t*/ i)
{
        switch (i) {
                case 0: return 1;
                case 1: return i+2;
                case 2: return i;
                case 3: return i-10;
                case 4: return 11;
                case 5: return 16;
                default: return -1;
        }
}
--- CUT HERE ---

generates

----------------
test_switch:
        cmp     x0, 5
        bhi     .L2
        cmp     w0, 5    <<<<< REALLY
        bls     .L12
.L2:
        mov     x0, -1
        ret
        .p2align 3
.L12:   
        adrp    x1, .L4
        add     x1, x1, :lo12:.L4
        ldrb    w0, [x1,w0,uxtw]  <<<<< REALLY
        adr     x1, .Lrtx4
        add     x0, x1, w0, sxtb #2
        br      x0
----------------

I've filed a bug report.

I could maybe do a gcc goto table, but would that fool value range propagation?
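
For reference, a minimal sketch of what such a goto table might look like for the test program above, using GNU C's labels-as-values extension (`&&label` and `goto *ptr`). The function name `test_goto_table` and the label names are made up for illustration. This only shows the shape of the construct; it makes no claim about what code GCC generates for it or how value range propagation treats it.

```c
/* GNU C only: relies on the labels-as-values extension. */
unsigned long test_goto_table(unsigned long i)
{
        /* One entry per case 0..5 of the original switch. */
        static void *const table[] = {
                &&case0, &&case1, &&case2, &&case3, &&case4, &&case5
        };

        if (i > 5)
                return -1;      /* same result as the switch's default arm */

        goto *table[i];         /* indirect jump through the table */

case0:  return 1;
case1:  return i + 2;
case2:  return i;
case3:  return i - 10;
case4:  return 11;
case5:  return 16;
}
```

The explicit bounds check takes the place of the switch's default arm, so the table is never indexed out of range.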

> 
> Also, GCC can do things like if (__builtin_constant_p(cnt)).  There
> are some cases where cnt is a constant.  We should be careful that we
> don't slow down such cases.

Agreed.
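
A rough sketch of how a constant count could be special-cased with `__builtin_constant_p`, under some assumptions: `word_t` stands in for HeapWord, `copy_words_general` is a hypothetical placeholder for the general (asm) copy, and the threshold of 4 words is arbitrary. Note that `__builtin_constant_p` normally only returns true here after inlining with optimization enabled; without it, the call simply takes the general path.

```c
#include <stddef.h>

typedef unsigned long word_t;   /* stand-in for HeapWord (assumption) */

/* Hypothetical general copy; a plain loop stands in for the asm version. */
static void copy_words_general(const word_t *from, word_t *to, size_t cnt)
{
        for (size_t i = 0; i < cnt; i++)
                to[i] = from[i];
}

static inline __attribute__((always_inline))
void copy_disjoint_words(const word_t *from, word_t *to, size_t cnt)
{
        /* Count known at compile time and small: plain C stores that the
         * compiler is free to schedule and combine. */
        if (__builtin_constant_p(cnt) && cnt <= 4) {
                switch (cnt) {
                case 4: to[3] = from[3]; /* fall through */
                case 3: to[2] = from[2]; /* fall through */
                case 2: to[1] = from[1]; /* fall through */
                case 1: to[0] = from[0]; /* fall through */
                case 0: break;
                }
                return;
        }
        /* Count not known at compile time, or too large for the fast path. */
        copy_words_general(from, to, cnt);
}
```

Both paths copy the same words, so the result does not depend on which one the compiler selects.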


> void bletch() {
>   Copy::disjoint_words(blah, blah2, sizeof blah / sizeof blah[0]);
> }
> 
> Finally, GCC has __builtin_expect(bool).  We should use that to emit
> the large copy and the backwards copy out of line.

I did initially do the large copy out of line. My concern was that the register allocator wouldn't distinguish the two paths and would treat x0..x18 as clobbered on both, whereas the inline version 'only' uses 11 registers.
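
To make the suggestion concrete, here is a sketch of the `__builtin_expect` arrangement with the large and backwards copies moved out of line. All names (`word_t`, `SMALL_COPY_MAX`, `copy_forwards_outlined`, `copy_backwards_outlined`) and the threshold of 8 words are assumptions for illustration, and plain C loops stand in for the asm. It does not settle the register question above: an ordinary out-of-line call still clobbers the caller-saved registers on the path that takes it.

```c
#include <stddef.h>
#include <stdint.h>

typedef unsigned long word_t;   /* stand-in for HeapWord (assumption) */

#define SMALL_COPY_MAX 8        /* hypothetical small-copy threshold */

/* Out-of-line forward copy for large counts. */
__attribute__((noinline))
static void copy_forwards_outlined(const word_t *from, word_t *to, size_t cnt)
{
        for (size_t i = 0; i < cnt; i++)
                to[i] = from[i];
}

/* Out-of-line backwards copy for overlapping regions with to above from. */
__attribute__((noinline))
static void copy_backwards_outlined(const word_t *from, word_t *to, size_t cnt)
{
        while (cnt-- > 0)
                to[cnt] = from[cnt];
}

static inline void copy_conjoint_words(const word_t *from, word_t *to,
                                       size_t cnt)
{
        /* True when to lies inside [from, from + cnt); the unsigned
         * subtraction wraps to a huge value when to is below from. */
        uintptr_t delta = (uintptr_t)to - (uintptr_t)from;

        if (__builtin_expect(delta < cnt * sizeof(word_t), 0)) {
                copy_backwards_outlined(from, to, cnt);
                return;
        }
        if (__builtin_expect(cnt > SMALL_COPY_MAX, 0)) {
                copy_forwards_outlined(from, to, cnt);
                return;
        }
        /* Expected case: small forward copy kept inline. */
        for (size_t i = 0; i < cnt; i++)
                to[i] = from[i];
}
```

The second argument of 0 to `__builtin_expect` tells GCC that both conditions are expected to be false, which is a hint for block layout rather than a guarantee.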

All the best,
Ed.



