RFD: C2: Poor code quality for Unsafe

Wed Jul 27 12:57:18 UTC 2016

Hi Andrew,

What JDK build are you using? dev, hs, or hs-comp?

Would it be possible to reevaluate with a fix for a bug i introduced switching ByteBufferAs-X-Buffer to use Unsafe?

  http://mail.openjdk.java.net/pipermail/hotspot-dev/2016-July/024007.html <http://mail.openjdk.java.net/pipermail/hotspot-dev/2016-July/024007.html>

Thanks,
Paul.

> On 27 Jul 2016, at 14:49, Andrew Haley <aph at redhat.com> wrote:
> 
> Summary: The code we generate for heap-based X-Buffers is good, but
> for off-heap X-Buffers it can be bad.  Quite startlingly so.
> 
> [This is AArch64 code, but Intel is very similar.  I've been working
> on AArch64 so much that Intel assembler looks like line noise to me.
> I will produce x86 examples if anyone wants them.]
> 
> Consider something like
> 
>    void floss(IntBuffer b, int n) {
>        for (int i = 0; i < b.capacity(); i++) {
>            b.put(i, n);
>        }
>    }
> 
> If the byte buffer is allocated like this:
> 
>    ByteBuffer buf = ByteBuffer.allocate(SIZE * 4);
>    IntBuffer ibuf;
>    buf.order(ByteOrder.LITTLE_ENDIAN);
>    ibuf = buf.asIntBuffer();
> 
> we get nice code, sweetly unrolled:
> 
>  0x0000007fa127cd30: str	w3, [x21,w1,sxtw #2]
>  0x0000007fa127cd34: str	w3, [x13,w1,sxtw #2]
>  0x0000007fa127cd38: str	w3, [x14,w1,sxtw #2]
>  0x0000007fa127cd3c: str	w3, [x15,w1,sxtw #2]
>  0x0000007fa127cd40: str	w3, [x16,w1,sxtw #2]
>  0x0000007fa127cd44: str	w3, [x17,w1,sxtw #2]
>  ...
> 
> But if we allocate it as a direct buffer, like this:
> 
>    ByteBuffer buf = ByteBuffer.allocateDirect(SIZE * 4);
> 
> bad things happen:
> 
>  0x0000007fa526f060: add	w12, w11, #0xf
>  0x0000007fa526f064: add	w15, w11, #0xe
>  0x0000007fa526f068: sbfiz	x12, x12, #2, #32
>  0x0000007fa526f06c: add	x12, x12, x16
>  0x0000007fa526f070: sbfiz	x14, x15, #2, #32
>  0x0000007fa526f074: add	x14, x14, x16
>  0x0000007fa526f078: mov	x18, x12
>  0x0000007fa526f07c: mov	x0, x14
>  0x0000007fa526f080: sbfiz	x12, x11, #2, #32
>  0x0000007fa526f084: add	x12, x12, x16
>  0x0000007fa526f088: add	w14, w11, #0x1
>  0x0000007fa526f08c: str	w3, [x12]
>  0x0000007fa526f090: sbfiz	x12, x14, #2, #32
>  0x0000007fa526f094: add	x12, x12, x16
>  0x0000007fa526f098: add	w15, w11, #0x2
>  0x0000007fa526f09c: str	w3, [x12]
>  ...
> 
> As you can see, all of the address calculation is performed as
> arithmetic, and the loop contains more than three times as many
> instructions.
> 
> [ Note: if we inline floss() in a caller method which knows a bit more
> about the types of the arguments and the fact that we keep using the
> same array we get much better code than this, back to something which
> is nearly optimal.  But I'd argue that this is rather like rolling the
> dice and hoping for the best. ]
> 
> So what's going on here?
> 
> In a Direct-X-Buffer the address calculation is done as a long:
> 
>    private long ix(int i) {
>        return address + ((long)i << $LG_BYTES_PER_VALUE$);
>    }
> 
> but in a ByteBufferAs-X-Buffer it's an int:
> 
>    protected int ix(int i) {
>        return (i << $LG_BYTES_PER_VALUE$) + offset;
>    }
> 
> The reason for this optimization being missed is that in
> CastX2PNode::Ideal we convert CastX2P(AddX(x, y)) to AddP(CastX2P(x),
> y) if y fits in an int.  The logic looks like this:
> 
>    if (fits_in_int(phase->type(y)))
>      {
>      return addP_of_X2P(phase, x, y);
>    }
>    if (fits_in_int(phase->type(x)))
>      {
>      return addP_of_X2P(phase, y, x);
>    }
> 
> I guess the idea here is that when we're indexing a pointer (encoded
> as a long) and an int, we always arrange things so that the pointer is
> on the left and we get a+n.  But in the case where neither x nor y
> fits in an int we're not going to rearrange CastX2P(AddX(x, y)).
> 
> Nothing in C2 recognizes any of this stuff, and we're going to keep
> the CastX2P nodes in the IR all the way to the back end.
> 
> This seems to be easy enough to fix, at least for Direct-X-Buffers,
> like this, by commenting out a fits_in_int() test:
> 
>    if (fits_in_int(phase->type(y)))
>      {
>      return addP_of_X2P(phase, x, y);
>    }
>    // if (fits_in_int(phase->type(x)))
>      {
>      return addP_of_X2P(phase, y, x);
>    }
> 
> we then generate:
> 
>  0x0000007f8527c260: str	w3, [x15,w16,sxtw #2]
>  0x0000007f8527c264: add	w12, w16, #0x1
>  0x0000007f8527c268: str	w3, [x15,w12,sxtw #2]
>  0x0000007f8527c26c: add	w13, w16, #0x2
>  0x0000007f8527c270: str	w3, [x15,w13,sxtw #2]
>  0x0000007f8527c274: add	w12, w16, #0x3
>  0x0000007f8527c278: str	w3, [x15,w12,sxtw #2]
>  0x0000007f8527c27c: add	w13, w16, #0x4
>  0x0000007f8527c280: str	w3, [x15,w13,sxtw #2]
>  0x0000007f8527c284: add	w12, w16, #0x5
>  ...
> 
> This code isn't perfect but it's a heckuva lot better than we generate
> today.  It's arbitrary to pick either x or y, but picking either is an
> improvement.  Maybe we could do something smarter: if one of the
> arguments to the add is a shift, the other must be a base pointer.
> 
> Thoughts?  Does anyone here know the real reason why
> CastX2PNode::Ideal() only canonicalizes itself if one of its arguments
> fits in an int?
> 
> Andrew.
> 
> 
> // Run with option dontinline ByteBufferTest::floss
> 
> import java.nio.*;
> 
> public class ByteBufferTest {
> 
>    int SIZE = 1024;
> 
>    ByteBuffer buf = ByteBuffer.allocateDirect(SIZE * 4);
>    IntBuffer ibuf;
> 
>    ByteBufferTest() {
>        buf.order(ByteOrder.LITTLE_ENDIAN);
>        ibuf = buf.asIntBuffer();
>    }
> 
>    static private final ByteBufferTest test = new ByteBufferTest();
> 
>    void floss(IntBuffer b, int n) {
>        for (int i = 0; i < b.capacity(); i++) {
>            b.put(i, n);
>        }
>    }
> 
>    public static void main(String[] args) {
>        for (int i = 0; i < 100000; i++)
>            test.floss(test.ibuf, i);
>    }
> }
>