RFD: C2: Poor code quality for Unsafe

Wed Jul 27 12:49:16 UTC 2016

Summary: The code we generate for heap-based X-Buffers is good, but
for off-heap X-Buffers it can be bad.  Quite startlingly so.

[This is AArch64 code, but Intel is very similar.  I've been working
on AArch64 so much that Intel assembler looks like line noise to me.
I will produce x86 examples if anyone wants them.]

Consider something like

    void floss(IntBuffer b, int n) {
        for (int i = 0; i < b.capacity(); i++) {
            b.put(i, n);
        }
    }

If the byte buffer is allocated like this:

    ByteBuffer buf = ByteBuffer.allocate(SIZE * 4);
    IntBuffer ibuf;
    buf.order(ByteOrder.LITTLE_ENDIAN);
    ibuf = buf.asIntBuffer();

we get nice code, sweetly unrolled:

  0x0000007fa127cd30: str	w3, [x21,w1,sxtw #2]
  0x0000007fa127cd34: str	w3, [x13,w1,sxtw #2]
  0x0000007fa127cd38: str	w3, [x14,w1,sxtw #2]
  0x0000007fa127cd3c: str	w3, [x15,w1,sxtw #2]
  0x0000007fa127cd40: str	w3, [x16,w1,sxtw #2]
  0x0000007fa127cd44: str	w3, [x17,w1,sxtw #2]
  ...

But if we allocate it as a direct buffer, like this:

    ByteBuffer buf = ByteBuffer.allocateDirect(SIZE * 4);

bad things happen:

  0x0000007fa526f060: add	w12, w11, #0xf
  0x0000007fa526f064: add	w15, w11, #0xe
  0x0000007fa526f068: sbfiz	x12, x12, #2, #32
  0x0000007fa526f06c: add	x12, x12, x16
  0x0000007fa526f070: sbfiz	x14, x15, #2, #32
  0x0000007fa526f074: add	x14, x14, x16
  0x0000007fa526f078: mov	x18, x12
  0x0000007fa526f07c: mov	x0, x14
  0x0000007fa526f080: sbfiz	x12, x11, #2, #32
  0x0000007fa526f084: add	x12, x12, x16
  0x0000007fa526f088: add	w14, w11, #0x1
  0x0000007fa526f08c: str	w3, [x12]
  0x0000007fa526f090: sbfiz	x12, x14, #2, #32
  0x0000007fa526f094: add	x12, x12, x16
  0x0000007fa526f098: add	w15, w11, #0x2
  0x0000007fa526f09c: str	w3, [x12]
  ...

As you can see, all of the address calculation is performed as
arithmetic, and the loop contains more than three times as many
instructions.

[ Note: if we inline floss() in a caller method which knows a bit more
about the types of the arguments and the fact that we keep using the
same array we get much better code than this, back to something which
is nearly optimal.  But I'd argue that this is rather like rolling the
dice and hoping for the best. ]

So what's going on here?

In a Direct-X-Buffer the address calculation is done as a long:

    private long ix(int i) {
        return address + ((long)i << $LG_BYTES_PER_VALUE$);
    }

but in a ByteBufferAs-X-Buffer it's an int:

    protected int ix(int i) {
        return (i << $LG_BYTES_PER_VALUE$) + offset;
    }

The reason for this optimization being missed is that in
CastX2PNode::Ideal we convert CastX2P(AddX(x, y)) to AddP(CastX2P(x),
y) if y fits in an int.  The logic looks like this:

    if (fits_in_int(phase->type(y)))
      {
      return addP_of_X2P(phase, x, y);
    }
    if (fits_in_int(phase->type(x)))
      {
      return addP_of_X2P(phase, y, x);
    }

I guess the idea here is that when we're indexing a pointer (encoded
as a long) and an int, we always arrange things so that the pointer is
on the left and we get a+n.  But in the case where neither x nor y
fits in an int we're not going to rearrange CastX2P(AddX(x, y)).

Nothing in C2 recognizes any of this stuff, and we're going to keep
the CastX2P nodes in the IR all the way to the back end.

This seems to be easy enough to fix, at least for Direct-X-Buffers,
like this, by commenting out a fits_in_int() test:

    if (fits_in_int(phase->type(y)))
      {
      return addP_of_X2P(phase, x, y);
    }
    // if (fits_in_int(phase->type(x)))
      {
      return addP_of_X2P(phase, y, x);
    }

we then generate:

  0x0000007f8527c260: str	w3, [x15,w16,sxtw #2]
  0x0000007f8527c264: add	w12, w16, #0x1
  0x0000007f8527c268: str	w3, [x15,w12,sxtw #2]
  0x0000007f8527c26c: add	w13, w16, #0x2
  0x0000007f8527c270: str	w3, [x15,w13,sxtw #2]
  0x0000007f8527c274: add	w12, w16, #0x3
  0x0000007f8527c278: str	w3, [x15,w12,sxtw #2]
  0x0000007f8527c27c: add	w13, w16, #0x4
  0x0000007f8527c280: str	w3, [x15,w13,sxtw #2]
  0x0000007f8527c284: add	w12, w16, #0x5
  ...

This code isn't perfect but it's a heckuva lot better than we generate
today.  It's arbitrary to pick either x or y, but picking either is an
improvement.  Maybe we could do something smarter: if one of the
arguments to the add is a shift, the other must be a base pointer.

Thoughts?  Does anyone here know the real reason why
CastX2PNode::Ideal() only canonicalizes itself if one of its arguments
fits in an int?

Andrew.

// Run with option dontinline ByteBufferTest::floss

import java.nio.*;

public class ByteBufferTest {

    int SIZE = 1024;

    ByteBuffer buf = ByteBuffer.allocateDirect(SIZE * 4);
    IntBuffer ibuf;

    ByteBufferTest() {
        buf.order(ByteOrder.LITTLE_ENDIAN);
        ibuf = buf.asIntBuffer();
    }

    static private final ByteBufferTest test = new ByteBufferTest();

    void floss(IntBuffer b, int n) {
        for (int i = 0; i < b.capacity(); i++) {
            b.put(i, n);
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 100000; i++)
            test.floss(test.ibuf, i);
    }
}