RFD: C2: Poor code quality for Unsafe
Paul Sandoz
paul.sandoz at oracle.com
Wed Jul 27 12:57:18 UTC 2016
Hi Andrew,
What JDK build are you using? dev, hs, or hs-comp?
Would it be possible to reevaluate with a fix for a bug i introduced switching ByteBufferAs-X-Buffer to use Unsafe?
http://mail.openjdk.java.net/pipermail/hotspot-dev/2016-July/024007.html <http://mail.openjdk.java.net/pipermail/hotspot-dev/2016-July/024007.html>
Thanks,
Paul.
> On 27 Jul 2016, at 14:49, Andrew Haley <aph at redhat.com> wrote:
>
> Summary: The code we generate for heap-based X-Buffers is good, but
> for off-heap X-Buffers it can be bad. Quite startlingly so.
>
> [This is AArch64 code, but Intel is very similar. I've been working
> on AArch64 so much that Intel assembler looks like line noise to me.
> I will produce x86 examples if anyone wants them.]
>
> Consider something like
>
> void floss(IntBuffer b, int n) {
> for (int i = 0; i < b.capacity(); i++) {
> b.put(i, n);
> }
> }
>
> If the byte buffer is allocated like this:
>
> ByteBuffer buf = ByteBuffer.allocate(SIZE * 4);
> IntBuffer ibuf;
> buf.order(ByteOrder.LITTLE_ENDIAN);
> ibuf = buf.asIntBuffer();
>
> we get nice code, sweetly unrolled:
>
> 0x0000007fa127cd30: str w3, [x21,w1,sxtw #2]
> 0x0000007fa127cd34: str w3, [x13,w1,sxtw #2]
> 0x0000007fa127cd38: str w3, [x14,w1,sxtw #2]
> 0x0000007fa127cd3c: str w3, [x15,w1,sxtw #2]
> 0x0000007fa127cd40: str w3, [x16,w1,sxtw #2]
> 0x0000007fa127cd44: str w3, [x17,w1,sxtw #2]
> ...
>
> But if we allocate it as a direct buffer, like this:
>
> ByteBuffer buf = ByteBuffer.allocateDirect(SIZE * 4);
>
> bad things happen:
>
> 0x0000007fa526f060: add w12, w11, #0xf
> 0x0000007fa526f064: add w15, w11, #0xe
> 0x0000007fa526f068: sbfiz x12, x12, #2, #32
> 0x0000007fa526f06c: add x12, x12, x16
> 0x0000007fa526f070: sbfiz x14, x15, #2, #32
> 0x0000007fa526f074: add x14, x14, x16
> 0x0000007fa526f078: mov x18, x12
> 0x0000007fa526f07c: mov x0, x14
> 0x0000007fa526f080: sbfiz x12, x11, #2, #32
> 0x0000007fa526f084: add x12, x12, x16
> 0x0000007fa526f088: add w14, w11, #0x1
> 0x0000007fa526f08c: str w3, [x12]
> 0x0000007fa526f090: sbfiz x12, x14, #2, #32
> 0x0000007fa526f094: add x12, x12, x16
> 0x0000007fa526f098: add w15, w11, #0x2
> 0x0000007fa526f09c: str w3, [x12]
> ...
>
> As you can see, all of the address calculation is performed as
> arithmetic, and the loop contains more than three times as many
> instructions.
>
> [ Note: if we inline floss() in a caller method which knows a bit more
> about the types of the arguments and the fact that we keep using the
> same array we get much better code than this, back to something which
> is nearly optimal. But I'd argue that this is rather like rolling the
> dice and hoping for the best. ]
>
> So what's going on here?
>
> In a Direct-X-Buffer the address calculation is done as a long:
>
> private long ix(int i) {
> return address + ((long)i << $LG_BYTES_PER_VALUE$);
> }
>
> but in a ByteBufferAs-X-Buffer it's an int:
>
> protected int ix(int i) {
> return (i << $LG_BYTES_PER_VALUE$) + offset;
> }
>
> The reason for this optimization being missed is that in
> CastX2PNode::Ideal we convert CastX2P(AddX(x, y)) to AddP(CastX2P(x),
> y) if y fits in an int. The logic looks like this:
>
> if (fits_in_int(phase->type(y)))
> {
> return addP_of_X2P(phase, x, y);
> }
> if (fits_in_int(phase->type(x)))
> {
> return addP_of_X2P(phase, y, x);
> }
>
> I guess the idea here is that when we're indexing a pointer (encoded
> as a long) and an int, we always arrange things so that the pointer is
> on the left and we get a+n. But in the case where neither x nor y
> fits in an int we're not going to rearrange CastX2P(AddX(x, y)).
>
> Nothing in C2 recognizes any of this stuff, and we're going to keep
> the CastX2P nodes in the IR all the way to the back end.
>
> This seems to be easy enough to fix, at least for Direct-X-Buffers,
> like this, by commenting out a fits_in_int() test:
>
> if (fits_in_int(phase->type(y)))
> {
> return addP_of_X2P(phase, x, y);
> }
> // if (fits_in_int(phase->type(x)))
> {
> return addP_of_X2P(phase, y, x);
> }
>
> we then generate:
>
> 0x0000007f8527c260: str w3, [x15,w16,sxtw #2]
> 0x0000007f8527c264: add w12, w16, #0x1
> 0x0000007f8527c268: str w3, [x15,w12,sxtw #2]
> 0x0000007f8527c26c: add w13, w16, #0x2
> 0x0000007f8527c270: str w3, [x15,w13,sxtw #2]
> 0x0000007f8527c274: add w12, w16, #0x3
> 0x0000007f8527c278: str w3, [x15,w12,sxtw #2]
> 0x0000007f8527c27c: add w13, w16, #0x4
> 0x0000007f8527c280: str w3, [x15,w13,sxtw #2]
> 0x0000007f8527c284: add w12, w16, #0x5
> ...
>
> This code isn't perfect but it's a heckuva lot better than we generate
> today. It's arbitrary to pick either x or y, but picking either is an
> improvement. Maybe we could do something smarter: if one of the
> arguments to the add is a shift, the other must be a base pointer.
>
> Thoughts? Does anyone here know the real reason why
> CastX2PNode::Ideal() only canonicalizes itself if one of its arguments
> fits in an int?
>
> Andrew.
>
>
> // Run with option dontinline ByteBufferTest::floss
>
> import java.nio.*;
>
> public class ByteBufferTest {
>
> int SIZE = 1024;
>
> ByteBuffer buf = ByteBuffer.allocateDirect(SIZE * 4);
> IntBuffer ibuf;
>
> ByteBufferTest() {
> buf.order(ByteOrder.LITTLE_ENDIAN);
> ibuf = buf.asIntBuffer();
> }
>
> static private final ByteBufferTest test = new ByteBufferTest();
>
> void floss(IntBuffer b, int n) {
> for (int i = 0; i < b.capacity(); i++) {
> b.put(i, n);
> }
> }
>
> public static void main(String[] args) {
> for (int i = 0; i < 100000; i++)
> test.floss(test.ibuf, i);
> }
> }
>
More information about the hotspot-dev
mailing list