JEP 254, HotSpot, and optimizations

Tue Mar 22 20:29:07 UTC 2016

I'm looking at compact strings for AArch64.  I know that they are
intended to be implemented by HotSpot intrinsics, and that the Java
methods are placeholders, but I was tempted to investigate: could you
write efficient implementations of the methods in pure Java?

Here's what I tried for StringLatin1::inflate:

        while (srcPtr < endPtr) {
            long bytes = U.getLongUnaligned(src, srcPtr);
            srcPtr += 8;

            long chars =
                     bytes << 56 >>> 56;
            chars |= bytes << 48 >>> 56 << 16;
            chars |= bytes << 40 >>> 56 << 32;
            chars |= bytes << 32 >>> 56 << 48;

            U.putLongUnaligned(dst, dstPtr, chars);
            dstPtr += 8;

            chars =  bytes << 24 >>> 56;
            chars |= bytes << 16 >>> 56 << 16;
            chars |= bytes <<  8 >>> 56 << 32;
            chars |= bytes       >>> 56 << 48;

            U.putLongUnaligned(dst, dstPtr, chars);
            dstPtr += 8;
        }

and here's the inner loop generated by C2:

  0x0000007fa8725de0: ldr	x11, [x17,x3]   ;*invokevirtual getLongUnaligned {reexecute=0 rethrow=0 return_oop=0}
  0x0000007fa8725de4: ubfx	x12, x11, #8, #8
  0x0000007fa8725de8: and	x13, x11, #0xff
  0x0000007fa8725dec: ubfx	x14, x11, #16, #8
  0x0000007fa8725df0: orr	x12, x13, x12, lsl #16
  0x0000007fa8725df4: ubfx	x15, x11, #40, #8
  0x0000007fa8725df8: ubfx	x16, x11, #32, #8
  0x0000007fa8725dfc: ubfx	x13, x11, #48, #8
  0x0000007fa8725e00: ubfx	x18, x11, #24, #8
  0x0000007fa8725e04: orr	x12, x12, x14, lsl #32
  0x0000007fa8725e08: orr	x14, x16, x15, lsl #16
  0x0000007fa8725e0c: lsr	x11, x11, #56
  0x0000007fa8725e10: orr	x13, x14, x13, lsl #32
  0x0000007fa8725e14: orr	x12, x12, x18, lsl #48
  0x0000007fa8725e18: orr	x11, x13, x11, lsl #48
  0x0000007fa8725e1c: add	x13, x2, x4
  0x0000007fa8725e20: str	x12, [x13]      ;*invokevirtual putLongUnaligned {reexecute=0 rethrow=0 return_oop=0}
  0x0000007fa8725e24: str	x11, [x13,#8]   ;*invokevirtual putLongUnaligned {reexecute=0 rethrow=0 return_oop=0}
  0x0000007fa8725e28: add	x4, x4, #0x10   ;*goto {reexecute=0 rethrow=0 return_oop=0}
  0x0000007fa8725e2c: add	x3, x3, #0x8    ; ImmutableOopMap{r17=Oop c_rarg2=Oop }
  0x0000007fa8725e30: ldr	wzr, [x5]       ;*goto {reexecute=0 rethrow=0 return_oop=0}
                                                ;   {poll}
  0x0000007fa8725e34: cmp	x3, x6
  0x0000007fa8725e38: b.lt	0x0000007fa8725de0  ;*ifge {reexecute=0 rethrow=0 return_oop=0}
                                                ; - java.lang.StringLatin1::inflate at 61 (line 576)

This is pretty good code.  (It's only little endian, but that's not
hard to fix.)  I could not do any better writing this by hand unless I
used the vector processor.  C1-generated code is worse than this, but
it's still not bad.
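
For anyone who wants to try the shift sequence outside the JDK, here
is a minimal self-contained sketch.  It is not the StringLatin1 code:
it assumes a little-endian layout, uses ByteBuffer instead of Unsafe,
and only accepts lengths that are a multiple of 8.  The names
InflateSketch, widenLow and widenHigh are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

final class InflateSketch {
    // Spread bytes 0..3 of 'bytes' into four 16-bit lanes of a long.
    static long widenLow(long bytes) {
        long chars = bytes << 56 >>> 56;
        chars |= bytes << 48 >>> 56 << 16;
        chars |= bytes << 40 >>> 56 << 32;
        chars |= bytes << 32 >>> 56 << 48;
        return chars;
    }

    // Spread bytes 4..7 of 'bytes' into four 16-bit lanes of a long.
    static long widenHigh(long bytes) {
        long chars = bytes << 24 >>> 56;
        chars |= bytes << 16 >>> 56 << 16;
        chars |= bytes << 8 >>> 56 << 32;
        chars |= bytes >>> 56 << 48;
        return chars;
    }

    // Inflate Latin-1 bytes to chars, eight bytes per iteration.
    // Assumption for this sketch: src.length is a multiple of 8.
    static char[] inflate(byte[] src) {
        if (src.length % 8 != 0) {
            throw new IllegalArgumentException("length must be a multiple of 8");
        }
        // Force little-endian so that byte 0 lands in the low bits of the long.
        ByteBuffer in = ByteBuffer.wrap(src).order(ByteOrder.LITTLE_ENDIAN);
        ByteBuffer out = ByteBuffer.allocate(src.length * 2)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        while (in.remaining() >= 8) {
            long bytes = in.getLong();
            out.putLong(widenLow(bytes));
            out.putLong(widenHigh(bytes));
        }
        out.flip();
        char[] dst = new char[src.length];
        // The char view inherits the little-endian order set above.
        out.asCharBuffer().get(dst);
        return dst;
    }

    public static void main(String[] args) {
        byte[] latin1 = "ABCDEFGH".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(new String(inflate(latin1)));
    }
}
```

A big-endian variant would only need the lane order in widenLow and
widenHigh reversed; the loop itself stays the same.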

Perhaps it doesn't matter; perhaps there is no real point in trying
to make the Java versions of these methods "efficient", because the
real goal is the intrinsics, which use the vector processor.

And one other thing: if we had simple primitives available as HotSpot
intrinsics to do a few simple vector pack and unpack operations, we
wouldn't need to write all these hand-carved assembly language
String intrinsics.
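
To make that concrete, here is a rough sketch of what such primitives
might look like from the Java side.  Everything in it is hypothetical:
PackOps, unpackLo, unpackHi and pack are invented names, not an
existing or proposed JDK API.  The bodies are plain-Java fallbacks
operating on a long as a stand-in for a vector register, with byte 0
in the low bits; the idea is that a JIT could replace each call with
a widening or narrowing vector instruction.

```java
final class PackOps {
    // Widen the four low bytes of 'bytes' into four 16-bit lanes.
    private static long widen4(long bytes) {
        return (bytes & 0xFFL)
             | ((bytes >>> 8 & 0xFFL) << 16)
             | ((bytes >>> 16 & 0xFFL) << 32)
             | ((bytes >>> 24 & 0xFFL) << 48);
    }

    // Narrow four 16-bit lanes to four bytes, dropping each high byte.
    private static long narrow4(long chars) {
        return (chars & 0xFFL)
             | ((chars >>> 16 & 0xFFL) << 8)
             | ((chars >>> 32 & 0xFFL) << 16)
             | ((chars >>> 48 & 0xFFL) << 24);
    }

    // Hypothetical primitive: unpack bytes 0..3 to 16-bit lanes.
    static long unpackLo(long bytes) {
        return widen4(bytes);
    }

    // Hypothetical primitive: unpack bytes 4..7 to 16-bit lanes.
    static long unpackHi(long bytes) {
        return widen4(bytes >>> 32);
    }

    // Hypothetical primitive: pack eight 16-bit lanes back into eight bytes.
    static long pack(long lo, long hi) {
        return narrow4(lo) | (narrow4(hi) << 32);
    }

    public static void main(String[] args) {
        long bytes = 0x8877665544332211L;
        long lo = unpackLo(bytes);
        long hi = unpackHi(bytes);
        System.out.println(Long.toHexString(lo));
        System.out.println(Long.toHexString(hi));
        System.out.println(pack(lo, hi) == bytes);
    }
}
```

With primitives like these, inflate would be a loop over unpackLo and
unpackHi, and compress a loop over pack, with no per-method assembly.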

Thoughts?  Opinions?

Andrew.