JEP 254, HotSpot, and optimizations
Andrew Haley
aph at redhat.com
Tue Mar 22 20:29:07 UTC 2016
I'm looking at compact strings for AArch64. I know that they are
intended to be implemented by HotSpot intrinsics, and that the Java
methods are placeholders, but I was tempted to investigate: could you
write efficient implementations of the methods in pure Java?
Here's what I tried for StringLatin1::inflate:
    while (srcPtr < endPtr) {
        long bytes = U.getLongUnaligned(src, srcPtr);
        srcPtr += 8;
        long chars =
                 bytes << 56 >>> 56;
        chars |= bytes << 48 >>> 56 << 16;
        chars |= bytes << 40 >>> 56 << 32;
        chars |= bytes << 32 >>> 56 << 48;
        U.putLongUnaligned(dst, dstPtr, chars);
        dstPtr += 8;
        chars  = bytes << 24 >>> 56;
        chars |= bytes << 16 >>> 56 << 16;
        chars |= bytes << 8  >>> 56 << 32;
        chars |= bytes       >>> 56 << 48;
        U.putLongUnaligned(dst, dstPtr, chars);
        dstPtr += 8;
    }
and here's the inner loop generated by C2:
0x0000007fa8725de0: ldr x11, [x17,x3] ;*invokevirtual getLongUnaligned {reexecute=0 rethrow=0 return_oop=0}
0x0000007fa8725de4: ubfx x12, x11, #8, #8
0x0000007fa8725de8: and x13, x11, #0xff
0x0000007fa8725dec: ubfx x14, x11, #16, #8
0x0000007fa8725df0: orr x12, x13, x12, lsl #16
0x0000007fa8725df4: ubfx x15, x11, #40, #8
0x0000007fa8725df8: ubfx x16, x11, #32, #8
0x0000007fa8725dfc: ubfx x13, x11, #48, #8
0x0000007fa8725e00: ubfx x18, x11, #24, #8
0x0000007fa8725e04: orr x12, x12, x14, lsl #32
0x0000007fa8725e08: orr x14, x16, x15, lsl #16
0x0000007fa8725e0c: lsr x11, x11, #56
0x0000007fa8725e10: orr x13, x14, x13, lsl #32
0x0000007fa8725e14: orr x12, x12, x18, lsl #48
0x0000007fa8725e18: orr x11, x13, x11, lsl #48
0x0000007fa8725e1c: add x13, x2, x4
0x0000007fa8725e20: str x12, [x13] ;*invokevirtual putLongUnaligned {reexecute=0 rethrow=0 return_oop=0}
0x0000007fa8725e24: str x11, [x13,#8] ;*invokevirtual putLongUnaligned {reexecute=0 rethrow=0 return_oop=0}
0x0000007fa8725e28: add x4, x4, #0x10 ;*goto {reexecute=0 rethrow=0 return_oop=0}
0x0000007fa8725e2c: add x3, x3, #0x8 ; ImmutableOopMap{r17=Oop c_rarg2=Oop }
0x0000007fa8725e30: ldr wzr, [x5] ;*goto {reexecute=0 rethrow=0 return_oop=0}
; {poll}
0x0000007fa8725e34: cmp x3, x6
0x0000007fa8725e38: b.lt 0x0000007fa8725de0 ;*ifge {reexecute=0 rethrow=0 return_oop=0}
; - java.lang.StringLatin1::inflate at 61 (line 576)
This is pretty good code. (It's only little endian, but that's not
hard to fix.) I could not do any better writing this by hand unless I
used the vector processor. C1-generated code is worse than this, but
it's still not bad.
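For readers who want to try the widening trick outside the JDK, here is
a minimal sketch of the same idea that avoids Unsafe. It is not the
code from this post: the class name InflateSketch and the helper spread
are hypothetical, and it assumes the destination is a byte[] holding
UTF-16 in little-endian order. Fixing the byte order in the VarHandle
is one way to handle the endianness caveat, since the shifts then mean
the same thing on any platform. Whether C2 compiles this as well as the
Unsafe version has not been checked here.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Hypothetical illustration class; not part of the JDK.
final class InflateSketch {
    // Views a byte[] as 64-bit values with an explicit little-endian layout,
    // so byte i of the array is always bits 8*i .. 8*i+7 of the long.
    private static final VarHandle LONG_LE =
        MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

    // Widens the four bytes in the low 32 bits of v into four 16-bit lanes.
    // Any bits above bit 31 are masked away.
    static long spread(long v) {
        return (v & 0xFFL)
             | (v & 0xFF00L) << 8
             | (v & 0xFF0000L) << 16
             | (v & 0xFF000000L) << 24;
    }

    // Inflates Latin-1 bytes to UTF-16LE bytes, eight source bytes per iteration.
    static byte[] inflate(byte[] src) {
        byte[] dst = new byte[src.length * 2];
        int i = 0;
        for (; i + 8 <= src.length; i += 8) {
            long bytes = (long) LONG_LE.get(src, i);
            LONG_LE.set(dst, 2 * i, spread(bytes));             // chars 0..3
            LONG_LE.set(dst, 2 * i + 8, spread(bytes >>> 32));  // chars 4..7
        }
        // Tail: fewer than eight bytes left, so copy one at a time.
        for (; i < src.length; i++) {
            dst[2 * i] = src[i];  // high byte of each char is already zero
        }
        return dst;
    }
}
```

The result can be checked by decoding it again, for example with
new String(InflateSketch.inflate(latin1Bytes), StandardCharsets.UTF_16LE).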
Perhaps it doesn't matter; perhaps we know there is no real point
trying to make the Java versions of these methods "efficient". We
know that the real goal is the intrinsics which use the vector
processor.
And one other thing: if we had simple primitives available as HotSpot
intrinsics to do a few simple vector pack and unpack operations, we
wouldn't need to write all these hand-carved assembly language
String intrinsics.
Thoughts? Opinions?
Andrew.