JEP 254, HotSpot, and optimizations

Tue Mar 22 20:59:30 UTC 2016

Value types that define a vectorized operation API are all you need.

Rémi

On 22 March 2016 21:29:07 CET, Andrew Haley <aph at redhat.com> wrote:
>I'm looking at compact strings for AArch64.  I know that they are
>intended to be implemented by HotSpot intrinsics, and the Java
>methods are placeholders, but I was tempted to investigate: could you
>write efficient implementations of the methods in pure Java?
>
>Here's what I tried for StringLatin1::inflate:
>
>        while (srcPtr < endPtr) {
>            long bytes = U.getLongUnaligned(src, srcPtr);
>            srcPtr += 8;
>
>            long chars =
>                     bytes << 56 >>> 56;
>            chars |= bytes << 48 >>> 56 << 16;
>            chars |= bytes << 40 >>> 56 << 32;
>            chars |= bytes << 32 >>> 56 << 48;
>
>            U.putLongUnaligned(dst, dstPtr, chars);
>            dstPtr += 8;
>
>            chars =  bytes << 24 >>> 56;
>            chars |= bytes << 16 >>> 56 << 16;
>            chars |= bytes <<  8 >>> 56 << 32;
>            chars |= bytes       >>> 56 << 48;
>
>            U.putLongUnaligned(dst, dstPtr, chars);
>            dstPtr += 8;
>        }
>
>
>and here's the inner loop generated by C2:
>
>  0x0000007fa8725de0: ldr	x11, [x17,x3]   ;*invokevirtual getLongUnaligned {reexecute=0 rethrow=0 return_oop=0}
>  0x0000007fa8725de4: ubfx	x12, x11, #8, #8
>  0x0000007fa8725de8: and	x13, x11, #0xff
>  0x0000007fa8725dec: ubfx	x14, x11, #16, #8
>  0x0000007fa8725df0: orr	x12, x13, x12, lsl #16
>  0x0000007fa8725df4: ubfx	x15, x11, #40, #8
>  0x0000007fa8725df8: ubfx	x16, x11, #32, #8
>  0x0000007fa8725dfc: ubfx	x13, x11, #48, #8
>  0x0000007fa8725e00: ubfx	x18, x11, #24, #8
>  0x0000007fa8725e04: orr	x12, x12, x14, lsl #32
>  0x0000007fa8725e08: orr	x14, x16, x15, lsl #16
>  0x0000007fa8725e0c: lsr	x11, x11, #56
>  0x0000007fa8725e10: orr	x13, x14, x13, lsl #32
>  0x0000007fa8725e14: orr	x12, x12, x18, lsl #48
>  0x0000007fa8725e18: orr	x11, x13, x11, lsl #48
>  0x0000007fa8725e1c: add	x13, x2, x4
>  0x0000007fa8725e20: str	x12, [x13]      ;*invokevirtual putLongUnaligned {reexecute=0 rethrow=0 return_oop=0}
>  0x0000007fa8725e24: str	x11, [x13,#8]   ;*invokevirtual putLongUnaligned {reexecute=0 rethrow=0 return_oop=0}
>  0x0000007fa8725e28: add	x4, x4, #0x10   ;*goto {reexecute=0 rethrow=0 return_oop=0}
>  0x0000007fa8725e2c: add	x3, x3, #0x8    ; ImmutableOopMap{r17=Oop c_rarg2=Oop }
>  0x0000007fa8725e30: ldr	wzr, [x5]       ;*goto {reexecute=0 rethrow=0 return_oop=0}
>                                                ;   {poll}
>  0x0000007fa8725e34: cmp	x3, x6
>  0x0000007fa8725e38: b.lt	0x0000007fa8725de0  ;*ifge {reexecute=0 rethrow=0 return_oop=0}
>                      ; - java.lang.StringLatin1::inflate at 61 (line 576)
>
>This is pretty good code.  (It's only little endian, but that's not
>hard to fix.)  I could not do any better writing this by hand unless I
>used the vector processor.  C1-generated code is worse than this, but
>it's still not bad.
>
>Perhaps it doesn't matter; perhaps we know there is no real point
>trying to make the Java versions of these methods "efficient".  We
>know that the real goal is the intrinsics which use the vector
>processor.
>
>And one other thing: if we had simple primitives available as HotSpot
>intrinsics to do a few simple vector pack and unpack operations we
>wouldn't need to write all these hand-carved assembly language
>String intrinsics.
>
>Thoughts?  Opinions?
>
>Andrew.
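
For anyone who wants to try the quoted loop outside the JDK, here is a minimal sketch of the same eight-bytes-at-a-time widening. Several things in it are assumptions and not from the original mail: the class name InflateSketch, the method signature with a byte[] destination, the use of java.nio.ByteBuffer in place of the JDK-internal U.getLongUnaligned and U.putLongUnaligned, and the scalar tail loop for lengths that are not a multiple of 8. Both buffers are pinned to little-endian order, so the shift pattern stays valid on a big-endian machine, at the cost of always producing UTF-16LE and not the platform's native char layout. It says nothing about what C2 would generate for this variant.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical stand-in for the quoted loop; not the JDK's StringLatin1 code.
final class InflateSketch {
    /** Widens len Latin-1 bytes from src into 2*len bytes of UTF-16LE in dst. */
    static void inflate(byte[] src, byte[] dst, int len) {
        // Fixing the byte order makes byte 0 the low 8 bits of each long
        // on every platform, which is what the shifts below assume.
        ByteBuffer in = ByteBuffer.wrap(src).order(ByteOrder.LITTLE_ENDIAN);
        ByteBuffer out = ByteBuffer.wrap(dst).order(ByteOrder.LITTLE_ENDIAN);

        int i = 0;
        for (; i + 8 <= len; i += 8) {
            long bytes = in.getLong(i);

            // Bytes 0..3 become chars 0..3 (one 16-bit lane each).
            long chars = bytes << 56 >>> 56;
            chars |= bytes << 48 >>> 56 << 16;
            chars |= bytes << 40 >>> 56 << 32;
            chars |= bytes << 32 >>> 56 << 48;
            out.putLong(2 * i, chars);

            // Bytes 4..7 become chars 4..7.
            chars  = bytes << 24 >>> 56;
            chars |= bytes << 16 >>> 56 << 16;
            chars |= bytes <<  8 >>> 56 << 32;
            chars |= bytes       >>> 56 << 48;
            out.putLong(2 * i + 8, chars);
        }

        // Assumed tail handling: widen any leftover bytes one at a time.
        for (; i < len; i++) {
            out.putChar(2 * i, (char) (src[i] & 0xFF));
        }
    }
}
```

Decoding dst as UTF-16LE should give back the original Latin-1 text, which is an easy way to check the shifts.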