[aarch64-port-dev ] RFR(M): 8189112 - AARCH64: optimize StringUTF16 compress intrinsic
Andrew Haley
aph at redhat.com
Tue May 15 17:10:46 UTC 2018
On 05/08/2018 02:26 PM, Dmitrij Pochepko wrote:
> Hi all,
>
> please review patch for 8189112 - AARCH64: optimize StringUTF16 compress
> intrinsic
>
> This patch is based on 3 improvement ideas:
>
> - introduction of an additional large loop with a prefetch instruction
> for long strings
> - a different compression implementation, using uzp1 and uzp2 instructions
> instead of uqxtn and uqxtn2, which are more expensive. It also allows
> dropping direct FPSR register operations, which are very slow on some CPUs.
> - a slightly different code shape, which mostly executes branches and
> independent operations while loads and stores are in progress (helps
> "in-order" CPUs)
>
> benchmarks: I created a JMH benchmark with a direct call via reflection:
> http://cr.openjdk.java.net/~dpochepk/8189112/StrCompressBench.java
I think this benchmark is misleading because it uses Method.invoke()
in the inner timing loop. I rewrote it to use a MethodHandle, and got:
Benchmark                                   (ALL)  (size)  Mode  Cnt      Score    Error  Units
StrCompressBench.compressDifferent        1000000     256  avgt   10    394.814 ± 69.714  ns/op
StrCompressBench.compressDifferentHandle  1000000     256  avgt   10    242.431 ±  0.861  ns/op
It's at http://cr.openjdk.java.net/~aph/8189112/StrCompressBench.java
(Note: Method.invoke() has so much jitter because it does a ton
of work boxing and unboxing the args. You'll see this if you look
at the disassembly of StrCompressBench.compressDifferent().)
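To make the difference concrete, here is a minimal sketch of the two call styles. It is not the code from either benchmark: the class name, the helper names, and the local compress() method are hypothetical stand-ins (the real target is an internal JDK method), and the JMH harness is left out. The reflective call passes its arguments as an Object[] and returns a boxed Integer, while the MethodHandle held in a static final field is called with invokeExact() and its exact primitive signature.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.reflect.Method;

final class InvokeVsHandle {
    // Hypothetical stand-in for the compress routine: copies chars to bytes and
    // returns len on success, or 0 if some char does not fit in one byte.
    public static int compress(char[] src, int srcOff, byte[] dst, int dstOff, int len) {
        for (int i = 0; i < len; i++) {
            char c = src[srcOff + i];
            if (c > 0xFF) {
                return 0;
            }
            dst[dstOff + i] = (byte) c;
        }
        return len;
    }

    static final Method METHOD;
    static final MethodHandle HANDLE; // static final, so the JIT can treat it as a constant

    static {
        try {
            METHOD = InvokeVsHandle.class.getMethod("compress",
                    char[].class, int.class, byte[].class, int.class, int.class);
            HANDLE = MethodHandles.lookup().findStatic(InvokeVsHandle.class, "compress",
                    MethodType.methodType(int.class,
                            char[].class, int.class, byte[].class, int.class, int.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Reflective call: arguments travel in an Object[] and the result comes back boxed.
    public static int viaReflection(char[] src, byte[] dst) {
        try {
            return (Integer) METHOD.invoke(null, src, 0, dst, 0, src.length);
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }

    // MethodHandle call: invokeExact uses the exact (char[],int,byte[],int,int)int signature.
    public static int viaHandle(char[] src, byte[] dst) {
        try {
            return (int) HANDLE.invokeExact(src, 0, dst, 0, src.length);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public static void main(String[] args) {
        char[] src = "hello".toCharArray();
        System.out.println(viaReflection(src, new byte[src.length]));
        System.out.println(viaHandle(src, new byte[src.length]));
    }
}
```

Both paths return the same value; the only thing that changes is how much call overhead lands inside the timed region.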
With that change, I get (on APM Mustang)
Before your change:
Benchmark                                   (ALL)  (size)  Mode  Cnt      Score    Error  Units
StrCompressBench.compressDifferentHandle  1000000       4  avgt   10     30.739 ±  0.128  ns/op
StrCompressBench.compressDifferentHandle  1000000       8  avgt   10     33.451 ±  0.172  ns/op
StrCompressBench.compressDifferentHandle  1000000      16  avgt   10     42.327 ±  0.058  ns/op
StrCompressBench.compressDifferentHandle  1000000     256  avgt   10    389.433 ±  1.608  ns/op
StrCompressBench.compressDifferentHandle  1000000    1024  avgt   10   1028.375 ±  4.364  ns/op
StrCompressBench.compressDifferentHandle  1000000   32768  avgt   10  15321.996 ±  5.059  ns/op
After:
Benchmark                                   (ALL)  (size)  Mode  Cnt      Score    Error  Units
StrCompressBench.compressDifferentHandle  1000000       4  avgt   10     30.097 ±  0.071  ns/op
StrCompressBench.compressDifferentHandle  1000000       8  avgt   10     29.482 ±  0.122  ns/op
StrCompressBench.compressDifferentHandle  1000000      16  avgt   10     36.548 ±  0.070  ns/op
StrCompressBench.compressDifferentHandle  1000000     256  avgt   10    240.499 ±  0.446  ns/op
StrCompressBench.compressDifferentHandle  1000000    1024  avgt   10    603.500 ±  0.829  ns/op
StrCompressBench.compressDifferentHandle  1000000   32768  avgt   10  14538.528 ± 30.215  ns/op
... which is a decent-enough speedup for medium-sized strings.
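For reference, the per-size speedup can be read off the two tables as before/after. The sketch below just does that division; the class and method names are made up for illustration, and the scores are copied from the results above (error bars are ignored).

```java
final class Speedup {
    // Sizes and avgt scores (ns/op) copied from the "Before" and "After" tables.
    static final int[] SIZES = {4, 8, 16, 256, 1024, 32768};
    static final double[] BEFORE = {30.739, 33.451, 42.327, 389.433, 1028.375, 15321.996};
    static final double[] AFTER = {30.097, 29.482, 36.548, 240.499, 603.500, 14538.528};

    // Speedup for the i-th size: time before the patch divided by time after it.
    public static double speedup(int i) {
        return BEFORE[i] / AFTER[i];
    }

    public static void main(String[] args) {
        for (int i = 0; i < SIZES.length; i++) {
            System.out.printf("size %5d: %.2fx%n", SIZES[i], speedup(i));
        }
    }
}
```

The ratio comes out at roughly 1.6x for size 256 and 1.7x for size 1024, and is much closer to 1x at both ends of the range.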
OK.
--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671