RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v3]
Andrew Haley
aph at openjdk.java.net
Fri Apr 2 08:49:29 UTC 2021
On Fri, 2 Apr 2021 03:10:57 GMT, Dong Bo <dongbo at openjdk.org> wrote:
>> In JDK-8248188, an IntrinsicCandidate and API were added for Base64 decoding.
>> Base64 decoding can be improved on aarch64 with ld4/tbl/tbx/st3; the basic idea is described at http://0x80.pl/articles/base64-simd-neon.html#encoding-quadwords.
>>
>> The patch passed jtreg tier1-3 tests with a linux-aarch64-server-fastdebug build.
>> The tests in `test/jdk/java/util/Base64/` and `compiler/intrinsics/base64/TestBase64.java` were also run specifically to check the correctness of the implementation.
>>
>> There can be illegal characters at the start of the input if the data is MIME encoded.
>> There would be no benefit from using SIMD in this case, so the stub currently uses non-SIMD instructions for MIME-encoded data.
>>
>> A JMH micro-benchmark, Base64Decode.java, is added for performance testing.
>> With different input lengths (upper-bounded by the parameter `maxNumBytes` in the JMH micro-benchmark),
>> we see ~2.5x improvements with long inputs and no regression with short inputs for raw Base64 decoding, and minor improvements (~10.95%) for MIME on Kunpeng916.
>>
>> The Base64Decode.java JMH micro-benchmark results:
>>
>> Benchmark (lineSize) (maxNumBytes) Mode Cnt Score Error Units
>>
>> # Kunpeng916, intrinsic
>> Base64Decode.testBase64Decode 4 1 avgt 5 48.614 ± 0.609 ns/op
>> Base64Decode.testBase64Decode 4 3 avgt 5 58.199 ± 1.650 ns/op
>> Base64Decode.testBase64Decode 4 7 avgt 5 69.400 ± 0.931 ns/op
>> Base64Decode.testBase64Decode 4 32 avgt 5 96.818 ± 1.687 ns/op
>> Base64Decode.testBase64Decode 4 64 avgt 5 122.856 ± 9.217 ns/op
>> Base64Decode.testBase64Decode 4 80 avgt 5 130.935 ± 1.667 ns/op
>> Base64Decode.testBase64Decode 4 96 avgt 5 143.627 ± 1.751 ns/op
>> Base64Decode.testBase64Decode 4 112 avgt 5 152.311 ± 1.178 ns/op
>> Base64Decode.testBase64Decode 4 512 avgt 5 342.631 ± 0.584 ns/op
>> Base64Decode.testBase64Decode 4 1000 avgt 5 573.635 ± 1.050 ns/op
>> Base64Decode.testBase64Decode 4 20000 avgt 5 9534.136 ± 45.172 ns/op
>> Base64Decode.testBase64Decode 4 50000 avgt 5 22718.726 ± 192.070 ns/op
>> Base64Decode.testBase64MIMEDecode 4 1 avgt 10 63.558 ± 0.336 ns/op
>> Base64Decode.testBase64MIMEDecode 4 3 avgt 10 82.504 ± 0.848 ns/op
>> Base64Decode.testBase64MIMEDecode 4 7 avgt 10 120.591 ± 0.608 ns/op
>> Base64Decode.testBase64MIMEDecode 4 32 avgt 10 324.314 ± 6.236 ns/op
>> Base64Decode.testBase64MIMEDecode 4 64 avgt 10 532.678 ± 4.670 ns/op
>> Base64Decode.testBase64MIMEDecode 4 80 avgt 10 678.126 ± 4.324 ns/op
>> Base64Decode.testBase64MIMEDecode 4 96 avgt 10 771.603 ± 6.393 ns/op
>> Base64Decode.testBase64MIMEDecode 4 112 avgt 10 889.608 ± 0.759 ns/op
>> Base64Decode.testBase64MIMEDecode 4 512 avgt 10 3663.557 ± 3.422 ns/op
>> Base64Decode.testBase64MIMEDecode 4 1000 avgt 10 7017.784 ± 9.128 ns/op
>> Base64Decode.testBase64MIMEDecode 4 20000 avgt 10 128670.660 ± 7951.521 ns/op
>> Base64Decode.testBase64MIMEDecode 4 50000 avgt 10 317113.667 ± 161.758 ns/op
>>
>> # Kunpeng916, default
>> Base64Decode.testBase64Decode 4 1 avgt 5 48.455 ± 0.571 ns/op
>> Base64Decode.testBase64Decode 4 3 avgt 5 57.937 ± 0.505 ns/op
>> Base64Decode.testBase64Decode 4 7 avgt 5 73.823 ± 1.452 ns/op
>> Base64Decode.testBase64Decode 4 32 avgt 5 106.484 ± 1.243 ns/op
>> Base64Decode.testBase64Decode 4 64 avgt 5 141.004 ± 1.188 ns/op
>> Base64Decode.testBase64Decode 4 80 avgt 5 156.284 ± 0.572 ns/op
>> Base64Decode.testBase64Decode 4 96 avgt 5 174.137 ± 0.177 ns/op
>> Base64Decode.testBase64Decode 4 112 avgt 5 188.445 ± 0.572 ns/op
>> Base64Decode.testBase64Decode 4 512 avgt 5 610.847 ± 1.559 ns/op
>> Base64Decode.testBase64Decode 4 1000 avgt 5 1155.368 ± 0.813 ns/op
>> Base64Decode.testBase64Decode 4 20000 avgt 5 19751.477 ± 24.669 ns/op
>> Base64Decode.testBase64Decode 4 50000 avgt 5 50046.586 ± 523.155 ns/op
>> Base64Decode.testBase64MIMEDecode 4 1 avgt 10 64.130 ± 0.238 ns/op
>> Base64Decode.testBase64MIMEDecode 4 3 avgt 10 82.096 ± 0.205 ns/op
>> Base64Decode.testBase64MIMEDecode 4 7 avgt 10 118.849 ± 0.610 ns/op
>> Base64Decode.testBase64MIMEDecode 4 32 avgt 10 331.177 ± 4.732 ns/op
>> Base64Decode.testBase64MIMEDecode 4 64 avgt 10 549.117 ± 0.177 ns/op
>> Base64Decode.testBase64MIMEDecode 4 80 avgt 10 702.951 ± 4.572 ns/op
>> Base64Decode.testBase64MIMEDecode 4 96 avgt 10 799.566 ± 0.301 ns/op
>> Base64Decode.testBase64MIMEDecode 4 112 avgt 10 923.749 ± 0.389 ns/op
>> Base64Decode.testBase64MIMEDecode 4 512 avgt 10 4000.725 ± 2.519 ns/op
>> Base64Decode.testBase64MIMEDecode 4 1000 avgt 10 7674.994 ± 9.281 ns/op
>> Base64Decode.testBase64MIMEDecode 4 20000 avgt 10 142059.001 ± 157.920 ns/op
>> Base64Decode.testBase64MIMEDecode 4 50000 avgt 10 355698.369 ± 216.542 ns/op
>
> Dong Bo has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision:
>
> - Merge branch 'master' into aarch64.base64.decode
> - copyright
> - trivial fixes
> - Handling error in SIMD case with loops, combining two non-SIMD cases into one code blob, addressing other comments
> - Merge branch 'master' into aarch64.base64.decode
> - 8256245: AArch64: Implement Base64 decoding intrinsic
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5811:
> 5809: __ ldrb(r12, __ post(src, 1));
> 5810: __ ldrb(r13, __ post(src, 1));
> 5811: // get the de-code
Four loads and four post increments rather than one load and a few BFMs? Why?
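
To illustrate the alternative being asked about, here is a minimal Java sketch, not the stub code itself. It assumes the suggestion is a single 32-bit load followed by bitfield extracts (shift-and-mask here, standing in for instructions of the BFM family such as ubfx). The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LoadSketch {
    // Four separate byte loads, analogous to four ldrb instructions
    // with post-increment addressing.
    static int[] fourLoads(byte[] src, int off) {
        return new int[] {
            src[off] & 0xFF, src[off + 1] & 0xFF,
            src[off + 2] & 0xFF, src[off + 3] & 0xFF
        };
    }

    // One 32-bit little-endian load, then each byte is extracted with a
    // shift and mask (the scalar analogue of a bitfield extract).
    static int[] oneLoad(byte[] src, int off) {
        int w = ByteBuffer.wrap(src).order(ByteOrder.LITTLE_ENDIAN).getInt(off);
        return new int[] {
            w & 0xFF, (w >>> 8) & 0xFF, (w >>> 16) & 0xFF, (w >>> 24) & 0xFF
        };
    }

    public static void main(String[] args) {
        byte[] src = "TWFu".getBytes(StandardCharsets.US_ASCII);
        // Both approaches yield the same four input characters.
        System.out.println(Arrays.equals(fourLoads(src, 0), oneLoad(src, 0)));
    }
}
```

Both methods return the same four character values; the second touches memory once per group of four input bytes instead of four times.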
-------------
PR: https://git.openjdk.java.net/jdk/pull/3228