RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v7]
Vladimir Kozlov
kvn at openjdk.java.net
Thu Jun 24 06:11:36 UTC 2021
On Wed, 23 Jun 2021 00:31:55 GMT, Scott Gibbons <github.com+6704669+asgibbons at openjdk.org> wrote:
>> Add the Base64 Decode intrinsic for x86 to utilize AVX-512 for acceleration. Also allows for performance improvement for non-AVX-512 enabled platforms. Due to the nature of MIME-encoded inputs, modify the intrinsic signature to accept an additional parameter (isMIME) for fast-path MIME decoding.
>>
>> A change was made to the signature of DecodeBlock in Base64.java to provide the intrinsic information as to whether MIME decoding was being done. This allows for the intrinsic to bypass the expensive setup of zmm registers from AVX tables, knowing there may be invalid Base64 characters every 76 characters or so. A change was also made here removing the restriction that the intrinsic must return an even multiple of 3 bytes decoded. This implementation handles the pad characters at the end of the string and will return the actual number of characters decoded.
>>
>> The AVX portion of this code decodes in blocks of 256 bytes per loop iteration, then in chunks of 64 bytes, followed by end fixup decoding. The non-AVX code is an assembly-optimized version of the Java DecodeBlock and behaves identically.
>>
>> Running the Base64Decode benchmark, this change increases decode performance by an average of 2.6x, with a maximum of 19.7x at a buffer size of 20000. The numbers are given in the table below.
>>
>> **Base Score** is without intrinsic support, **Optimized Score** is using this intrinsic, and **Gain** is **Base** / **Optimized**.
>>
>>
>> Benchmark Name | Base Score | Optimized Score | Gain
>> -- | -- | -- | --
>> testBase64Decode size 1 | 15.36 | 15.32 | 1.00
>> testBase64Decode size 3 | 17.00 | 16.72 | 1.02
>> testBase64Decode size 7 | 20.60 | 18.82 | 1.09
>> testBase64Decode size 32 | 34.21 | 26.77 | 1.28
>> testBase64Decode size 64 | 54.43 | 38.35 | 1.42
>> testBase64Decode size 80 | 66.40 | 48.34 | 1.37
>> testBase64Decode size 96 | 73.16 | 52.90 | 1.38
>> testBase64Decode size 112 | 84.93 | 51.82 | 1.64
>> testBase64Decode size 512 | 288.81 | 32.04 | 9.01
>> testBase64Decode size 1000 | 560.48 | 40.79 | 13.74
>> testBase64Decode size 20000 | 9530.28 | 483.37 | 19.72
>> testBase64Decode size 50000 | 24552.24 | 1735.07 | 14.15
>> testBase64MIMEDecode size 1 | 22.87 | 21.36 | 1.07
>> testBase64MIMEDecode size 3 | 27.79 | 25.32 | 1.10
>> testBase64MIMEDecode size 7 | 44.74 | 43.81 | 1.02
>> testBase64MIMEDecode size 32 | 142.69 | 129.56 | 1.10
>> testBase64MIMEDecode size 64 | 256.90 | 243.80 | 1.05
>> testBase64MIMEDecode size 80 | 311.60 | 310.80 | 1.00
>> testBase64MIMEDecode size 96 | 364.00 | 346.66 | 1.05
>> testBase64MIMEDecode size 112 | 472.88 | 394.78 | 1.20
>> testBase64MIMEDecode size 512 | 1814.96 | 1671.28 | 1.09
>> testBase64MIMEDecode size 1000 | 3623.50 | 3227.61 | 1.12
>> testBase64MIMEDecode size 20000 | 70484.09 | 64940.77 | 1.09
>> testBase64MIMEDecode size 50000 | 191732.34 | 158158.95 | 1.21
>> testBase64WithErrorInputsDecode size 1 | 1531.02 | 1185.19 | 1.29
>> testBase64WithErrorInputsDecode size 3 | 1306.59 | 1170.99 | 1.12
>> testBase64WithErrorInputsDecode size 7 | 1238.11 | 1176.62 | 1.05
>> testBase64WithErrorInputsDecode size 32 | 1346.46 | 1138.47 | 1.18
>> testBase64WithErrorInputsDecode size 64 | 1195.28 | 1172.52 | 1.02
>> testBase64WithErrorInputsDecode size 80 | 1469.00 | 1180.94 | 1.24
>> testBase64WithErrorInputsDecode size 96 | 1434.48 | 1167.74 | 1.23
>> testBase64WithErrorInputsDecode size 112 | 1440.06 | 1162.56 | 1.24
>> testBase64WithErrorInputsDecode size 512 | 1362.79 | 1193.42 | 1.14
>> testBase64WithErrorInputsDecode size 1000 | 1426.07 | 1194.44 | 1.19
>> testBase64WithErrorInputsDecode size 20000 | 1398.44 | 1138.17 | 1.23
>> testBase64WithErrorInputsDecode size 50000 | 1409.41 | 1114.16 | 1.26
>
> Scott Gibbons has updated the pull request incrementally with one additional commit since the last revision:
>
> Fixing Windows build warnings
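
For readers following the quoted description, here is a minimal Java sketch of two points it makes: the 256-byte / 64-byte / tail split, and the MIME and padding behaviour of the public java.util.Base64 API. This is not the intrinsic's code. The class DecodeSketch and its methods are hypothetical names, the sizes are assumed to refer to input bytes, and the real stub may apply thresholds that this arithmetic ignores.

```java
import java.util.Arrays;
import java.util.Base64;

public class DecodeSketch {
    // Hypothetical helper: splits an input length into the three phases the
    // description names. Returns {256-byte blocks, 64-byte chunks, tail bytes}.
    static int[] splitLength(int len) {
        int blocks256 = len / 256;
        int rest = len % 256;
        return new int[] { blocks256, rest / 64, rest % 64 };
    }

    // Round-trips data through the MIME codec. The MIME encoder breaks its
    // output into lines of at most 76 characters separated by "\r\n", and the
    // MIME decoder skips characters outside the Base64 alphabet.
    static boolean mimeRoundTrip(byte[] data) {
        String encoded = Base64.getMimeEncoder().encodeToString(data);
        byte[] decoded = Base64.getMimeDecoder().decode(encoded);
        return Arrays.equals(data, decoded);
    }

    // Number of bytes the basic decoder produces for a given input string.
    static int decodedLength(String base64) {
        return Base64.getDecoder().decode(base64).length;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitLength(1000))); // [3, 3, 40]
        // "QQ==" carries two pad characters and decodes to a single byte.
        System.out.println(decodedLength("QQ=="));              // 1
        System.out.println(mimeRoundTrip(new byte[100]));       // true
    }
}
```

The line separators are why the description treats MIME input as a separate fast path: a wide vector loop would meet non-alphabet characters roughly every 76 characters.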
I hit a strange failure in the compiler/intrinsics/base64/TestBase64.java test on a Windows machine which has an Intel 8167M CPU (AVX512).
# EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x00007ff92bcbd99e, pid=24628, tid=6804
#
# Problematic frame:
# V [jvm.dll+0xabd99e] ObjectMonitor::object_peek+0xe
#
Current thread (0x0000016c923de2c0): JavaThread "MainThread" [_thread_in_Java, id=6804, stack(0x00000060df600000,0x00000060df700000)]
Stack: [0x00000060df600000,0x00000060df700000], sp=0x00000060df6fcb50, free space=1010k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [jvm.dll+0xabd99e] ObjectMonitor::object_peek+0xe (objectMonitor.cpp:304)
V [jvm.dll+0xc48d5b] ObjectSynchronizer::quick_enter+0x9b (synchronizer.cpp:331)
V [jvm.dll+0xb9b6f6] SharedRuntime::monitor_enter_helper+0x36 (sharedRuntime.cpp:2112)
V [jvm.dll+0x389894] Runtime1::monitorenter+0x94 (c1_Runtime1.cpp:748)
C 0x0000016c99c4a757
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
v ~RuntimeStub::monitorenter_nofpu Runtime1 stub
J 40 c1 java.util.concurrent.ConcurrentHashMap.putVal(Ljava/lang/Object;Ljava/lang/Object;Z)Ljava/lang/Object; java.base at 18-internal (432 bytes) @ 0x0000016c9a1801f8 [0x0000016c9a17e6a0+0x0000000000001b58]
J 43 c1 java.util.concurrent.ConcurrentHashMap.putIfAbsent(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object; java.base at 18-internal (8 bytes) @ 0x0000016c9a181c34 [0x0000016c9a181bc0+0x0000000000000074]
j java.lang.ClassLoader.getClassLoadingLock(Ljava/lang/String;)Ljava/lang/Object;+23 java.base at 18-internal
j jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(Ljava/lang/String;Z)Ljava/lang/Class;+2 java.base at 18-internal
j jdk.internal.loader.BuiltinClassLoader.loadClass(Ljava/lang/String;Z)Ljava/lang/Class;+3 java.base at 18-internal
j jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(Ljava/lang/String;Z)Ljava/lang/Class;+36 java.base at 18-internal
j java.lang.ClassLoader.loadClass(Ljava/lang/String;)Ljava/lang/Class;+3 java.base at 18-internal
v ~StubRoutines::call_stub
j compiler.intrinsics.base64.TestBase64.test0(Lcompiler/intrinsics/base64/TestBase64$FileType;Lcompiler/intrinsics/base64/TestBase64$Base64Type;Ljava/util/Base64$Encoder;Ljava/util/Base64$Decoder;Ljava/lang/String;Ljava/lang/String;I)V+25
j compiler.intrinsics.base64.TestBase64.main([Ljava/lang/String;)V+116
v ~StubRoutines::call_stub
j jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Ljava/lang/reflect/Method;Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+0 java.base at 18-internal
j jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+133 java.base at 18-internal
j jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+6 java.base at 18-internal
j java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+59 java.base at 18-internal
j com.sun.javatest.regtest.agent.MainWrapper$MainThread.run()V+172
j java.lang.Thread.run()V+11 java.base at 18-internal
v ~StubRoutines::call_stub
siginfo: EXCEPTION_ACCESS_VIOLATION (0xc0000005), reading address 0x00000000000000bc
Register to memory mapping:
RIP=0x00007ff92bcbd99e jvm.dll::ObjectMonitor::object_peek + 0xe
RAX=0x00000000000000ac is an unknown value
RBX=0x00000000000000ac is an unknown value
RCX=0x00000000000000ac is an unknown value
RDX=0x0 is NULL
RSP=0x00000060df6fcb50 is pointing into the stack for thread: 0x0000016c923de2c0
RBP=0x00000060df6fd110 is pointing into the stack for thread: 0x0000016c923de2c0
RSI=0x0000016c923de2c0 is a thread
RDI=0x0000016c923de2c0 is a thread
R8 =0x00000060df6fd1f0 is pointing into the stack for thread: 0x0000016c923de2c0
R9 =0x00000000000002f8 is an unknown value
R10=0x00007ff92b589800 jvm.dll::Runtime1::monitorenter + 0x0
R11=0x00000060df6fcc78 is pointing into the stack for thread: 0x0000016c923de2c0
R12=0x0 is NULL
R13=0x0000000000000200 is an unknown value
R14=0x0000000000000396 is an unknown value
R15=0x0000016c923de2c0 is a thread
Registers:
RAX=0x00000000000000ac, RBX=0x00000000000000ac, RCX=0x00000000000000ac, RDX=0x0000000000000000
RSP=0x00000060df6fcb50, RBP=0x00000060df6fd110, RSI=0x0000016c923de2c0, RDI=0x0000016c923de2c0
R8 =0x00000060df6fd1f0, R9 =0x00000000000002f8, R10=0x00007ff92b589800, R11=0x00000060df6fcc78
R12=0x0000000000000000, R13=0x0000000000000200, R14=0x0000000000000396, R15=0x0000016c923de2c0
RIP=0x00007ff92bcbd99e, EFLAGS=0x0000000000010206
Top of Stack: (sp=0x00000060df6fcb50)
0x00000060df6fcb50: 0000016c923de2c0 0000000000000000
0x00000060df6fcb60: 0000000000000000 00007ff92b8980a0
0x00000060df6fcb70: 0000016c923de2c0 00007ff92be48d5b
0x00000060df6fcb80: 00000000000000ac 000000074bd727d0
0x00000060df6fcb90: 0000000000000000 0000000000000000
0x00000060df6fcba0: 0000000000000000 00007ff92c1de2b0
0x00000060df6fcbb0: 0000016c923de2c0 00007ff92b8980a0
0x00000060df6fcbc0: 00000060df6fd1f0 00007ff92bd9b6f6
0x00000060df6fcbd0: 000000074bd727d0 0000016c923de2c0
0x00000060df6fcbe0: 00000060df6fd1f0 0000016c923de2c0
0x00000060df6fcbf0: 0000000000000000 0000000000000000
0x00000060df6fcc00: 0000000000000000 0000000000000000
0x00000060df6fcc10: 0000000000000000 0000000000000000
0x00000060df6fcc20: 000000074bd727d0 00007ff92b589894
0x00000060df6fcc30: 000000074bd727d0 00000060df6fd1f0
0x00000060df6fcc40: 0000016c923de2c0 00007ff92b8980a0
Instructions: (pc=0x00007ff92bcbd99e)
0x00007ff92bcbd89e: ff 48 8b c8 48 8b d8 48 8b 10 ff 52 48 48 8b 13
0x00007ff92bcbd8ae: 48 8b cb 84 c0 0f 84 83 00 00 00 ff 52 48 84 c0
0x00007ff92bcbd8be: 75 24 4c 8d 0d f1 7b 2e 00 ba 91 05 00 00 4c 8d
0x00007ff92bcbd8ce: 05 05 7c 2e 00 48 8d 0d c6 8b 2d 00 e8 71 aa a0
0x00007ff92bcbd8de: ff e8 3c c3 01 00 8b 83 88 03 00 00 83 c0 fa a9
0x00007ff92bcbd8ee: fd ff ff ff 74 23 4c 8d 0d c5 25 4f 00 41 b8 05
0x00007ff92bcbd8fe: 01 00 00 48 8d 15 e0 25 4f 00 b9 00 00 00 e0 e8
0x00007ff92bcbd90e: 4e a7 a0 ff e8 09 c3 01 00 48 8b 03 48 8b cb ff
0x00007ff92bcbd91e: 90 b8 00 00 00 84 c0 75 40 4c 8d 0d fa 25 4f 00
0x00007ff92bcbd92e: ba 07 01 00 00 4c 8d 05 0e 26 4f 00 eb 1a ff 52
0x00007ff92bcbd93e: 40 84 c0 75 24 4c 8d 0d 7e 86 2d 00 ba 0b 01 00
0x00007ff92bcbd94e: 00 4c 8d 05 22 26 4f 00 48 8d 0d 8b 25 4f 00 e8
0x00007ff92bcbd95e: ee a9 a0 ff e8 b9 c2 01 00 48 8b 44 24 30 48 8b
0x00007ff92bcbd96e: 48 10 48 85 c9 75 08 33 c0 48 83 c4 20 5b c3 48
0x00007ff92bcbd97e: 83 c4 20 5b 48 ff 25 8f f4 6f 00 cc cc cc cc cc
0x00007ff92bcbd98e: cc cc 48 89 4c 24 08 48 83 ec 28 48 8b 44 24 30
0x00007ff92bcbd99e: 48 8b 48 10 48 85 c9 75 07 33 c0 48 83 c4 28 c3
0x00007ff92bcbd9ae: 48 83 c4 28 48 ff 25 5f f5 6f 00 cc cc cc cc cc
0x00007ff92bcbd9be: cc cc 48 89 5c 24 18 48 89 54 24 10 48 89 4c 24
0x00007ff92bcbd9ce: 08 57 48 83 ec 20 48 8b 5c 24 30 48 8d 15 68 37
0x00007ff92bcbd9de: 4f 00 48 8b 7c 24 38 4c 8b c3 48 8b cf e8 50 8f
0x00007ff92bcbd9ee: 02 00 4c 8b 43 08 48 8d 15 6d 37 4f 00 48 8b cf
0x00007ff92bcbd9fe: e8 3d 8f 02 00 48 8b 4b 10 48 85 c9 75 04 33 c0
0x00007ff92bcbda0e: eb 06 ff 15 02 f5 6f 00 4c 8b c0 48 8d 15 60 37
0x00007ff92bcbda1e: 4f 00 48 8b cf e8 18 8f 02 00 48 8d 15 69 37 4f
0x00007ff92bcbda2e: 00 48 8b cf e8 09 8f 02 00 48 8d 15 6a 37 4f 00
0x00007ff92bcbda3e: 48 8b cf e8 fa 8e 02 00 48 8d 15 6b 37 4f 00 48
0x00007ff92bcbda4e: 8b cf e8 eb 8e 02 00 41 b8 2f 00 00 00 48 8d 15
0x00007ff92bcbda5e: 5e 37 4f 00 48 8b cf e8 d6 8e 02 00 48 8d 15 5f
0x00007ff92bcbda6e: 37 4f 00 48 8b cf e8 c7 8e 02 00 4c 8b 43 48 48
0x00007ff92bcbda7e: 8d 15 54 37 4f 00 48 8b cf e8 b4 8e 02 00 4c 8b
0x00007ff92bcbda8e: 43 50 48 8d 15 59 37 4f 00 48 8b cf e8 a1 8e 02
Stack slot to memory mapping:
stack at sp + 0 slots: 0x0000016c923de2c0 is a thread
stack at sp + 1 slots: 0x0 is NULL
stack at sp + 2 slots: 0x0 is NULL
stack at sp + 3 slots: 0x00007ff92b8980a0 jvm.dll::VMEntryWrapper::VMEntryWrapper + 0x110
stack at sp + 4 slots: 0x0000016c923de2c0 is a thread
stack at sp + 5 slots: 0x00007ff92be48d5b jvm.dll::ObjectSynchronizer::quick_enter + 0x9b
stack at sp + 6 slots: 0x00000000000000ac is an unknown value
stack at sp + 7 slots: 0x000000074bd727d0 is an oop: java.util.concurrent.ConcurrentHashMap$Node
{0x000000074bd727d0} - klass: 'java/util/concurrent/ConcurrentHashMap$Node'
- ---- fields (total size 4 words):
- final 'hash' 'I' @12 683507634 (28bd7fb2)
- final 'key' 'Ljava/lang/Object;' @16 "java.util.Base64"{0x000000074bd72788} (e97ae4f1)
- volatile 'val' 'Ljava/lang/Object;' @20 a 'java/lang/Object'{0x000000074bd727c0} (e97ae4f8)
- volatile 'next' 'Ljava/util/concurrent/ConcurrentHashMap$Node;' @24 NULL (0)
-------------
PR: https://git.openjdk.java.net/jdk/pull/4368