RFR: 8360934: Add AVX-512 intrinsics for ML-KEM - enhancement on AVX512_VBMI and AVX512_VBMI2 [v2]
Volodymyr Paprotski
vpaprotski at openjdk.org
Wed Jan 7 00:22:35 UTC 2026
On Sat, 3 Jan 2026 00:23:13 GMT, Shawn M Emery <duke at openjdk.org> wrote:
>> This change allows use of the AVX512_VBMI/VBMI2 instruction set to further optimize decompression/parsing of polynomial coefficients for ML-KEM. The speedup gained in the ML-KEM benchmarks for key generation is between 0.2 and 0.5%, encapsulation is 0.3 to 1.5%, and decapsulation is 0 to 0.9%.
>>
>> Thank you to @sviswa7 and @ferakocz for their help in working through the early stages of this code with me.
>
> Shawn M Emery has updated the pull request incrementally with one additional commit since the last revision:
>
> Update copyright year
"Insert 0b0000 nibble after every third nibble". I only have two questions; it looks good otherwise.
PS: things I've considered:
- Loop controls?
  - ML_KEM.java guarantees (per callee comment and assert) lengths are multiple of 64
  - also same as original code
- Why not simply a vpermb? Have zeroes already from the masked load with k1..
  - shuffle granularity is actually 4-bits, not 8-bits
- logical shift already zeroes top bits, so `vpand` not required?
  - odd columns not shifted, so still have extra bits that need clearing
- Why VBMI?
  - needed for `evpermb`
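
For readers following along, the "insert a zero nibble after every third nibble" summary above can be sketched in scalar Java. This is only an illustration under assumptions: the class name, the method signature and the test bytes are hypothetical, and it is not the stub or the actual ML_KEM.java code. It assumes little-endian nibble order, with two 12-bit coefficients packed into every three condensed bytes.

```java
// Hypothetical scalar sketch; names and signature are assumptions for illustration.
class Twelve2SixteenSketch {
    // Unpack 3 condensed bytes into two 12-bit values, each stored in a
    // 16-bit short whose top nibble is zero.
    static void twelve2Sixteen(byte[] condensed, int condensedOffs,
                               short[] parsed, int parsedLength) {
        for (int i = 0; i < parsedLength / 2; i++) {
            int b0 = condensed[condensedOffs + 3 * i] & 0xFF;
            int b1 = condensed[condensedOffs + 3 * i + 1] & 0xFF;
            int b2 = condensed[condensedOffs + 3 * i + 2] & 0xFF;
            // First coefficient: all of b0 plus the low nibble of b1.
            parsed[2 * i] = (short) (b0 | ((b1 & 0x0F) << 8));
            // Second coefficient: the high nibble of b1 plus all of b2.
            parsed[2 * i + 1] = (short) ((b1 >>> 4) | (b2 << 4));
        }
    }

    public static void main(String[] args) {
        // Nibbles 1,2,3,4,5,6 in little-endian order.
        byte[] condensed = {0x21, 0x43, 0x65};
        short[] parsed = new short[2];
        twelve2Sixteen(condensed, 0, parsed, parsed.length);
        // Prints "0321 0654": nibbles 1,2,3 then a zero, nibbles 4,5,6 then a zero.
        System.out.println(String.format("%04x %04x", parsed[0], parsed[1]));
    }
}
```

In this scalar form the first short of each pair needs a mask and the second needs a shift, which is the same split between masking and shifting that the `vpand` question above is about.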
src/hotspot/cpu/x86/stubGenerator_x86_64_kyber.cpp line 862:
> 860: __ addptr(condensed, condensedOffs);
> 861:
> 862: if (VM_Version::supports_avx512_vbmi2()) {
Which instruction needs vbmi2? All I could spot was `evpermb`, which needs only vbmi. Could the restriction be relaxed slightly?
src/hotspot/cpu/x86/stubGenerator_x86_64_kyber.cpp line 906:
> 904: __ addptr(condensed, 192);
> 905: __ addptr(parsed, 256);
> 906: __ subl(parsedLength, 128);
(128 instead of 256 here because `parsedLength` is an index into a `short` array.)
I am confused by the stride. The `twelve2Sixteen()` seems to (almost) guarantee that the parsed length is a multiple of 64 (last block can be 48 bytes). This would imply a stride of 128 bytes for `parsed`. And 96 for `condensed`.
This is exactly how the existing code already behaves, so I am less concerned, but I would like an explanation of why it works.
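
To make the numbers in this comment concrete, here is a small sketch of the byte arithmetic only. The class and method names are hypothetical. It restates how a count of 12-bit coefficients maps to byte strides for `condensed` and `parsed`; it does not answer the question of why the 128-coefficient stride is safe.

```java
// Hypothetical helper; only restates the stride arithmetic discussed above.
class StrideSketch {
    static final int COEFF_BITS = 12;

    // Bytes consumed from the condensed (12-bit packed) input.
    static int condensedBytes(int coeffs) {
        return coeffs * COEFF_BITS / 8;
    }

    // Bytes written to the parsed short[] output.
    static int parsedBytes(int coeffs) {
        return coeffs * Short.BYTES;
    }

    public static void main(String[] args) {
        for (int coeffs : new int[] {64, 128}) {
            System.out.println(coeffs + " coefficients: condensed += "
                    + condensedBytes(coeffs) + ", parsed += " + parsedBytes(coeffs));
        }
    }
}
```

For 128 coefficients this gives the 192/256 byte increments in the quoted loop, and for 64 coefficients it gives the 96/128 figures mentioned above.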
-------------
PR Review: https://git.openjdk.org/jdk/pull/28815#pullrequestreview-3632845110
PR Review Comment: https://git.openjdk.org/jdk/pull/28815#discussion_r2666594767
PR Review Comment: https://git.openjdk.org/jdk/pull/28815#discussion_r2666663039