RFR: 8360934: Add AVX-512 intrinsics for ML-KEM - enhancement on AVX512_VBMI and AVX512_VBMI2 [v2]

Volodymyr Paprotski vpaprotski at openjdk.org
Wed Jan 7 00:22:35 UTC 2026


On Sat, 3 Jan 2026 00:23:13 GMT, Shawn M Emery <duke at openjdk.org> wrote:

>> This change allows use of the AVX512_VBMI/VBMI2 instruction sets to further optimize decompression/parsing of polynomial coefficients for ML-KEM. The speedup gained in the ML-KEM benchmarks is 0.2 to 0.5% for key generation, 0.3 to 1.5% for encapsulation, and 0 to 0.9% for decapsulation.
>> 
>> Thank you to @sviswa7 and @ferakocz for their help in working through the early stages of this code with me.
>
> Shawn M Emery has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Update copyright year

"Insert 0b0000 nibble after every third nibble". I only have two questions, looks good otherwise.


PS: things I've considered:

- Loop controls?
  - ML_KEM.java guarantees (per callee comment and assert) that lengths are a multiple of 64
  - also same as original code
- Why not simply a `vpermb`? We already have zeroes from the masked load with k1.
  - shuffle granularity is actually 4-bits, not 8-bits
- logical shift already zeroes top bits, so `vpand` not required?
  - odd columns not shifted, so still have extra bits that need clearing
- Why VBMI?
  - needed for `evpermb`
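
For reference, the scalar operation behind the "insert a zero nibble after every third nibble" summary can be sketched as below. This is a minimal illustration in Java, not the code from ML_KEM.java or the stub; the class name `TwelveToSixteenSketch`, the method `unpack`, and its parameter names are hypothetical, and little-endian 12-bit packing is assumed.

```java
// Hypothetical sketch, not the JDK implementation.
final class TwelveToSixteenSketch {

    // Unpacks every 3 condensed bytes (two packed 12-bit values, assumed
    // little-endian) into two 16-bit shorts whose top nibble is zero.
    static void unpack(byte[] condensed, int offset, short[] parsed, int parsedLength) {
        for (int i = 0; i < parsedLength / 2; i++) {
            int b0 = condensed[offset + 3 * i] & 0xff;
            int b1 = condensed[offset + 3 * i + 1] & 0xff;
            int b2 = condensed[offset + 3 * i + 2] & 0xff;
            // First value: all of b0 plus the low nibble of b1 (mask clears the rest).
            parsed[2 * i] = (short) (b0 | ((b1 & 0x0f) << 8));
            // Second value: the high nibble of b1 (shift clears the rest) plus all of b2.
            parsed[2 * i + 1] = (short) ((b1 >>> 4) | (b2 << 4));
        }
    }

    public static void main(String[] args) {
        byte[] condensed = {(byte) 0xBC, (byte) 0xFA, (byte) 0xDE};
        short[] parsed = new short[2];
        unpack(condensed, 0, parsed, parsed.length);
        System.out.println(Integer.toHexString(parsed[0]) + " " + Integer.toHexString(parsed[1]));
    }
}
```

In this sketch one of the two outputs has its upper bits cleared by a mask and the other by a shift, which seems to be the scalar analogue of the `vpand` point above.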

src/hotspot/cpu/x86/stubGenerator_x86_64_kyber.cpp line 862:

> 860:   __ addptr(condensed, condensedOffs);
> 861: 
> 862:   if (VM_Version::supports_avx512_vbmi2()) {

Which instruction needs vbmi2? All I could spot was `evpermb`, which needs only vbmi. Could the restriction be relaxed slightly?

src/hotspot/cpu/x86/stubGenerator_x86_64_kyber.cpp line 906:

> 904:       __ addptr(condensed, 192);
> 905:       __ addptr(parsed, 256);
> 906:       __ subl(parsedLength, 128);

(128 instead of 256 here because `parsedLength` is an index into a `short` array.)

I am confused by the stride. `twelve2Sixteen()` seems to (almost) guarantee that the parsed length is a multiple of 64 (the last block can be 48 bytes). This would imply a stride of 128 bytes for `parsed` and 96 bytes for `condensed`.

This is exactly how the existing code already behaves, so I am less concerned, but I would like an explanation of why it works.
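
To make the byte counts in the question concrete, here is a small sketch of the arithmetic only. The class and method names are hypothetical; it assumes 12 bits per condensed coefficient and one `short` per parsed coefficient, and says nothing about why the stub's loop is correct.

```java
// Hypothetical sketch of the stride arithmetic, not code from the PR.
final class StrideSketch {

    // Bytes written to `parsed` for a given number of 16-bit coefficients.
    static int parsedBytes(int coefficients) {
        return coefficients * Short.BYTES;
    }

    // Bytes read from `condensed` for the same number of 12-bit coefficients.
    static int condensedBytes(int coefficients) {
        return coefficients * 12 / 8;
    }

    public static void main(String[] args) {
        // Stride in the quoted loop: 128 coefficients per iteration.
        System.out.println(parsedBytes(128) + " / " + condensedBytes(128));
        // Stride implied by a multiple-of-64 guarantee.
        System.out.println(parsedBytes(64) + " / " + condensedBytes(64));
    }
}
```

With 128 coefficients per iteration this gives the 256 and 192 seen in the `addptr` calls; with 64 it gives the 128 and 96 mentioned above.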

-------------

PR Review: https://git.openjdk.org/jdk/pull/28815#pullrequestreview-3632845110
PR Review Comment: https://git.openjdk.org/jdk/pull/28815#discussion_r2666594767
PR Review Comment: https://git.openjdk.org/jdk/pull/28815#discussion_r2666663039


More information about the hotspot-compiler-dev mailing list