RFR: 8360934: Add AVX-512 intrinsics for ML-KEM - enhancement on AVX512_VBMI and AVX512_VBMI2 [v2]

Wed Jan 7 16:46:08 UTC 2026

On Wed, 7 Jan 2026 06:19:09 GMT, Shawn M Emery <duke at openjdk.org> wrote:

>> src/hotspot/cpu/x86/stubGenerator_x86_64_kyber.cpp line 906:
>> 
>>> 904:       __ addptr(condensed, 192);
>>> 905:       __ addptr(parsed, 256);
>>> 906:       __ subl(parsedLength, 128);
>> 
>> (128 instead of 256 here because `parsedLength` is an index into a `short` array.)
>> 
>> I am confused by the stride. The `twelve2Sixteen()` seems to (almost) guarantee that the parsed length is a multiple of 64 (last block can be 48 bytes). This would imply a stride of 128 bytes for `parsed`. And 96 for `condensed`.
>> 
>> This is exactly how the existing code already behaves so I am less concerned, but I would like an explanation why it works?
>
> I believe the numbers are right: with each pass 256 bytes of coefficients are `parsed` into the parse buffer.  This means that half of the coefficients have been processed (`parsedLength` = 128).  Would having a comment stating as such address your concerns?

I wasn't clear in my question. The asm does indeed process the number of bytes given by the increments. What I was trying to convince myself of is: how come we are not reading past the end of the array? Or are we?

On one hand, this is exactly what the existing asm code does, so I will assume that it's correct. On the other hand, in the Java version of this code, I could only convince myself that it is safe to process ~two AVX512 vectors at a time, not four.

So either I can't count, or there are some further (implicit) restrictions on the callers of `twelve2Sixteen`.
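
To make the concern concrete, here is a minimal Java sketch of the stride arithmetic. It is not the stub code or the real `twelve2Sixteen`; the class name, the `overrun` helper, and in particular the loop condition (keep looping while the remaining count is positive) are assumptions made for illustration. The three constants are taken from the quoted assembly lines.

```java
public class StrideCheck {
    // Per-pass increments taken from the quoted stub lines.
    static final int COEFFS_PER_PASS = 128;   // subl(parsedLength, 128)
    static final int CONDENSED_STRIDE = 192;  // addptr(condensed, 192), in bytes
    static final int PARSED_STRIDE = 256;     // addptr(parsed, 256), in bytes

    // Number of coefficients touched beyond parsedLength.
    // ASSUMPTION: the loop repeats while the remaining count is positive.
    static int overrun(int parsedLength) {
        int remaining = parsedLength;
        int processed = 0;
        while (remaining > 0) {
            processed += COEFFS_PER_PASS;
            remaining -= COEFFS_PER_PASS;
        }
        return processed - parsedLength;
    }

    public static void main(String[] args) {
        // The strides agree with each other: 12 bits in, 16 bits out per coefficient.
        System.out.println(COEFFS_PER_PASS * 12 / 8 == CONDENSED_STRIDE);
        System.out.println(COEFFS_PER_PASS * 2 == PARSED_STRIDE);

        // Lengths that are multiples of 64 but not necessarily of 128.
        for (int len : new int[] {64, 192, 256}) {
            int extra = overrun(len);
            System.out.println("parsedLength=" + len
                + " extra coefficients=" + extra
                + " extra parsed bytes=" + extra * 2
                + " extra condensed bytes=" + extra * 12 / 8);
        }
    }
}
```

Under these assumptions, a length that is a multiple of 128 has no overrun, while a length that is only a multiple of 64 overshoots by 64 coefficients, i.e. 128 bytes of `parsed` and 96 bytes of `condensed`. Whether that can actually happen depends on what the callers guarantee.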

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/28815#discussion_r2669202305