RFR: 8371864: GaloisCounterMode.implGCMCrypt0 AVX512/AVX2 intrinsics stubs cause AES-GCM encryption failure for certain payload sizes [v10]
Jiangli Zhou
jiangli at openjdk.org
Sat Nov 29 04:59:53 UTC 2025
On Fri, 28 Nov 2025 09:42:22 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:
> Oh man, the `pos`/`len` modifications in current code are confusing. I scratched my head for quite a while trying to comprehend why does `__ bind(MESG_BELOW_32_BLKS)` split the `pos += 16` and `len -= 16`? On a surface, that just looks like a bug.
The combination of handling for the fall through from `ENCRYPT_16_BLKS` and conditional entry to `MESG_BELOW_32_BLKS` cases are subtle.
I had also missed the fall through case in my initial proposed fix (with comp/jcc) until @sviswa7 pointed it out and suggested the current fix. The fix for `StubGenerator::aesgcm_avx512` with moving `__ addl(pos, 16 * 16)` to be before `__ bind(MESG_BELOW_32_BLKS)` works correctly for both the fall through and conditional jump cases now.
>
> But looks that way because we do `initial_blocks_16_avx512` twice, do `pos += 16` twice, but only do the `len += 32` after the second call. Which does not help if we shortcut after the first call. In fact, I am not at all sure that checking `len < 32` _without_ modifying `len` beforehand does not break the assumptions downstream:
>
> ```
> initial_blocks_16_avx512(in, out, ct, pos, key, avx512_subkeyHtbl, CTR_CHECK, rounds, CTR_BLOCKx, AAD_HASHx, ADDBE_4x4, ADDBE_1234, ADD_1234, SHUF_MASK, stack_offset);
> __ addl(pos, 16 * 16);
> __ cmpl(len, 32 * 16);
> __ jcc(Assembler::below, MESG_BELOW_32_BLKS);
> ```
>
> Really, in these kind of intrinsics, _you want_ to make sure `pos` and `len` updates are tightly bound together, otherwise these kinds of mistakes would keep happening. You will lose on code density a bit, but would have more readable and robust code.
>
> Shouldn't it be like this?
>
> ```
> diff --git a/src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp b/src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp
> index 1e728ffa279..a16e25b075d 100644
> --- a/src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp
> +++ b/src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp
> @@ -3475,12 +3475,14 @@ void StubGenerator::aesgcm_avx512(Register in, Register len, Register ct, Regist
>
> initial_blocks_16_avx512(in, out, ct, pos, key, avx512_subkeyHtbl, CTR_CHECK, rounds, CTR_BLOCKx, AAD_HASHx, ADDBE_4x4, ADDBE_1234, ADD_1234, SHUF_MASK, stack_offset);
> __ addl(pos, 16 * 16);
> + __ subl(len, 16 * 16);
> +
> __ cmpl(len, 32 * 16);
> __ jcc(Assembler::below, MESG_BELOW_32_BLKS);
>
> initial_blocks_16_avx512(in, out, ct, pos, key, avx512_subkeyHtbl, CTR_CHECK, rounds, CTR_BLOCKx, AAD_HASHx, ADDBE_4x4, ADDBE_1234, ADD_1234, SHUF_MASK, stack_offset + 16);
> __ addl(pos, 16 * 16);
> - __ subl(len, 32 * 16);
> + __ subl(len, 16 * 16);
>
> __ cmpl(len, 32 * 16);
> __ jcc(Assembler::below, NO_BIG_BLKS);
> @@ -3491,24 +3493,27 @@ void StubGenerator::aesgcm_avx512(Register in, Register len, Register ct, Regist
> ghash16_encrypt_parallel16_avx512(in, out, ct, pos, avx512_subkeyHtbl, CTR_CHECK, rounds, key, CTR_BLOCKx, AAD_HASHx, ADDBE_4x4, ADDBE_1234, ADD_1234, SHUF_MASK,
> true, true, false, false, false, ghashin_offset, aesout_offset, HashKey_32);
> __ addl(pos, 16 * 16);
> + __ subl(len, 16 * 16);
>
> ghash16_encrypt_parallel16_avx512(in, out, ct, pos, avx512_subkeyHtbl, CTR_CHECK, rounds, key, CTR_BLOCKx, AAD_HASHx, ADDBE_4x4, ADDBE_1234, ADD_1234, SHUF_MASK,
> true, false, true, false, true, ghashin_offset + 16, aesout_offset + 16, HashKey_16);
> __ evmovdquq(AAD_HASHx, ZTMP4, Assembler::AVX_512bit);
> __ addl(pos, 16 * 16);
> - __ subl(len, 32 * 16);
> + __ subl(len, 16 * 16);
> __ jmp(ENCRYPT_BIG_BLKS_NO_HXOR);
>
> __ bind(ENCRYPT_BIG_NBLKS);
> ghash16_encrypt_parallel16_avx512(in, out, ct, pos, avx512_subkeyHtbl, CTR_CHECK, rounds, key, CTR_BLOCKx, AAD_HASHx, ADDBE_4x4, ADDBE_1234, ADD_1234, SHUF_MASK,
> false, true, false, false, false, ghashin_offset, aesout_offset, HashKey_32);
> __ addl(pos, 16 * 16);
> + __ subl(len, 16 * 16);
> +
> ghash16_encrypt_parallel16_avx512(in, out, ct, pos, avx512_subkeyHtbl, CTR_CHECK, rounds, key, CTR_BLOCKx, AAD_HASHx, ADDBE_4x4, ADDBE_1234, ADD_1234, SHUF_MASK,
> false, false, true, true, true, ghashin_offset + 16, aesout_offset + 16, HashKey_16);
>
> __ movdqu(AAD_HASHx, ZTMP4);
> __ addl(pos, 16 * 16);
> - __ subl(len, 32 * 16);
> + __ subl(len, 16 * 16);
>
> __ bind(NO_BIG_BLKS);
> __ cmpl(len, 16 * 16);
> @@ -3525,9 +3530,9 @@ void StubGenerator::aesgcm_avx512(Register in, Register len, Register ct, Regist
>
> ghash16_avx512(false, true, false, false, true, in, pos, avx512_subkeyHtbl, AAD_HASHx, SHUF_MASK, stack_offset, 16 * 16, 0, HashKey_16);
> __ addl(pos, 16 * 16);
> + __ subl(len, 16 * 16);
>
> __ bind(MESG_BELOW_32_BLKS);
> - __ subl(len, 16 * 16);
> gcm_enc_dec_last_avx512(len, in, pos, AAD_HASHx, SHUF_MASK, avx512_subkeyHtbl, ghashin_offset, HashKey_16, true, true);
>
> __ bind(GHASH_DONE);
> ```
Improving readability is a good idea, hand-rolled assembly however is mostly motivated by performance. While `sub` with immediate value is fast and takes one cpu cycle, I would agree with the original author of `aesgcm_avx512` on combining two `sub` instructions into one instruction whenever possible.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/28363#issuecomment-3590992730
More information about the hotspot-compiler-dev
mailing list