RFR: 8330611: AES-CTR vector intrinsic may read out of bounds (x86_64, AVX-512)

Wed Apr 24 00:03:28 UTC 2024

On Tue, 23 Apr 2024 20:22:59 GMT, Martin Balao <mbalao at openjdk.org> wrote:

>> We would like to propose a fix for 8330611.
>> 
>> To avoid an out-of-bounds memory read when the input's size is not a multiple of the block size, we read the plaintext/ciphertext tail in 8, 4, 2 and 1 byte batches, depending on how many bytes 'len_reg' guarantees to be available. This replaces the previous behavior of reading 16 bytes of input upfront and later discarding the spurious data.
>> 
>> While we add 3 extra instructions and 3 extra memory reads in the worst case (probably to the same cache line), the performance impact of this fix should be low because it only applies at the end of the input and only when its length is not a multiple of the block size.
>> 
>> A reliable test case for this bug is hard to develop because we would need precise control over heap allocation. The fact that spuriously read data is silently discarded most of the time makes this bug harder to observe. No regressions have been observed in the compiler/codegen/aes jtreg category. Additionally, we verified the fix manually with the debugger.
>> 
>> This work is in collaboration with @franferrax .
>
> The proposed alternative does not look good to us. The `k1` mask has 8-byte granularity: each bit of the mask represents 64 bits of the `xmm0` register in this case. Thus, it is not possible to avoid an out-of-bounds read in all the scenarios that we intend to cover. We verified this with memory watchpoints (triggered on read) and by inspecting the `xmm0` register value after the `xor` operation, for an execution in which the tail had 15 bytes. What follows is a simplified execution that shows the behavior for `k1` masks of 0x1 and 0x2.
> 
> k1 == 0x1:
> 
> (gdb) x/2i $pc
> => 0x7fffe4730bc6:	vpxorq (%rdi,%r12,1),%xmm0,%xmm0{%k1}
>    0x7fffe4730bcd:	test   $0x8,%r8b
> (gdb) print/x $xmm0
> $21 = {
>   v8_bfloat16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
>   v4_float = {0x0, 0x0, 0x0, 0x0},
>   v2_double = {0x0, 0x0},
>   v16_int8 = {0x0 <repeats 16 times>},
>   v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
>   v4_int32 = {0x0, 0x0, 0x0, 0x0},
>   v2_int64 = {0x0, 0x0},
>   uint128 = 0x0
> }
> (gdb) print/x $k1
> $22 = 0x1
> (gdb) x/16xb 0x45f33ddc8
> 0x45f33ddc8:	0x80	0x80	0x80	0x80	0x80	0x80	0x80	0x80
> 0x45f33ddd0:	0x80	0x80	0x80	0x80	0x80	0x80	0x80	0xbd
> (gdb) si
> 0x00007fffe4730bcd in ?? ()
> (gdb) print/x $xmm0
> $23 = {
>   v8_bfloat16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
>   v4_float = {0x0, 0x0, 0x0, 0x0},
>   v2_double = {0x0, 0x0},
>   v16_int8 = {0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
>   v8_int16 = {0x8080, 0x8080, 0x8080, 0x8080, 0x0, 0x0, 0x0, 0x0},
>   v4_int32 = {0x80808080, 0x80808080, 0x0, 0x0},
>   v2_int64 = {0x8080808080808080, 0x0},
>   uint128 = 0x8080808080808080
> }
> 
> 
> A mask of 0x1 permitted the write of the lower 64 bits of the `xmm0` register. This corresponds to the first 8 bytes in memory (little endian).
> 
> k1 == 0x2:
> 
> (gdb) x/2i $pc
> => 0x7fffe4730bc6:	vpxorq (%rdi,%r12,1),%xmm0,%xmm0{%k1}
>    0x7fffe4730bcd:	test   $0x8,%r8b
> (gdb) print/x $xmm0
> $18 = {
>   v8_bfloat16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
>   v4_float = {0x0, 0x0, 0x0, 0x0},
>   v2_double = {0x0, 0x0},
>   v16_int8 = {0x0 <repeats 16 times>},
>   v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
>   v4_int32 = {0x0, 0x0, 0x0, 0x0},
>   v2_int64 = {0x0, 0x0},
>   uint128 = 0x0
> }
> (gdb) print/x $k1
> $19 = 0x2
> (gdb) x/16xb 0x45f33ddc8
> 0x45f33ddc8:	0x80	0x80	0x80	0x80	0x80	0x80	0x80	0x80
> 0x45f33ddd0:	0x80	0x80	0x80	0x80	0x80	0x80	0x80	0xbd
> (gdb) si
> 0x00007fffe4730bcd in ?? ()
> (gdb) print/x $xmm0
> $20 = {
>   v8_bfloat16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
>   v4_float = {0x0, 0x0,...

@martinuy You are right that evpxorq has an 8-byte mask granularity and so cannot be used for tail processing. I have one comment on your original PR; please take a look.
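
To make the granularity argument concrete, here is a minimal Java sketch. It is a hypothetical model, not HotSpot code: the class and method names (TailMaskSketch, coversExactly, hasExactMask, tailBatches) are invented for illustration, and it only tracks which byte offsets of a 16-byte block a qword-granular mask selects. It assumes that each mask bit selects a whole 8-byte lane, as the gdb session above shows, and that the 8, 4, 2 and 1 byte batches are chosen by testing the bits of the tail length.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model for illustration only; it does not touch real memory.
class TailMaskSketch {
    static final int QWORD_BYTES = 8; // one mask bit of a qword-masked op covers 8 bytes
    static final int XMM_QWORDS = 2;  // a 128-bit xmm register holds 2 qwords

    // True if the mask selects exactly the byte offsets [0, tailLen) and nothing else.
    static boolean coversExactly(int mask, int tailLen) {
        for (int b = 0; b < XMM_QWORDS * QWORD_BYTES; b++) {
            boolean selected = ((mask >> (b / QWORD_BYTES)) & 1) != 0;
            boolean wanted = b < tailLen;
            if (selected != wanted) {
                return false;
            }
        }
        return true;
    }

    // True if any of the four possible 2-bit masks matches the tail exactly.
    static boolean hasExactMask(int tailLen) {
        for (int mask = 0; mask < (1 << XMM_QWORDS); mask++) {
            if (coversExactly(mask, tailLen)) {
                return true;
            }
        }
        return false;
    }

    // Splits a tail of 1..15 bytes into 8, 4, 2 and 1 byte reads,
    // one read per set bit of tailLen (assumed selection scheme).
    static List<Integer> tailBatches(int tailLen) {
        List<Integer> batches = new ArrayList<>();
        for (int size = 8; size >= 1; size >>= 1) {
            if ((tailLen & size) != 0) {
                batches.add(size);
            }
        }
        return batches;
    }

    public static void main(String[] args) {
        for (int len = 1; len <= 15; len++) {
            System.out.println(len + " bytes: exact qword mask? " + hasExactMask(len)
                    + ", batches " + tailBatches(len));
        }
    }
}
```

In this model, a 15-byte tail has no exact mask: 0x1 selects only the first 8 bytes and 0x3 selects all 16, one byte past the end. The batch split, by contrast, always sums to the tail length, so it never selects a byte beyond it.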

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18849#issuecomment-2073680623