RFR: 8330611: AES-CTR vector intrinsic may read out of bounds (x86_64, AVX-512)

Tue Apr 23 20:25:28 UTC 2024

On Fri, 19 Apr 2024 00:04:41 GMT, Martin Balao <mbalao at openjdk.org> wrote:

> We would like to propose a fix for 8330611.
> 
> To avoid an out of bounds memory read when the input's size is not multiple of the block size, we read the plaintext/ciphertext tail in 8, 4, 2 and 1 byte batches depending on what it is guaranteed to be available by 'len_reg'. This behavior replaces the read of 16 bytes of input upfront and later discard of spurious data.
> 
> While we add 3 extra instructions + 3 extra memory reads in the worst case —to the same cache line probably—, the performance impact of this fix should be low because it only occurs at the end of the input and when its length is not multiple of the block size.
> 
> A reliable test case for this bug is hard to develop because we would need accurate heap allocation. The fact that spuriously read data is silently discarded most of the time makes this bug harder to observe. No regressions have been observed in the compiler/codegen/aes jtreg category. Additionally, we verified the fix manually with the debugger.
> 
> This work is in collaboration with @franferrax .

The proposed alternative does not look good to us. The `k1` mask has 8-bytes granularity: each bit of the mask represents 64 bits of the `xmm0` register in this case. Thus, it is not possible to avoid an out of bounds read for all scenarios that we intend to cover. We verified this with memory watchpoints —hit upon read— and looking at the `xmm0` register value after the `xor` operation, for an execution in which the tail had 15 bytes. What follows is a simplified execution that shows the behavior for `k1` masks of 0x1 and 0x2.

k1 == 0x1:

(gdb) x/2i $pc
=> 0x7fffe4730bc6:	vpxorq (%rdi,%r12,1),%xmm0,%xmm0{%k1}
   0x7fffe4730bcd:	test   $0x8,%r8b
(gdb) print/x $xmm0
$21 = {
  v8_bfloat16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
  v4_float = {0x0, 0x0, 0x0, 0x0},
  v2_double = {0x0, 0x0},
  v16_int8 = {0x0 <repeats 16 times>},
  v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
  v4_int32 = {0x0, 0x0, 0x0, 0x0},
  v2_int64 = {0x0, 0x0},
  uint128 = 0x0
}
(gdb) print/x $k1
$22 = 0x1
(gdb) x/16xb 0x45f33ddc8
0x45f33ddc8:	0x80	0x80	0x80	0x80	0x80	0x80	0x80	0x80
0x45f33ddd0:	0x80	0x80	0x80	0x80	0x80	0x80	0x80	0xbd
(gdb) si
0x00007fffe4730bcd in ?? ()
(gdb) print/x $xmm0
$23 = {
  v8_bfloat16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
  v4_float = {0x0, 0x0, 0x0, 0x0},
  v2_double = {0x0, 0x0},
  v16_int8 = {0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
  v8_int16 = {0x8080, 0x8080, 0x8080, 0x8080, 0x0, 0x0, 0x0, 0x0},
  v4_int32 = {0x80808080, 0x80808080, 0x0, 0x0},
  v2_int64 = {0x8080808080808080, 0x0},
  uint128 = 0x8080808080808080
}

A mask of 0x1 permitted the write of the lower 64 bits of the `xmm0` register. This corresponds to the first 8 bytes in memory (little endian).

k1 == 0x2:

(gdb) x/2i $pc
=> 0x7fffe4730bc6:	vpxorq (%rdi,%r12,1),%xmm0,%xmm0{%k1}
   0x7fffe4730bcd:	test   $0x8,%r8b
(gdb) print/x $xmm0
$18 = {
  v8_bfloat16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
  v4_float = {0x0, 0x0, 0x0, 0x0},
  v2_double = {0x0, 0x0},
  v16_int8 = {0x0 <repeats 16 times>},
  v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
  v4_int32 = {0x0, 0x0, 0x0, 0x0},
  v2_int64 = {0x0, 0x0},
  uint128 = 0x0
}
(gdb) print/x $k1
$19 = 0x2
(gdb) x/16xb 0x45f33ddc8
0x45f33ddc8:	0x80	0x80	0x80	0x80	0x80	0x80	0x80	0x80
0x45f33ddd0:	0x80	0x80	0x80	0x80	0x80	0x80	0x80	0xbd
(gdb) si
0x00007fffe4730bcd in ?? ()
(gdb) print/x $xmm0
$20 = {
  v8_bfloat16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
  v4_float = {0x0, 0x0, 0x0, 0x0},
  v2_double = {0x0, 0x0},
  v16_int8 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0xbd},
  v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x8080, 0x8080, 0x8080, 0xbd80},
  v4_int32 = {0x0, 0x0, 0x80808080, 0xbd808080},
  v2_int64 = {0x0, 0xbd80808080808080},
  uint128 = 0xbd808080808080800000000000000000
}

A mask of 0x2 permitted the write of the higher 64 bits of the `xmm0` register. This corresponds to the second 8 bytes in memory (little endian). Note: we are not sure if the microarchitecture reads the first 8 bytes and if this might trigger a segmentation fault —all we know is that they are not written to the register—, but in any case it is useless for our tail processing purposes because there aren't gaps in the input.

Finally, @franferrax found an error in the `VPXORQ` pseudo-code documentation of the [Intel® 64 and IA-32 Architectures Software Developer’s Manual, Vol. 2B 4-527](https://software.intel.com/en-us/download/intel-64-and-ia-32-architectures-sdm-combined-volumes-1-2a-2b-2c-2d-3a-3b-3c-3d-and-4): the operation when the mask applies should be `DEST[i+63:i] := 0` instead of `DEST[63:0] := 0`. Otherwise, the mask would apply to the lower bits irrespective of its value. This observation applies to both merging and zeroing masks.

If there are no further objections, our intention is to integrate the original fix. What do you think, @theRealAph ?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18849#issuecomment-2073379174