RFR: 8272493: Suboptimal code generation around Preconditions.checkIndex intrinsic with AVX2

Thu Mar 10 08:22:41 UTC 2022

On Thu, 10 Mar 2022 07:55:16 GMT, Yi Yang <yyang at openjdk.org> wrote:

> 8272493 reports a minor regression when using Preconditions.checkIndex in String.checkIndex. The cause is that some unnecessary vzeroupper instructions are emitted. These vzerouppers were introduced in [JDK-8190934](https://bugs.openjdk.java.net/browse/JDK-8190934) and are emitted because inline_preconditions_checkIndex sets clear_upper_avx. I did some digging into the history of this code. Please correct me if I misunderstand something.
> 
> [JDK-8178811](https://bugs.openjdk.java.net/browse/JDK-8178811) emits vzeroupper on every MachEpilogNode to avoid the AVX <-> SSE transition penalty during the call.
> 
> [JDK-8190934](https://bugs.openjdk.java.net/browse/JDK-8190934) emits vzeroupper only on those MachEpilogNodes for which the clear_upper_avx flag is set. vzeroupper is itself a costly instruction, so we don't want to emit it at every function return.
> 
> In [JDK-8272493](https://bugs.openjdk.java.net/browse/JDK-8272493), vzeroupper is emitted because inline_preconditions_checkIndex sets the clear_upper_avx flag.
> 
> Microbenchmark results are as follows:
> 
> -------Preconditions.checkIndex without clear_upper_avx
> Benchmark                    Mode  Cnt  Score   Error  Units
> StringBuilders.charAtLatin1  avgt   15  6.257 ± 0.011  ns/op
> StringBuilders.charAtLatin1  avgt   15  6.251 ± 0.008  ns/op
> StringBuilders.charAtLatin1  avgt   15  6.254 ± 0.003  ns/op
> 
> -------Preconditions.checkIndex with clear_upper_avx (current implementation)
> Benchmark                    Mode  Cnt  Score   Error  Units
> StringBuilders.charAtLatin1  avgt   15  6.421 ± 0.003  ns/op
> StringBuilders.charAtLatin1  avgt   15  6.419 ± 0.002  ns/op
> StringBuilders.charAtLatin1  avgt   15  6.433 ± 0.044  ns/op
> 
> ------- -XX:DisableIntrinsic=_Preconditions_checkIndex
> Benchmark                    Mode  Cnt  Score   Error  Units
> StringBuilders.charAtLatin1  avgt   15  6.229 ± 0.018  ns/op
> StringBuilders.charAtLatin1  avgt   15  6.224 ± 0.006  ns/op
> StringBuilders.charAtLatin1  avgt   15  6.218 ± 0.011  ns/op
> 
> ------- -XX:UseAVX=1
> Benchmark                    Mode  Cnt  Score   Error  Units
> StringBuilders.charAtLatin1  avgt   15  6.247 ± 0.022  ns/op
> StringBuilders.charAtLatin1  avgt   15  6.234 ± 0.018  ns/op
> StringBuilders.charAtLatin1  avgt   15  6.261 ± 0.042  ns/op
> 
> As I understand it, inline_preconditions_checkIndex only does a simple range check, with no xmm (SSE) / ymm (AVX) registers involved, so I propose removing the clear_upper_avx flag to avoid emitting vzeroupper for this intrinsic.
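
For readers following along, here is a minimal sketch of the kind of range check being discussed, written in plain Java. The class name `CheckIndexSketch` and the method `checkIndexSketch` are made up for illustration; this is neither the JDK source nor the C2 intrinsic, only the same check spelled out to show that it needs nothing but integer comparisons.

```java
import java.util.Objects;

public class CheckIndexSketch {
    // Hypothetical stand-in for the range check: the index is valid
    // iff 0 <= index < length. Only int comparisons are involved,
    // so no xmm/ymm register is needed for the check itself.
    static int checkIndexSketch(int index, int length) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException(
                "Index " + index + " out of bounds for length " + length);
        }
        return index;
    }

    public static void main(String[] args) {
        // Objects.checkIndex is the public API for this kind of check.
        System.out.println(Objects.checkIndex(3, 10));
        System.out.println(checkIndexSketch(3, 10));
        try {
            checkIndexSketch(10, 10); // index == length is out of range
        } catch (IndexOutOfBoundsException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```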

IMO HotSpot seems to be too conservative about generating `vzeroupper`, and yet it does not cover all the cases. Given that the assembler itself doesn't emit legacy SSE code on AVX machines, this instruction could be emitted only on transitions to native/VM code (and maybe the interpreter?). The current state generates `vzeroupper` on every function return and function call if 256-bit vectors are involved, which is less than optimal. On the other hand, `clear_upper_avx` only emits `vzeroupper` on AVX2, even though we clearly have 256-bit vectors on AVX1 as well.
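
To make the suggestion a bit more concrete, here is a rough sketch of the policy I have in mind, written as a small Java model. Everything in it is hypothetical (the `Target` enum, the `needsVzeroupper` method, and the `INTERPRETER_RUNS_LEGACY_SSE` switch are names invented for this sketch); it is not HotSpot code and does not describe what HotSpot does today.

```java
public class VzeroupperPolicySketch {
    // Hypothetical kinds of control transfer out of compiled Java code.
    enum Target { COMPILED_JAVA, INTERPRETER, NATIVE_OR_VM }

    // Open question from the discussion: should the interpreter be treated
    // like native/VM code? Assumed "no" here, purely for illustration.
    static final boolean INTERPRETER_RUNS_LEGACY_SSE = false;

    // Suggested rule (an assumption, not current behaviour): emit vzeroupper
    // only when leaving for code that may execute legacy SSE instructions,
    // and only if 256-bit state may be dirty. AVX1 already has 256-bit
    // vectors, so the threshold is UseAVX >= 1 rather than UseAVX >= 2.
    static boolean needsVzeroupper(int useAVX, boolean upperHalvesMayBeDirty, Target target) {
        if (useAVX < 1 || !upperHalvesMayBeDirty) {
            return false;
        }
        switch (target) {
            case NATIVE_OR_VM: return true;
            case INTERPRETER:  return INTERPRETER_RUNS_LEGACY_SSE;
            default:           return false; // compiled Java uses VEX encodings
        }
    }

    public static void main(String[] args) {
        System.out.println(needsVzeroupper(2, true, Target.COMPILED_JAVA));
        System.out.println(needsVzeroupper(1, true, Target.NATIVE_OR_VM));
        System.out.println(needsVzeroupper(2, false, Target.NATIVE_OR_VM));
    }
}
```

Under this model, a Java-to-Java call or return never pays for `vzeroupper`, which is the case the benchmark above is hitting.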

Please correct me if I am missing something important here, thanks.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7770