RFR: 8272493: Suboptimal code generation around Preconditions.checkIndex intrinsic with AVX2
Quan Anh Mai
duke at openjdk.java.net
Thu Mar 10 08:22:41 UTC 2022
On Thu, 10 Mar 2022 07:55:16 GMT, Yi Yang <yyang at openjdk.org> wrote:
> 8272493 reports a minor regression when using Preconditions.checkIndex in String.checkIndex. The reason is that some unnecessary vzeroupper instructions are emitted. These vzerouppers were introduced in [JDK-8190934](https://bugs.openjdk.java.net/browse/JDK-8190934) and are emitted via clear_upper_avx within inline_preconditions_checkIndex. I did some digging into the history of this code. Please correct me if I misunderstand something.
>
> [JDK-8178811](https://bugs.openjdk.java.net/browse/JDK-8178811) emits vzeroupper on every MachEpilogueNode to avoid AVX <-> SSE transition penalty during the call.
>
> [JDK-8190934](https://bugs.openjdk.java.net/browse/JDK-8190934) emits vzeroupper on some MachEpilogueNode by setting the clear_upper_avx flag. Because vzeroupper itself is a high-cost instruction, we don't want to emit it everywhere a function returns.
>
> [JDK-8272493](https://bugs.openjdk.java.net/browse/JDK-8272493) emits vzeroupper because inline_preconditions_checkIndex sets clear_upper_avx flag.
>
> Microbenchmark results are as follows:
>
> -------Preconditions.checkIndex without clear_upper_avx
> Benchmark Mode Cnt Score Error Units
> StringBuilders.charAtLatin1 avgt 15 6.257 ± 0.011 ns/op
>
> Benchmark Mode Cnt Score Error Units
> StringBuilders.charAtLatin1 avgt 15 6.251 ± 0.008 ns/op
>
> Benchmark Mode Cnt Score Error Units
> StringBuilders.charAtLatin1 avgt 15 6.254 ± 0.003 ns/op
>
> -------Preconditions.checkIndex with clear_upper_avx(Current Implementation)
> Benchmark Mode Cnt Score Error Units
> StringBuilders.charAtLatin1 avgt 15 6.421 ± 0.003 ns/op
>
> Benchmark Mode Cnt Score Error Units
> StringBuilders.charAtLatin1 avgt 15 6.419 ± 0.002 ns/op
>
> Benchmark Mode Cnt Score Error Units
> StringBuilders.charAtLatin1 avgt 15 6.433 ± 0.044 ns/op
>
> ------- -XX:DisableIntrinsic=_Preconditions_checkIndex
> Benchmark Mode Cnt Score Error Units
> StringBuilders.charAtLatin1 avgt 15 6.229 ± 0.018 ns/op
>
> Benchmark Mode Cnt Score Error Units
> StringBuilders.charAtLatin1 avgt 15 6.224 ± 0.006 ns/op
>
> Benchmark Mode Cnt Score Error Units
> StringBuilders.charAtLatin1 avgt 15 6.218 ± 0.011 ns/op
>
> ------- -XX:UseAVX=1
> Benchmark Mode Cnt Score Error Units
> StringBuilders.charAtLatin1 avgt 15 6.247 ± 0.022 ns/op
>
> Benchmark Mode Cnt Score Error Units
> StringBuilders.charAtLatin1 avgt 15 6.234 ± 0.018 ns/op
>
> Benchmark Mode Cnt Score Error Units
> StringBuilders.charAtLatin1 avgt 15 6.261 ± 0.042 ns/op
>
> As I understand, inline_preconditions_checkIndex only does a simple range check; no xmm (SSE) or ymm (AVX)
> registers are involved, so I propose removing the clear_upper_avx flag to avoid emitting vzeroupper for this intrinsic.
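For readers following along, here is a minimal sketch of the kind of range check being discussed. It uses the public `Objects.checkIndex`, which delegates to `Preconditions.checkIndex`; the class `CheckIndexSketch` and the helper `charAtLatin1` are hypothetical names for illustration only and are not the JDK's or the benchmark's actual code. The point is only that the check is a scalar integer comparison followed by a single byte load.

```java
import java.nio.charset.StandardCharsets;
import java.util.Objects;

public class CheckIndexSketch {
    // Hypothetical stand-in for a Latin-1 charAt: bounds-check, then read one byte.
    static char charAtLatin1(byte[] value, int index) {
        // Throws IndexOutOfBoundsException when index < 0 or index >= value.length.
        Objects.checkIndex(index, value.length);
        return (char) (value[index] & 0xff);
    }

    public static void main(String[] args) {
        byte[] value = "abc".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(charAtLatin1(value, 1));
        try {
            charAtLatin1(value, 3);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("index 3 rejected");
        }
    }
}
```

Nothing in this source-level path uses vector types; whether the compiled code touches xmm/ymm registers is up to the JIT, which is what the question about vzeroupper is about.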
IMO HotSpot is too conservative about generating `vzeroupper`, and yet it does not cover all the cases. Given that the assembler itself doesn't emit legacy SSE code on AVX machines, this instruction could be emitted only on transitions to native/VM code (and maybe the interpreter?). Currently, `vzeroupper` is generated on every function return and function call if 256-bit vectors are involved, which is less than optimal. On the other hand, `clear_upper_avx` only emits `vzeroupper` on AVX2, even though we clearly have 256-bit vectors on AVX1 as well?
Please correct me if I am missing something important here, thanks.
-------------
PR: https://git.openjdk.java.net/jdk/pull/7770