AVX 256 instructions in JDK7?

Tue Mar 26 12:33:17 PDT 2013

Hi Ondrej,

We need more context from you.

First, which Java version are you experimenting with?

Currently, use of wide AVX vectors is only available in the early-access
(EA) builds of JDK 8. 7u4 added only the use of the VEX prefix.
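
A quick way to collect that information is a small program like the one
below. It is only a sketch: it assumes a HotSpot JVM that provides
com.sun.management.HotSpotDiagnosticMXBean and an x86 build that exposes
the UseAVX flag. The class name AvxFlagCheck and the "n/a" fallback are
made up for illustration.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class AvxFlagCheck {
    // Returns the current value of a HotSpot VM option, or "n/a" if this
    // JVM does not know the option (for example on a non-x86 build).
    public static String vmOption(String name) {
        try {
            HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            return bean.getVMOption(name).getValue();
        } catch (RuntimeException e) {
            // Unknown option or diagnostic bean not available.
            return "n/a";
        }
    }

    public static void main(String[] args) {
        System.out.println("java.version    = " + System.getProperty("java.version"));
        System.out.println("java.vm.version = " + System.getProperty("java.vm.version"));
        System.out.println("os.arch         = " + System.getProperty("os.arch"));
        System.out.println("UseAVX          = " + vmOption("UseAVX"));
    }
}
```

Running it with the same java binary and the same command-line flags as
the test application shows which release and which AVX level are
actually in use.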

The "anonymous memory" is most likely our CodeCache, where the JIT places
compiled native code.

Could you clarify this statement?

 >> The origin of almost all bad transitions from AVX -> SSE (I mean the
 >> code that uses AVX 256 instructions and does not call VZEROUPPER) is
 >> somewhere inside anonymous memory blocks (according to Intel's SDE
 >> and pmem).

Do you have call stacks?
Does the problem happen only in this code, or also during calls into Libav?

 >> to the same libraries with AVX disabled. The problem is definitely in
 >> "bad" transitions between SSE and AVX instructions. These transitions
 >> are costly in case the upper part of YMM registers is not zeroed using
 >> VZEROUPPER or VZEROALL instruction before using SSE. More details at
 >> http://agner.org/optimize/optimizing_assembly.pdf (section 13.6).

As I understand it, this is only a problem if you mix SSE and AVX
instructions in the same loop. Our JIT should not do that: it generates
only AVX instructions (with the VEX prefix) if AVX is available.
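
To make the A/B/C states from section 13.6 of Agner's manual concrete,
here is a deliberately simplified model of the transitions. It is a
sketch under assumptions, not a description of any particular CPU: the
instruction categories, the class name AvxSseTransitionModel, and the
rule that every B -> C and C -> B switch counts as one penalty are
illustrative choices.

```java
public class AvxSseTransitionModel {
    // A: upper YMM halves known to be zero
    // B: upper halves in use (dirty)
    // C: upper halves saved aside while legacy SSE code runs
    public enum State { A, B, C }

    // Coarse instruction categories (hypothetical, for illustration only).
    public enum Insn { LEGACY_SSE, VEX_128, VEX_256, VZEROUPPER }

    // Counts the penalized B -> C and C -> B transitions in a stream.
    public static int countPenalties(Insn[] stream) {
        State state = State.A;
        int penalties = 0;
        for (Insn insn : stream) {
            switch (state) {
                case A:
                    // Only a 256-bit instruction dirties the upper halves.
                    if (insn == Insn.VEX_256) state = State.B;
                    break;
                case B:
                    if (insn == Insn.LEGACY_SSE) { state = State.C; penalties++; }
                    else if (insn == Insn.VZEROUPPER) state = State.A;
                    break;
                case C:
                    if (insn == Insn.VEX_128 || insn == Insn.VEX_256) {
                        state = State.B;
                        penalties++;
                    } else if (insn == Insn.VZEROUPPER) {
                        state = State.A;
                    }
                    break;
            }
        }
        return penalties;
    }

    public static void main(String[] args) {
        Insn[] dirty = { Insn.VEX_256, Insn.LEGACY_SSE, Insn.VEX_128, Insn.LEGACY_SSE };
        Insn[] clean = { Insn.VEX_256, Insn.VZEROUPPER,
                         Insn.LEGACY_SSE, Insn.VEX_128, Insn.LEGACY_SSE };
        System.out.println("without VZEROUPPER: " + countPenalties(dirty));
        System.out.println("with VZEROUPPER:    " + countPenalties(clean));
    }
}
```

In this model a stream that interleaves legacy SSE and VEX instructions
pays a penalty on every switch while the upper halves are dirty, and
none once VZEROUPPER has returned it to state A.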

Regards,
Vladimir

On 3/26/13 11:34 AM, Christian Thalinger wrote:
> [That question should go to hotspot-compiler-dev.  BCC'ed
> hotspot-runtime-dev.]
>
> On Mar 26, 2013, at 8:20 AM, Ondřej Perutka
> <perutka.ondrej at gmail.com> wrote:
>
>> Hello,
>> I've got a question related to my project. It is a Java wrapper for
>> Libav libraries and I have some performance issues with it.
>>
>> If I compile the libraries with AVX instructions enabled the whole
>> testing application uses approximately 130% of CPU time in comparison
>> to the same libraries with AVX disabled. The problem is definitely in
>> "bad" transitions between SSE and AVX instructions. These transitions
>> are costly in case the upper part of YMM registers is not zeroed using
>> VZEROUPPER or VZEROALL instruction before using SSE. More details at
>> http://agner.org/optimize/optimizing_assembly.pdf (section 13.6).
>>
>> There is no problem with those libraries, if they are not used from
>> Java. I used Intel's Software Developer Emulator to find those bad AVX
>> <-> SSE transitions and I found thousands of them. The origin of
>> almost all bad transitions from AVX -> SSE (I mean the code that uses
>> AVX 256 instructions and does not call VZEROUPPER) is somewhere inside
>> anonymous memory blocks (according to Intel's SDE and pmem).
>>
>> Libav mixes SSE and AVX 128 instructions a lot. It cannot cause any
>> trouble if the upper part of YMM registers is zeroed. But in case it
>> is not zeroed it would oscillate between B and C states (according to
>> Agner's terminology). Both of these transitions cost quite a lot
>> of CPU cycles.
>>
>> So here is my question: Is it possible that JIT compiler compiles some
>> bytecode into native instructions, uses some AVX 256 instructions,
>> does not use VZEROUPPER and puts the result into some anonymous memory
>> block?
>>
>> Ondrej Perutka
>