AVX 256 instructions in JDK7?

Tue Mar 26 08:20:42 PDT 2013

Hello,
I've got a question related to my project. It is a Java wrapper for the
Libav libraries, and I have some performance issues with it.

If I compile the libraries with AVX instructions enabled, the whole test
application uses approximately 130% of the CPU time it needs with the same
libraries built with AVX disabled. The problem is definitely in "bad"
transitions between SSE and AVX instructions. These transitions are costly
when the upper halves of the YMM registers are not zeroed with a VZEROUPPER
or VZEROALL instruction before SSE code runs. More details at
http://agner.org/optimize/optimizing_assembly.pdf (section 13.6).

There is no problem with those libraries when they are not used from Java.
I used Intel's Software Development Emulator (SDE) to find those bad
AVX <-> SSE transitions, and I found thousands of them. The origin of almost
all bad AVX -> SSE transitions (I mean the code that uses AVX 256
instructions and does not call VZEROUPPER) is somewhere inside anonymous
memory blocks (according to Intel's SDE and pmem).

Libav mixes SSE and AVX 128 instructions a lot. This causes no trouble as
long as the upper halves of the YMM registers are zeroed. But if they are
not zeroed, execution oscillates between states B and C (in Agner's
terminology). Both of these transitions cost quite a lot of CPU cycles.
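
To illustrate the oscillation I mean, here is a minimal Java sketch of the
three states. It is only a toy model: the class, enum and method names are
made up for this example, the transition rules are my simplified reading of
section 13.6 (in particular, I assume VZEROUPPER returns to state A from
state C as well as from state B), and it counts costly transitions without
trying to model cycle counts.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only: all names and transition rules here are assumptions
// made for illustration, not an exact description of the hardware.
final class YmmStateModel {
    // A: upper halves known to be zero.
    // B: upper halves in use (after a 256-bit AVX instruction).
    // C: upper halves saved (SSE code running while they were in use).
    enum State { A_CLEAN, B_DIRTY, C_SAVED }

    enum Op { SSE, AVX128, AVX256, VZEROUPPER }

    // Returns the state reached after executing one instruction of kind op.
    static State next(State s, Op op) {
        if (op == Op.VZEROUPPER) {
            return State.A_CLEAN;            // assumed to work from B and from C
        }
        if (op == Op.SSE) {
            return s == State.B_DIRTY ? State.C_SAVED : s;
        }
        if (op == Op.AVX256) {
            return State.B_DIRTY;            // writes the upper halves
        }
        // AVX128: only changes state when the upper halves were saved.
        return s == State.C_SAVED ? State.B_DIRTY : s;
    }

    // Only B -> C and C -> B are treated as expensive in this model.
    static boolean isCostly(State from, State to) {
        return (from == State.B_DIRTY && to == State.C_SAVED)
            || (from == State.C_SAVED && to == State.B_DIRTY);
    }

    // Counts the costly transitions in an instruction sequence.
    static int countPenalties(State start, List<Op> ops) {
        int penalties = 0;
        State s = start;
        for (Op op : ops) {
            State t = next(s, op);
            if (isCostly(s, t)) {
                penalties++;
            }
            s = t;
        }
        return penalties;
    }

    public static void main(String[] args) {
        // The caller left the upper halves dirty, then mixed SSE/AVX 128 runs.
        List<Op> mixed = Arrays.asList(Op.SSE, Op.AVX128, Op.SSE, Op.AVX128);
        System.out.println(countPenalties(State.B_DIRTY, mixed));   // prints 4

        // Same code, but VZEROUPPER is executed first.
        List<Op> zeroed = Arrays.asList(
            Op.VZEROUPPER, Op.SSE, Op.AVX128, Op.SSE, Op.AVX128);
        System.out.println(countPenalties(State.B_DIRTY, zeroed));  // prints 0
    }
}
```

In this model every switch between SSE and AVX 128 is penalized once the
sequence starts in state B, while a single VZEROUPPER up front removes all
of the penalties.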

So here is my question: Is it possible that the JIT compiler compiles some
bytecode into native code that uses some AVX 256 instructions, does not
use VZEROUPPER, and puts the result into an anonymous memory block?

Ondrej Perutka