AVX 256 instructions in JDK7?
Ondřej Perutka
perutka.ondrej at gmail.com
Thu Mar 28 08:51:25 PDT 2013
Excellent, the patch works!
However, I got a compilation error. (I applied the patch to the
OpenJDK version jdk7u9-b05.) It said the emit_int8() method was undefined,
so I replaced it with the emit_byte() method. I hope it has the same
functionality. Attached is the updated patch.
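
For reference, a minimal sketch of what a one-byte emitter has to do for
this instruction, written in Python purely for illustration. The CodeBuffer
class and both function names here are hypothetical stand-ins, not HotSpot's
assembler; the only fact taken as given is that VZEROUPPER encodes as the
three bytes C5 F8 77.

```python
class CodeBuffer:
    """Hypothetical stand-in for an assembler's output buffer."""

    def __init__(self):
        self.code = bytearray()

    def emit_byte(self, value):
        # Append exactly one byte; reject values that do not fit in 8 bits.
        if not 0 <= value <= 0xFF:
            raise ValueError("not a byte: %r" % (value,))
        self.code.append(value)


def vzeroupper(buf):
    # VZEROUPPER: two-byte VEX prefix (C5 F8) followed by the opcode 77.
    for b in (0xC5, 0xF8, 0x77):
        buf.emit_byte(b)


buf = CodeBuffer()
vzeroupper(buf)
print(buf.code.hex())  # c5f877
```

As long as the replacement method appends a single unmodified byte per call,
the emitted sequence is the same regardless of what the method is called.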
I also think the doc of the vzeroupper() method is a bit confusing, because
the VEX-prefixed SSE instructions leave the upper part of YMM registers
untouched. That's why it's problematic if they are combined with legacy SSE
instructions while the upper part of some YMM register is non-zero. See the
YMM transition table in section 13.6 of Agner's publication:
http://agner.org/optimize/optimizing_assembly.pdf
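
To make the effect concrete, here is a minimal sketch of the state
transitions discussed in this thread, written in Python purely for
illustration. It assumes a simplified three-state model (A: upper halves
clean, B: upper halves in use, C: upper halves saved after a legacy SSE
instruction) and counts only the B <-> C transitions as costly. The function
name and the instruction labels are hypothetical, and real hardware behaves
in more detail than this model captures.

```python
def count_costly_transitions(instructions, state="A"):
    """Return (number of costly B<->C transitions, final state)."""
    costly = 0
    for insn in instructions:
        if insn == "vzeroupper":
            # Assumption: vzeroupper always returns to the clean state A.
            state = "A"
        elif insn == "avx256":
            # Leaving C for any VEX instruction is a costly transition.
            if state == "C":
                costly += 1
            state = "B"
        elif insn == "avx128":
            # In A or B the state is unchanged; only C -> B is costly.
            if state == "C":
                costly += 1
                state = "B"
        elif insn == "sse":
            # Legacy SSE with dirty upper halves forces B -> C.
            if state == "B":
                costly += 1
                state = "C"
        else:
            raise ValueError("unknown instruction label: %r" % (insn,))
    return costly, state


# A callee that mixes SSE and 128-bit AVX instructions.
callee = ["avx128", "sse", "avx128", "sse", "avx128"]

bad = count_costly_transitions(["avx256"] + callee)
good = count_costly_transitions(["avx256", "vzeroupper"] + callee)
print(bad)   # (4, 'B')
print(good)  # (0, 'A')
```

In this model a single 256-bit instruction without a following vzeroupper
makes every later SSE / AVX 128 alternation costly, while inserting
vzeroupper before the call removes all of them.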
Regards,
Ondrej
2013/3/28 Vladimir Kozlov <vladimir.kozlov at oracle.com>
> Ondrej, thank you for the detailed report.
>
> So it happens because of a call into a native library.
> I am surprised that a call which is executed only once is causing this.
>
> Question: can you build Hotspot VM if I give you changes to test?
>
> There is a place in the VM where it does some resetting when calling
> native methods. Executing vzeroupper at that place may fix your problem.
>
> I attached a patch which you can try.
>
> Regards,
> Vladimir
>
>
> On 3/27/13 1:54 PM, Ondřej Perutka wrote:
>
>> Hello again,
>> I don't want to speak too soon but I think I found the problem. This
>> mail is a bit longer, so first of all let me answer your questions (feel
>> free to skip this part :-)).
>>
>> First, what Java version are you experimenting with?
>>
>>
>> java version "1.7.0_09-icedtea"
>> OpenJDK Runtime Environment (fedora-2.3.8.0.fc17-x86_64)
>> OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)
>>
>>
>> Could you clarify your statement?:
>>
>>
>> Attached is the output from Intel's Software Development Emulator and
>> the memory map of the emulated application. The SDE output lists the
>> time-consuming transitions between AVX and SSE instructions.
>> The important lines are those with a non-zero number of "Dynamic AVX to
>> SSE Transitions". This means a transition from state B to state C of the
>> YMM registers (using Agner's terminology). Each line has the address of
>> the "State Change Block". In this case it should be the address of the
>> block which put the registers into state B (and did not call vzeroupper
>> to put them into state A). Almost all of those addresses point
>> somewhere into anonymous memory.
>>
>>
>> Do you have call stacks?
>>
>>
>> Unfortunately I cannot get call stacks because it's almost impossible to
>> debug a JVM running within the SDE (it's terribly slow, it takes several
>> minutes even to start my application). And without the SDE I cannot get
>> those "State Change Block" addresses pointing to the anonymous memory, so
>> I don't know where to put breakpoints.
>>
>>
>> Does the problem happen only in this code or during calls into
>> Libav?
>>
>> I'm not sure I understand your question, so I hope the answer is
>> satisfactory :-) : I have not checked any other code, but I'm pretty sure
>> it would cause trouble in any other call to native code which mixes
>> SSE and AVX 128 instructions.
>>
>>
>> As I understand, it is only a problem if you mix SSE and AVX
>> instructions in the same loop. Our JIT should not do that. It
>> generates only AVX instructions (with VEX prefix) if AVX is available.
>>
>>
>> It's not only a problem of some loop; it causes problems for the whole
>> thread. (The OS is responsible for clearing the upper half of YMM
>> registers on context switching, so it should not affect other threads.)
>> The whole problem starts if there is some VEX-prefixed 256-bit
>> instruction not followed by the VZEROUPPER instruction. It leaves the
>> upper half of some YMM register in use. And if it is followed by some
>> block of SSE instructions, or even worse by some block which mixes SSE
>> and 128-bit AVX instructions, it will cause time-consuming transition(s)
>> between states of the YMM registers.
>>
>> Here is a problematic example:
>> ---------------------------------------------
>>
>> ...
>> # some 256 bit AVX instruction:
>> AVX 256
>> ...
>> # call to a function which mixes SSE and 128 bit AVX instructions:
>> AVX 128
>> SSE
>> AVX 128
>> SSE
>> AVX 128
>> ...
>>
>> And here is a correct example:
>> ---------------------------------------------
>>
>> ...
>> # some 256 bit AVX instruction:
>> AVX 256
>> ...
>> vzeroupper
>> ...
>> # call to a function which mixes SSE and 128 bit AVX instructions:
>> AVX 128
>> SSE
>> AVX 128
>> SSE
>> AVX 128
>> ...
>>
>> ###############################
>>
>> OK, that's all about your e-mail, now about my findings. Thanks to Peter
>> Levart, who wrote to me that it's possible to disable AVX instructions
>> within the JVM using the -XX:UseAVX=0 command line option, I was able to
>> filter the noise out of the SDE output. What remained was the problematic
>> spot. It is the _dl_x86_64_save_sse function from the ld-2.15.so
>> library. This function is called (indirectly) from the
>> Java_sun_font_NativeFont_fontExists function, which is part of the
>> libfontmanager.so library. I believe it's part of OpenJDK since it's
>> located in my OpenJDK directory tree. Here is the call stack:
>>
>> #0 0x0000003579214e80 in _dl_x86_64_save_sse () from /lib64/ld-linux-x86-64.so.2
>> #1 0x000000357920ad09 in _dl_lookup_symbol_x () from /lib64/ld-linux-x86-64.so.2
>> #2 0x000000357920e2d4 in _dl_fixup () from /lib64/ld-linux-x86-64.so.2
>> #3 0x00000035792148e5 in _dl_runtime_resolve () from /lib64/ld-linux-x86-64.so.2
>> #4 0x00007fffdea841a8 in Java_sun_font_NativeFont_fontExists () from /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.9.x86_64/jre/lib/amd64/libfontmanager.so
>> #5 0x00007fffed011f90 in ?? ()
>> #6 0x00007fffed005410 in ?? ()
>> ...
>>
>> The Java_sun_font_NativeFont_fontExists function was called only once,
>> right after my application started.
>>
>> To prove it's the cause of all the problems, I made a little hack to
>> libfontmanager.so and added the vzeroupper instruction opcode before the
>> return from Java_sun_font_NativeFont_fontExists. (I also removed the
>> corresponding nop padding from the end of the function.) Everything
>> has worked fine since then, even with AVX instructions enabled. It seems
>> this forgotten call literally caused a chain reaction. Attached is the
>> output from the emulator made after this hack. You can compare it with
>> the previous one. (There are still some conflicts because vzeroupper is
>> not called immediately after the return from the _dl_runtime_resolve
>> function.)
>>
>> I don't know whether the _dl_runtime_resolve function called directly
>> from the Java_sun_font_NativeFont_fontExists function is part of ld's
>> public API or not, so I don't know whether the bug is on the side of
>> OpenJDK or binutils.
>>
>> Regards,
>> Ondrej
>>
>>
>> 2013/3/26 Vladimir Kozlov <vladimir.kozlov at oracle.com>
>>
>>
>> Hi Ondrej,
>>
>> We need more context from you.
>>
>> First, what Java version are you experimenting with?
>>
>> Currently, use of wide AVX vectors is only available in EA jdk8.
>> In 7u4, only usage of the VEX prefix was added.
>>
>> The "anonymous memory" is most likely our CodeCache where JIT places
>> compiled native code.
>>
>> Could you clarify your statement?:
>>
>>
>> >> The origin of almost all bad transitions from AVX -> SSE (I mean
>> >> the code that uses AVX 256 instructions and does not call
>> >> VZEROUPPER) is somewhere inside anonymous memory blocks
>> >> (according to Intel's SDE and pmem).
>>
>> Do you have call stacks?
>> Does the problem happen only in this code or during calls into
>> Libav?
>>
>>
>> >> to the same libraries with AVX disabled. The problem is definitely in
>> >> "bad" transitions between SSE and AVX instructions. These transitions
>> >> are costly in case the upper part of YMM registers is not zeroed using
>> >> VZEROUPPER or VZEROALL instruction before using SSE. More details at
>> >> http://agner.org/optimize/optimizing_assembly.pdf (section 13.6).
>>
>> As I understand, it is only a problem if you mix SSE and AVX
>> instructions in the same loop. Our JIT should not do that. It
>> generates only AVX instructions (with VEX prefix) if AVX is available.
>>
>> Regards,
>> Vladimir
>>
>>
>>
>> On 3/26/13 11:34 AM, Christian Thalinger wrote:
>>
>> [That question should go to hotspot-compiler-dev. BCC'ed
>> hotspot-runtime-dev.]
>>
>> On Mar 26, 2013, at 8:20 AM, Ondřej Perutka
>> <perutka.ondrej at gmail.com> wrote:
>>
>> Hello,
>> I've got a question related to my project. It is a Java wrapper for
>> Libav libraries and I have some performance issues with it.
>>
>> If I compile the libraries with AVX instructions enabled, the whole
>> testing application uses approximately 130% of the CPU time in comparison
>> to the same libraries with AVX disabled. The problem is definitely in
>> "bad" transitions between SSE and AVX instructions. These transitions
>> are costly in case the upper part of YMM registers is not zeroed using
>> the VZEROUPPER or VZEROALL instruction before using SSE. More details at
>> http://agner.org/optimize/optimizing_assembly.pdf (section 13.6).
>>
>> There is no problem with those libraries if they are not used from
>> Java. I used Intel's Software Development Emulator to find those bad
>> AVX <-> SSE transitions and I found thousands of them. The origin of
>> almost all bad transitions from AVX -> SSE (I mean the code that uses
>> AVX 256 instructions and does not call VZEROUPPER) is somewhere inside
>> anonymous memory blocks (according to Intel's SDE and pmem).
>>
>> Libav mixes SSE and AVX 128 instructions a lot. It cannot cause any
>> trouble if the upper part of YMM registers is zeroed. But in case it
>> is not zeroed, it would oscillate between the B and C states (according
>> to Agner's terminology). Both of these transitions cost quite a lot
>> of CPU cycles.
>>
>> So here is my question: Is it possible that the JIT compiler compiles
>> some bytecode into native instructions, uses some AVX 256 instructions,
>> does not use VZEROUPPER and puts the result into some anonymous memory
>> block?
>>
>> Ondrej Perutka
>>
>>
>>
>>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hsx23.patch
Type: application/octet-stream
Size: 2455 bytes
Desc: not available
Url : http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/attachments/20130328/42c824aa/hsx23-0001.patch