AVX 256 instructions in JDK7?
Vladimir Kozlov
vladimir.kozlov at oracle.com
Wed Mar 27 16:28:39 PDT 2013
Ondrej, thank you for the detailed report.
So it happens because of a call into a native library.
I am surprised that a call which is executed only once is causing this.
Question: can you build the HotSpot VM if I give you changes to test?
There is a place in the VM where it does some resetting when calling native
methods. Executing vzeroupper at that place may fix your problem.
I attached a patch which you can try.
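To make the state transitions from the quoted mail concrete, here is a
minimal sketch in C++. It is only a toy model of the A/B/C register states
(Agner's terminology, as used in this thread), not a measurement of real
hardware. The names Insn, YmmState, count_penalties and check_examples are
made up for illustration, and the model assumes that a penalty is paid on
every B -> C and C -> B transition and nowhere else.

```cpp
#include <cassert>
#include <vector>

// Instruction classes used in the examples of this thread.
enum class Insn { SSE, AVX128, AVX256, VZEROUPPER };

// YMM register states in Agner's terminology:
// A = upper halves known to be zero, B = upper halves in use,
// C = upper halves saved aside while legacy SSE code runs.
enum class YmmState { A, B, C };

// Count the penalised transitions (B -> C and C -> B) for a sequence of
// instructions, starting from the clean state A.
int count_penalties(const std::vector<Insn>& code) {
    YmmState state = YmmState::A;
    int penalties = 0;
    for (Insn insn : code) {
        switch (insn) {
        case Insn::VZEROUPPER:
            state = YmmState::A;              // always back to the clean state
            break;
        case Insn::AVX128:
        case Insn::AVX256:
            if (state == YmmState::C) {       // VEX code after SSE with saved uppers
                ++penalties;
                state = YmmState::B;
            } else if (insn == Insn::AVX256) {
                state = YmmState::B;          // 256-bit code dirties the upper halves
            }
            break;
        case Insn::SSE:
            if (state == YmmState::B) {       // SSE code with dirty upper halves
                ++penalties;
                state = YmmState::C;
            }
            break;
        }
    }
    return penalties;
}

// The two sequences from the quoted mail: without and with vzeroupper.
void check_examples() {
    assert(count_penalties({Insn::AVX256, Insn::AVX128, Insn::SSE,
                            Insn::AVX128, Insn::SSE, Insn::AVX128}) == 4);
    assert(count_penalties({Insn::AVX256, Insn::VZEROUPPER, Insn::AVX128,
                            Insn::SSE, Insn::AVX128, Insn::SSE,
                            Insn::AVX128}) == 0);
}
```

In this model a single missing vzeroupper makes every later switch between
SSE and AVX 128 code pay a penalty, which would match the "chain reaction"
described in the quoted mail.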
Regards,
Vladimir
On 3/27/13 1:54 PM, Ondřej Perutka wrote:
> Hello again,
> I don't want to speak too soon but I think I found the problem. This
> mail is a bit longer, so first of all let me answer your questions (feel
> free to skip this part :-)).
>
> First, which Java version are you experimenting with?
>
>
> java version "1.7.0_09-icedtea"
> OpenJDK Runtime Environment (fedora-2.3.8.0.fc17-x86_64)
> OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)
>
>
> Could you clarify your statement?:
>
>
> Attached is the output from Intel's Software Development Emulator and
> a memory map of the emulated application. The SDE output lists the
> time-consuming transitions between AVX and SSE instructions. The
> important lines are those with a non-zero number of "Dynamic AVX to SSE
> Transitions". This means a transition from state B to state C of the
> YMM registers (using Agner's terminology). Each line has the address of
> the "State Change Block". In this case it should be the address of the
> block which put the registers into state B (and did not call vzeroupper
> to put them into state A). Almost all of those addresses point
> somewhere into anonymous memory.
>
>
> Do you have call stacks?
>
>
> Unfortunately I cannot get call stacks because it's almost impossible to
> debug a JVM running within the SDE (it's terribly slow; it takes several
> minutes even to start my application). And without the SDE I cannot get
> those "State Change Block" addresses pointing to the anonymous memory, so
> I don't know where to put breakpoints.
>
>
> Does the problem happen only in this code or during calls into Libav?
>
> I'm not sure I understand your question, so I hope the answer is
> satisfactory :-) : I have not checked any other code but I'm pretty sure
> it would cause trouble in any other call to native code which mixes
> SSE and AVX 128 instructions.
>
>
> As I understand, it is only a problem if you mix SSE and AVX
> instructions in the same loop. Our JIT should not do that. It
> generates only AVX instructions (with VEX prefix) if AVX is available.
>
>
> It's not only a problem of some loop; it causes problems for the whole
> thread. (The OS is responsible for clearing the upper half of the YMM
> registers on context switching, so it should not affect other threads.)
> The whole problem starts if there is some VEX-prefixed 256-bit
> instruction not followed by the VZEROUPPER instruction. It leaves the
> upper half of some YMM register in use. If it is followed by a block of
> SSE instructions or, even worse, by a block which mixes SSE and 128-bit
> AVX instructions, it will cause time-consuming transition(s) between
> the states of the YMM registers.
>
> Here is a problematic example:
> ---------------------------------------------
>
> ...
> # some 256 bit AVX instruction:
> AVX 256
> ...
> # call to a function which mixes SSE and 128 bit AVX instructions:
> AVX 128
> SSE
> AVX 128
> SSE
> AVX 128
> ...
>
> And here is a correct example:
> ---------------------------------------------
>
> ...
> # some 256 bit AVX instruction:
> AVX 256
> ...
> vzeroupper
> ...
> # call to a function which mixes SSE and 128 bit AVX instructions:
> AVX 128
> SSE
> AVX 128
> SSE
> AVX 128
> ...
>
> ###############################
>
> OK, that's all about your e-mail, now about my findings. Thanks to Peter
> Levart, who wrote to me that it's possible to disable AVX instructions
> within the JVM using the -XX:UseAVX=0 command line option, I was able to
> filter the noise out of the SDE output. What remained was the
> problematic spot. It is the _dl_x86_64_save_sse function from the
> ld-2.15.so library. This function is called (indirectly) from the
> Java_sun_font_NativeFont_fontExists function, which is part of the
> libfontmanager.so library. I believe it's part of OpenJDK since it's
> located in my OpenJDK directory tree. Here is the call stack:
>
> #0 0x0000003579214e80 in _dl_x86_64_save_sse () from
> /lib64/ld-linux-x86-64.so.2
> #1 0x000000357920ad09 in _dl_lookup_symbol_x () from
> /lib64/ld-linux-x86-64.so.2
> #2 0x000000357920e2d4 in _dl_fixup () from /lib64/ld-linux-x86-64.so.2
> #3 0x00000035792148e5 in _dl_runtime_resolve () from
> /lib64/ld-linux-x86-64.so.2
> #4 0x00007fffdea841a8 in Java_sun_font_NativeFont_fontExists ()
> from
> /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.9.x86_64/jre/lib/amd64/libfontmanager.so
> #5 0x00007fffed011f90 in ?? ()
> #6 0x00007fffed005410 in ?? ()
> ...
>
> The Java_sun_font_NativeFont_fontExists function was called only once
> right after my application started.
>
> To prove it's the cause of all the problems I made a little hack to
> libfontmanager.so and added the vzeroupper instruction opcode before
> the return from Java_sun_font_NativeFont_fontExists. (I also removed the
> corresponding nop padding from the end of the function.) Everything
> has worked fine since then, even with AVX instructions enabled. It seems
> this forgotten call literally caused a chain reaction. Attached is the
> output from the emulator made after this hack. You can compare it with
> the previous one. (There are still some conflicts because vzeroupper is
> not called immediately after return from the _dl_runtime_resolve function.)
>
> I don't know whether the _dl_runtime_resolve function called directly
> from the Java_sun_font_NativeFont_fontExists function is part of ld's
> public API or not, so I don't know whether the bug is on the side of
> OpenJDK or binutils.
>
> Regards,
> Ondrej
>
>
> 2013/3/26 Vladimir Kozlov <vladimir.kozlov at oracle.com>
>
> Hi Ondrej
>
> we need more context from you.
>
> First, which Java version are you experimenting with?
>
> Currently, use of wide AVX vectors is only available in EA JDK 8.
> 7u4 added only the use of the VEX prefix.
>
> The "anonymous memory" is most likely our CodeCache, where the JIT
> places compiled native code.
>
> Could you clarify your statement?:
>
>
> >> The origin of almost all bad transitions from AVX -> SSE (I mean the
> >> code that uses AVX 256 instructions and does not call VZEROUPPER) is
> >> somewhere inside anonymous memory blocks (according to Intel's SDE
> >> and pmem).
>
> Do you have call stacks?
> Does the problem happen only in this code or during calls into Libav?
>
>
> >> to the same libraries with AVX disabled. The problem is definitely in
> >> "bad" transitions between SSE and AVX instructions. These transitions
> >> are costly in case the upper part of YMM registers is not zeroed using
> >> VZEROUPPER or VZEROALL instruction before using SSE. More details at
> >> http://agner.org/optimize/optimizing_assembly.pdf (section 13.6).
>
> As I understand, it is only a problem if you mix SSE and AVX
> instructions in the same loop. Our JIT should not do that. It
> generates only AVX instructions (with VEX prefix) if AVX is available.
>
> Regards,
> Vladimir
>
>
>
> On 3/26/13 11:34 AM, Christian Thalinger wrote:
>
> [That question should go to hotspot-compiler-dev. BCC'ed
> hotspot-runtime-dev.]
>
> On Mar 26, 2013, at 8:20 AM, Ondřej Perutka
> <perutka.ondrej at gmail.com> wrote:
>
> Hello,
> I've got a question related to my project. It is a Java wrapper for
> Libav libraries and I have some performance issues with it.
>
> If I compile the libraries with AVX instructions enabled the whole
> testing application uses approximately 130% of CPU time in comparison
> to the same libraries with AVX disabled. The problem is definitely in
> "bad" transitions between SSE and AVX instructions. These transitions
> are costly in case the upper part of YMM registers is not zeroed using
> VZEROUPPER or VZEROALL instruction before using SSE. More details at
> http://agner.org/optimize/optimizing_assembly.pdf (section 13.6).
>
> There is no problem with those libraries if they are not used from
> Java. I used Intel's Software Development Emulator to find those bad
> AVX <-> SSE transitions and I found thousands of them. The origin of
> almost all bad transitions from AVX -> SSE (I mean the code that uses
> AVX 256 instructions and does not call VZEROUPPER) is somewhere inside
> anonymous memory blocks (according to Intel's SDE and pmem).
>
> Libav mixes SSE and AVX 128 instructions a lot. It cannot cause any
> trouble if the upper part of YMM registers is zeroed. But in case it
> is not zeroed it would oscillate between B and C states (according to
> Agner's terminology). Both of these transitions cost quite a lot
> of CPU cycles.
>
> So here is my question: Is it possible that the JIT compiler compiles
> some bytecode into native instructions, uses some AVX 256 instructions,
> does not use VZEROUPPER and puts the result into some anonymous memory
> block?
>
> Ondrej Perutka
>
>
>
-------------- next part --------------
--- old/src/cpu/x86/vm/assembler_x86.cpp 2013-03-27 16:23:25.000000000 -0700
+++ new/src/cpu/x86/vm/assembler_x86.cpp 2013-03-27 16:23:24.000000000 -0700
@@ -3118,6 +3118,12 @@
emit_operand(dst, src);
}
+void Assembler::vzeroupper() {
+ assert(VM_Version::supports_avx(), "");
+ (void)vex_prefix_and_encode(xmm0, xmm0, xmm0, VEX_SIMD_NONE);
+ emit_int8(0x77);
+}
+
#ifndef _LP64
// 32bit only pieces of the assembler
--- old/src/cpu/x86/vm/assembler_x86.hpp 2013-03-27 16:23:25.000000000 -0700
+++ new/src/cpu/x86/vm/assembler_x86.hpp 2013-03-27 16:23:25.000000000 -0700
@@ -1612,6 +1612,11 @@
void vxorpd(XMMRegister dst, XMMRegister nds, Address src);
void vxorps(XMMRegister dst, XMMRegister nds, Address src);
+ // AVX instruction which is used to clear upper 128 bits of YMM registers and
+ // to avoid transition penalty between AVX and SSE states. There is no
+ // penalty if legacy SSE instructions are encoded using VEX prefix because
+ // they always clear upper 128 bits.
+ void vzeroupper();
protected:
// Next instructions require address alignment 16 bytes SSE mode.
--- old/src/cpu/x86/vm/sharedRuntime_x86_64.cpp 2013-03-27 16:23:26.000000000 -0700
+++ new/src/cpu/x86/vm/sharedRuntime_x86_64.cpp 2013-03-27 16:23:26.000000000 -0700
@@ -2103,6 +2103,10 @@
else if (CheckJNICalls ) {
__ call(RuntimeAddress(CAST_FROM_FN_PTR(address, StubRoutines::x86::verify_mxcsr_entry())));
}
+ if (VM_Version::supports_avx()) {
+ // Clear upper bits of YMM registers to avoid SSE <-> AVX transition penalty.
+ __ vzeroupper();
+ }
// Unpack native results.
--- old/src/cpu/x86/vm/templateInterpreter_x86_64.cpp 2013-03-27 16:23:26.000000000 -0700
+++ new/src/cpu/x86/vm/templateInterpreter_x86_64.cpp 2013-03-27 16:23:26.000000000 -0700
@@ -1087,6 +1087,10 @@
else if (CheckJNICalls) {
__ call(RuntimeAddress(CAST_FROM_FN_PTR(address, StubRoutines::x86::verify_mxcsr_entry())));
}
+ if (VM_Version::supports_avx()) {
+ // Clear upper bits of YMM registers to avoid SSE <-> AVX transition penalty.
+ __ vzeroupper();
+ }
// NOTE: The order of these pushes is known to frame::interpreter_frame_result
// in order to extract the result of a method call. If the order of these
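As a rough cross-check of the new Assembler::vzeroupper() in the patch:
assuming vex_prefix_and_encode() picks the two-byte VEX form here, the
emitted instruction should be the three bytes C5 F8 77. The standalone C++
sketch below rebuilds that byte sequence. The helpers vex2_byte,
encode_vzeroupper, encode_vzeroall and check_encodings are hypothetical
names for illustration only, not HotSpot functions.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Second byte of a two-byte (0xC5) VEX prefix.
// r_ext: extension bit for the ModRM.reg field (stored inverted)
// vvvv:  number of the extra source register, 0..15 (stored inverted)
// l256:  vector length bit, false = 128 bit, true = 256 bit
// pp:    implied legacy prefix, 0 = none, 1 = 66, 2 = F3, 3 = F2
constexpr std::uint8_t vex2_byte(bool r_ext, unsigned vvvv, bool l256,
                                 unsigned pp) {
    return static_cast<std::uint8_t>(((r_ext ? 0u : 1u) << 7) |
                                     ((~vvvv & 0xFu) << 3) |
                                     ((l256 ? 1u : 0u) << 2) |
                                     (pp & 0x3u));
}

// vzeroupper: VEX.128 form, no implied prefix, opcode 0x77.
inline std::array<std::uint8_t, 3> encode_vzeroupper() {
    return {{0xC5, vex2_byte(false, 0, false, 0), 0x77}};
}

// vzeroall: same opcode, but with the 256-bit length bit set.
inline std::array<std::uint8_t, 3> encode_vzeroall() {
    return {{0xC5, vex2_byte(false, 0, true, 0), 0x77}};
}

// Compare against the documented encodings C5 F8 77 and C5 FC 77.
inline void check_encodings() {
    assert(encode_vzeroupper()[1] == 0xF8);
    assert(encode_vzeroall()[1] == 0xFC);
}
```

Passing register number 0 (xmm0) for vvvv gives the all-ones field that
the instruction expects, because the field is stored inverted. That
matches the xmm0 arguments used in the patch.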