AVX 256 instructions in JDK7?

Wed Mar 27 16:28:39 PDT 2013

Ondrej, thank you for the detailed report.

So it happens because of a call into a native library.
I am surprised that a call which is executed only once is causing this.

Question: can you build the HotSpot VM if I give you changes to test?

There is a place in the VM where it does some resetting when calling native
methods. Executing vzeroupper at that place may fix your problem.
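
As a rough illustration of why one missing vzeroupper can affect a whole thread, here is a toy C++ model of the YMM state transitions as Ondrej described them (states A, B, C in Agner's terminology). The type and function names are made up for this sketch, and the rule that vzeroupper always returns to state A at no cost is a simplifying assumption; it does not measure anything on real hardware.

```cpp
#include <cassert>
#include <vector>

// Toy model only. States follow the A/B/C naming used in this thread:
// A = upper halves clean, B = upper halves dirty, C = upper halves saved.
enum class YmmState { A_Clean, B_Dirty, C_Saved };
enum class Insn { Sse, Avx128, Avx256, VZeroUpper };

// Returns the next state and bumps `penalties` on a costly B<->C transition.
inline YmmState step(YmmState s, Insn i, int& penalties) {
  switch (i) {
    case Insn::VZeroUpper:
      return YmmState::A_Clean;            // assumption: always free
    case Insn::Avx256:
      if (s == YmmState::C_Saved) ++penalties;
      return YmmState::B_Dirty;            // 256-bit use dirties the upper half
    case Insn::Avx128:
      if (s == YmmState::C_Saved) { ++penalties; return YmmState::B_Dirty; }
      return s;                            // A stays A, B stays B
    case Insn::Sse:
      if (s == YmmState::B_Dirty) { ++penalties; return YmmState::C_Saved; }
      return s;                            // A stays A, C stays C
  }
  assert(false && "unknown instruction kind");
  return s;
}

// Counts penalized transitions for a straight-line instruction sequence,
// starting from the clean state.
inline int count_penalties(const std::vector<Insn>& code) {
  YmmState s = YmmState::A_Clean;
  int penalties = 0;
  for (Insn i : code) s = step(s, i, penalties);
  return penalties;
}
```

With the sequence from the problematic example in the quoted mail (AVX 256, then AVX 128 and SSE alternating) this model counts four penalized transitions; with vzeroupper placed after the 256-bit instruction it counts none.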

I attached a patch which you can try.
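
For reference, a small sketch of the bytes the new Assembler::vzeroupper() in the patch is expected to emit, assuming the assembler picks the two-byte VEX form (C5 F8 77). The helper names below are hypothetical and are not part of HotSpot; they only spell out how the second prefix byte is put together.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Builds the second byte of a 2-byte (0xC5) VEX prefix.
// `r` and `vvvv` are given in logical form; the encoding stores them inverted.
// `l256` selects the 256-bit vector length, `pp` is the SIMD prefix field
// (0 means no prefix).
inline std::uint8_t vex2_byte(bool r, unsigned vvvv, bool l256, unsigned pp) {
  assert(vvvv < 16 && pp < 4);
  return static_cast<std::uint8_t>(((r ? 0u : 1u) << 7) |
                                   ((~vvvv & 0xFu) << 3) |
                                   ((l256 ? 1u : 0u) << 2) |
                                   pp);
}

// vzeroupper: VEX prefix with L=0 and no SIMD prefix, followed by opcode 0x77.
inline std::array<std::uint8_t, 3> encode_vzeroupper() {
  return {0xC5, vex2_byte(false, 0, false, 0), 0x77};
}
```

Setting the vector-length bit in the same helper gives the prefix byte for vzeroall, which shares opcode 0x77.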

Regards,
Vladimir

On 3/27/13 1:54 PM, Ondřej Perutka wrote:
> Hello again,
> I don't want to speak too soon but I think I found the problem. This
> mail is a bit longer, so first of all let me answer your questions (feel
> free to skip this part :-)).
>
>     First, what Java version are you experimenting with?
>
>
> java version "1.7.0_09-icedtea"
> OpenJDK Runtime Environment (fedora-2.3.8.0.fc17-x86_64)
> OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)
>
>
>     Could you clarify your statement?:
>
>
> Attached is the output from Intel's Software Development Emulator and
> a memory map of the emulated application. The SDE output lists the
> time-consuming transitions between AVX and SSE instructions. The
> important lines are those with a non-zero number of "Dynamic AVX to SSE
> Transitions". This means a transition from state B to state C of the
> YMM registers (using Agner's terminology). Each line has the address of the
> "State Change Block". In this case it should be the address of the block
> which put the registers into state B (and did not call vzeroupper to
> put them into state A). Almost all of those addresses point
> somewhere into anonymous memory.
>
>
>     Do you have call stacks?
>
>
> Unfortunately I cannot get call stacks because it's almost impossible to
> debug JVM running within the SDE (it's terribly slow, it takes several
> minutes even to start my application). And without SDE I cannot get
> those "State Change Block" addresses pointing to the anonymous memory so
> I don't know where to put breakpoints.
>
>
>     Does the problem happen only in this code or during calls into Libav?
>
> I'm not sure I understand your question so I hope the answer is
> satisfactory :-) : I have not checked any other code but I'm pretty sure
> it would cause trouble in any other call to a native code which mixes
> SSE and AVX 128 instructions.
>
>
>     As I understand, it is only a problem if you mix SSE and AVX
>     instructions in the same loop. Our JIT should not do that. It
>     generates only AVX instructions (with VEX prefix) if AVX is available.
>
>
>   It's not only a problem of some loop; it affects the whole
> thread. (The OS is responsible for clearing the upper half of the YMM
> registers on context switches, so it should not affect other threads.) The
> whole problem starts if there is some VEX-prefixed 256-bit instruction not
> followed by the VZEROUPPER instruction. It leaves the upper half of some YMM
> register in use. And if it is followed by some block of SSE instructions
> or, even worse, by some block which mixes SSE and 128-bit AVX instructions,
> it will cause time-consuming transition(s) between the states of the YMM
> registers.
>
> Here is a problematic example:
> ---------------------------------------------
>
> ...
> # some 256 bit AVX instruction:
> AVX 256
> ...
> # call to a function which mixes SSE and 128 bit AVX instructions:
> AVX 128
> SSE
> AVX 128
> SSE
> AVX 128
> ...
>
> And here is a correct example:
> ---------------------------------------------
>
> ...
> # some 256 bit AVX instruction:
> AVX 256
> ...
> vzeroupper
> ...
> # call to a function which mixes SSE and 128 bit AVX instructions:
> AVX 128
> SSE
> AVX 128
> SSE
> AVX 128
> ...
>
> ###############################
>
> OK, that's all about your e-mail, now about my findings. Thanks to Peter
> Levart, who wrote to me that it's possible to disable AVX instructions
> within the JVM using the -XX:UseAVX=0 command line option, I was able to
> filter the noise out of the SDE output. What remained was the problematic
> spot. It is the _dl_x86_64_save_sse function from the ld-2.15.so
> library. This function is called (indirectly) from the
> Java_sun_font_NativeFont_fontExists function which is a part of the
> libfontmanager.so library. I believe it's a part of OpenJDK since it's
> located in my OpenJDK directory tree. Here is the call stack:
>
> #0 0x0000003579214e80 in _dl_x86_64_save_sse () from
> /lib64/ld-linux-x86-64.so.2
> #1 0x000000357920ad09 in _dl_lookup_symbol_x () from
> /lib64/ld-linux-x86-64.so.2
> #2 0x000000357920e2d4 in _dl_fixup () from /lib64/ld-linux-x86-64.so.2
> #3 0x00000035792148e5 in _dl_runtime_resolve () from
> /lib64/ld-linux-x86-64.so.2
> #4 0x00007fffdea841a8 in Java_sun_font_NativeFont_fontExists ()
> from
> /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.9.x86_64/jre/lib/amd64/libfontmanager.so
> #5 0x00007fffed011f90 in ?? ()
> #6 0x00007fffed005410 in ?? ()
> ...
>
> The Java_sun_font_NativeFont_fontExists function was called only once
> right after my application started.
>
> To prove it's the cause of all problems I made a little hack to the
> libfontmanager.so and added the vzeroupper instruction opcode before
> return from the Java_sun_font_NativeFont_fontExists. (I also removed
> corresponding nop padding from the end of the function.) Everything
> works fine since then even with enabled AVX instructions. It seems this
> forgotten call caused literally a chain reaction. Attached is output
> from the emulator made after this hack. You can compare it with the
> previous one. (There are still some conflicts because the vzeroupper is
> not called immediately after return from the _dl_runtime_resolve function.)
>
> I don't know whether the _dl_runtime_resolve function called directly
> from the Java_sun_font_NativeFont_fontExists function is a part of ld's
> public API or not, so I don't know whether the bug is on side of OpenJDK
> or binutils.
>
> Regards,
> Ondrej
>
>
> 2013/3/26 Vladimir Kozlov <vladimir.kozlov at oracle.com>
>
>     Hi Ondrej
>
>     we need more context from you.
>
>     First, what Java version are you experimenting with?
>
>     Currently the usage of AVX wide vectors is only available in EA jdk8.
>     In 7u4 only the usage of the VEX prefix was added.
>
>     The "anonymous memory" is most likely our CodeCache where JIT places
>     compiled native code.
>
>     Could you clarify your statement?:
>
>
>      >> The origin of almost all bad transitions from AVX -> SSE (I mean
>     the code that uses
>      >> AVX 256 instructions and does not call VZEROUPPER) is somewhere
>     inside
>      >> anonymous memory blocks (according to Intel's SDE and pmem).
>
>     Do you have call stacks?
>     Does the problem happen only in this code or during calls into Libav?
>
>
>      >> to the same libraries with AVX disabled. The problem is
>     definitely in
>      >> "bad" transitions between SSE and AVX instructions. These
>     transitions
>      >> are costly in case the upper part of YMM registers is not zeroed
>     using
>      >> VZEROUPPER or VZEROALL instruction before using SSE. More details at
>      >> http://agner.org/optimize/optimizing_assembly.pdf (section 13.6).
>
>     As I understand, it is only a problem if you mix SSE and AVX
>     instructions in the same loop. Our JIT should not do that. It
>     generates only AVX instructions (with VEX prefix) if AVX is available.
>
>     Regards,
>     Vladimir
>
>
>
>     On 3/26/13 11:34 AM, Christian Thalinger wrote:
>
>         [That question should go to hotspot-compiler-dev.  BCC'ed
>         hotspot-runtime-dev.]
>
>         On Mar 26, 2013, at 8:20 AM, Ondřej Perutka
>         <perutka.ondrej at gmail.com> wrote:
>
>             Hello,
>             I've got a question related to my project. It is Java
>             wrapper for
>             Libav libraries and I have some performance issues with it.
>
>             If I compile the libraries with AVX instructions enabled the
>             whole
>             testing application uses approximately 130% of CPU time in
>             comparison
>             to the same libraries with AVX disabled. The problem is
>             definitely in
>             "bad" transitions between SSE and AVX instructions. These
>             transitions
>             are costly in case the upper part of YMM registers is not
>             zeroed using
>             VZEROUPPER or VZEROALL instruction before using SSE. More
>             details at
>             http://agner.org/optimize/optimizing_assembly.pdf (section
>             13.6).
>
>             There is no problem with those libraries, if they are not
>             used from
>             Java. I used Intel's Software Development Emulator to find
>             those bad AVX
>             <-> SSE transitions and I found thousands of them. The origin of
>             almost all bad transitions from AVX -> SSE (I mean the code
>             that uses
>             AVX 256 instructions and does not call VZEROUPPER) is
>             somewhere inside
>             anonymous memory blocks (according to Intel's SDE and pmem).
>
>             Libav mixes SSE and AVX 128 instructions a lot. It cannot
>             cause any
>             trouble if the upper part of YMM registers is zeroed. But in
>             case it
>             is not zeroed it would oscillate between B and C states
>             (according to
>             Agner's terminology). Both of these transitions cost
>             quite a lot
>             of CPU cycles.
>
>             So here is my question: Is it possible that JIT compiler
>             compiles some
>             bytecode into native instructions, uses some AVX 256
>             instructions,
>             does not use VZEROUPPER and puts the result into some
>             anonymous memory
>             block?
>
>             Ondrej Perutka
>
>
>
-------------- next part --------------

--- old/src/cpu/x86/vm/assembler_x86.cpp	2013-03-27 16:23:25.000000000 -0700
+++ new/src/cpu/x86/vm/assembler_x86.cpp	2013-03-27 16:23:24.000000000 -0700
@@ -3118,6 +3118,12 @@
   emit_operand(dst, src);
 }
 
+void Assembler::vzeroupper() {
+  assert(VM_Version::supports_avx(), "");
+  (void)vex_prefix_and_encode(xmm0, xmm0, xmm0, VEX_SIMD_NONE);
+  emit_int8(0x77);
+}
+
 
 #ifndef _LP64
 // 32bit only pieces of the assembler
--- old/src/cpu/x86/vm/assembler_x86.hpp	2013-03-27 16:23:25.000000000 -0700
+++ new/src/cpu/x86/vm/assembler_x86.hpp	2013-03-27 16:23:25.000000000 -0700
@@ -1612,6 +1612,11 @@
   void vxorpd(XMMRegister dst, XMMRegister nds, Address src);
   void vxorps(XMMRegister dst, XMMRegister nds, Address src);
 
+  // AVX instruction which is used to clear upper 128 bits of YMM registers and
+  // to avoid transition penalty between AVX and SSE states. There is no
+  // penalty if legacy SSE instructions are encoded using VEX prefix because
+  // they always clear upper 128 bits.
+  void vzeroupper();
 
  protected:
   // Next instructions require address alignment 16 bytes SSE mode.
--- old/src/cpu/x86/vm/sharedRuntime_x86_64.cpp	2013-03-27 16:23:26.000000000 -0700
+++ new/src/cpu/x86/vm/sharedRuntime_x86_64.cpp	2013-03-27 16:23:26.000000000 -0700
@@ -2103,6 +2103,10 @@
     else if (CheckJNICalls ) {
       __ call(RuntimeAddress(CAST_FROM_FN_PTR(address, StubRoutines::x86::verify_mxcsr_entry())));
     }
+    if (VM_Version::supports_avx()) {
+      // Clear upper bits of YMM registers to avoid SSE <-> AVX transition penalty.
+      __ vzeroupper();
+    }
 
 
   // Unpack native results.
--- old/src/cpu/x86/vm/templateInterpreter_x86_64.cpp	2013-03-27 16:23:26.000000000 -0700
+++ new/src/cpu/x86/vm/templateInterpreter_x86_64.cpp	2013-03-27 16:23:26.000000000 -0700
@@ -1087,6 +1087,10 @@
   else if (CheckJNICalls) {
     __ call(RuntimeAddress(CAST_FROM_FN_PTR(address, StubRoutines::x86::verify_mxcsr_entry())));
   }
+  if (VM_Version::supports_avx()) {
+    // Clear upper bits of YMM registers to avoid SSE <-> AVX transition penalty.
+    __ vzeroupper();
+  }
 
   // NOTE: The order of these pushes is known to frame::interpreter_frame_result
   // in order to extract the result of a method call. If the order of these