JVM support for arbitrary machine code loading and execution

Fri Dec 4 17:04:22 UTC 2015

Hi,

FYI I've been exploring machine code snippets recently:

    hs: http://hg.openjdk.java.net/panama/panama/hotspot/rev/cd901fd90596
   jdk: http://hg.openjdk.java.net/panama/panama/jdk/rev/091cdacf28e5

Samples:
  [1] MachineCodeSnippetSamples

http://hg.openjdk.java.net/panama/panama/jdk/file/091cdacf28e5/test/panama/snippets/MachineCodeSnippetSamples.java

  [2] CPUID

http://hg.openjdk.java.net/panama/panama/jdk/file/091cdacf28e5/test/panama/snippets/CPUID.java

  [3] VectorUtils

http://hg.openjdk.java.net/panama/panama/jdk/file/091cdacf28e5/test/panama/snippets/VectorUtils.java

  [4] VectorizedHashCode

http://hg.openjdk.java.net/panama/panama/jdk/file/091cdacf28e5/test/panama/snippets/VectorizedHashCode.java

The main assumption is that the JVM doesn't know anything about the 
machine code being loaded and uses it "as is".

It provides an easy route to experiment with hardware-specific 
instruction sequences without any JVM modifications. As an alternative, 
it is possible to write the desired code shape in native code and call 
it using JNI (or native method handles in Panama). Unfortunately, that 
doesn't work well for small sequences (up to a dozen instructions long), 
where invocation and argument marshaling overhead dominates the actual 
work. So far, the only other way was to teach the JVM JIT compilers new 
instructions and introduce JVM intrinsics, which is quite a laborious 
task.

In order to wire a machine code snippet to Java code, some implicit 
conventions about argument and return value mappings are needed 
(a calling convention). The method type of the corresponding method 
handle plays a central role here and provides a complete description of 
the wiring.

The current prototype focuses on x64. The conventions are based on the 
System V AMD64 ABI (6 registers for integer/pointer arguments + 8 
registers for floating-point arguments), with some adjustments (more on 
that later).
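
A rough sketch of the register assignment described above, written 
against the standard java.lang.invoke API. SysVArgumentMapper is a 
hypothetical helper, not part of the prototype. It only models the plain 
System V order (rdi, rsi, rdx, rcx, r8, r9 and xmm0-xmm7) and does not 
model the prototype's adjustments or stack-passed arguments:

```java
import java.lang.invoke.MethodType;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (an assumption for illustration): assigns System V
// AMD64 argument registers to the parameters of a MethodType.
final class SysVArgumentMapper {
    private static final String[] INT_REGS =
        {"rdi", "rsi", "rdx", "rcx", "r8", "r9"};
    private static final String[] FP_REGS =
        {"xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7"};

    static List<String> assign(MethodType type) {
        List<String> regs = new ArrayList<>();
        int nextInt = 0;
        int nextFp = 0;
        for (Class<?> p : type.parameterList()) {
            if (p == float.class || p == double.class) {
                // Floating-point arguments take the next free xmm register.
                if (nextFp == FP_REGS.length) {
                    throw new IllegalArgumentException("out of FP registers");
                }
                regs.add(FP_REGS[nextFp++]);
            } else {
                // Integers, longs and object references take the next
                // free general-purpose register.
                if (nextInt == INT_REGS.length) {
                    throw new IllegalArgumentException("out of int registers");
                }
                regs.add(INT_REGS[nextInt++]);
            }
        }
        return regs;
    }

    public static void main(String[] args) {
        MethodType type = MethodType.methodType(void.class,
            Object.class, long.class, Object.class, long.class);
        System.out.println(assign(type)); // prints [rdi, rsi, rdx, rcx]
    }
}
```

For a (Object, long, Object, long)void type this yields rdi, rsi, rdx, 
rcx, in that order.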

The implementation relies heavily on native method handles, but also 
introduces specific optimizations in C2 (e.g. snippet inlining).

HOW TO USE

A user has to provide (1) machine code snippet; and (2) method type:

    MethodHandle jdk.internal.panama.CodeSnippet.make(
        String     name,
        MethodType type,
        boolean    isSupported,
        int...     code)

The framework wraps the machine code and returns a method handle of the 
requested type. The type is used to map Java arguments to registers.

The constructed method handle is fully functional and can be bound to an 
invokedynamic instruction or invoked using the MH.invokeExact(), 
MH.invoke(), or MH.invokeWithArguments() methods.
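
The three invocation styles differ in how strictly the call site has to 
match the handle's type. A minimal sketch, using a handle to 
Math.addExact as a stand-in (an assumption made so the code runs on a 
stock JDK, since CodeSnippet only exists in the prototype build):

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

final class InvocationModes {
    // Stand-in for a snippet handle: Math.addExact with type (int,int)int.
    static final MethodHandle ADD = lookupAdd();

    private static MethodHandle lookupAdd() {
        try {
            return MethodHandles.lookup().findStatic(Math.class, "addExact",
                MethodType.methodType(int.class, int.class, int.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    static int viaInvokeExact(int a, int b) {
        try {
            // Call site type must be exactly (int,int)int, cast included.
            return (int) ADD.invokeExact(a, b);
        } catch (Throwable e) {
            throw new Error(e);
        }
    }

    static int viaInvoke(int a, int b) {
        try {
            Integer boxed = a;
            // Call site type is (Integer,int)int; invoke() adapts it
            // as if by asType(), unboxing the first argument.
            return (int) ADD.invoke(boxed, b);
        } catch (Throwable e) {
            throw new Error(e);
        }
    }

    static int viaInvokeWithArguments(int a, int b) {
        try {
            // Arguments and result travel as Objects (boxed).
            return (Integer) ADD.invokeWithArguments(a, b);
        } catch (Throwable e) {
            throw new Error(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(viaInvokeExact(2, 3));
        System.out.println(viaInvoke(2, 3));
        System.out.println(viaInvokeWithArguments(2, 3));
    }
}
```

invokeExact() is the variant used in the examples in this mail, since it 
needs no argument adaptation at the call site.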

For convenience purposes, when isSupported=false is passed, the 
framework constructs a method handle which always throws an exception.
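
A handle of an arbitrary type that always throws can be assembled from 
standard java.lang.invoke combinators. The sketch below is only an 
illustration of the idea, not the prototype's code: the class name, the 
choice of UnsupportedOperationException, and the message text are all 
assumptions.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

// Hypothetical helper (an assumption for illustration).
final class UnsupportedSnippet {
    static MethodHandle alwaysThrowing(String name, MethodType type) {
        // Handle of type (UnsupportedOperationException)R that throws
        // its argument, where R is the requested return type.
        MethodHandle thrower = MethodHandles.throwException(
            type.returnType(), UnsupportedOperationException.class);
        // Bind one preallocated exception; it is shared by all calls.
        MethodHandle bound = MethodHandles.insertArguments(thrower, 0,
            new UnsupportedOperationException(
                "snippet not supported: " + name));
        // Accept and ignore the declared parameters so that the
        // resulting handle has exactly the requested type.
        return MethodHandles.dropArguments(bound, 0, type.parameterList());
    }

    // Returns true if invoking the handle throws the expected exception.
    static boolean throwsWhenInvoked(MethodHandle mh, Object... args) {
        try {
            mh.invokeWithArguments(args);
            return false;
        } catch (UnsupportedOperationException expected) {
            return true;
        } catch (Throwable other) {
            throw new Error(other);
        }
    }

    public static void main(String[] args) {
        MethodType type = MethodType.methodType(void.class,
            Object.class, long.class, Object.class, long.class);
        MethodHandle mh = alwaysThrowing("move256", type);
        System.out.println(mh.type().equals(type));
        System.out.println(
            throwsWhenInvoked(mh, new byte[32], 0L, new byte[32], 0L));
    }
}
```

Callers can then keep a single code path and only pay for the check when 
the snippet is actually invoked on unsupported hardware.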

Also, for diagnostic purposes, it's possible to assign a name to the 
snippet.

EXAMPLE

256-bit memory-to-memory move with double-register addressing mode:

   MethodHandle mov256MH = CodeSnippet.make("move256",
     MethodType.methodType(void.class,            // return type
                           Object.class /*rdi*/,  // src
                           long.class   /*rsi*/,  // offset
                           Object.class /*rdx*/,  // dst
                           long.class   /*rcx*/), // offset
     CPUID.has(AVX),
     0xC4, 0xE1, 0x7E, 0x6F, 0x04, 0x37,  // vmovdqu ymm0,[rsi+rdi]
     0xC4, 0xE1, 0x7E, 0x7F, 0x04, 0x0A); // vmovdqu [rdx+rcx],ymm0

   static void move256(Object src, long off1, Object dst, long off2) {
       try {
           mov256MH.invokeExact(src, off1, dst, off2);
       } catch (Throwable e) {
           throw new Error(e);
       }
   }

   byte[] src = ...; long off1 = ...;
   byte[] dst = ...; long off2 = ...;
   move256(src, off1, dst, off2);

Other examples: MachineCodeSnippetSamples [1], CPUID [2].

IMPLEMENTATION DETAILS

There are 2 execution modes:
   (1) non-optimized (called from interpreter/C1/non-constant MH in C2);

   (2) optimized (a snippet is inlined by C2).

The non-optimized case is supported by a stand-alone version generated 
on-the-fly as a native function.

C2 can either issue a direct native call or try to inline the snippet.
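
Whether C2 sees a method handle as constant generally depends on how the 
handle is stored. A small sketch of the two situations, with Math.abs as 
a stand-in target (an assumption; the exact conditions the prototype 
applies are not spelled out here). In HotSpot, a handle read from a 
static final field can be treated as a compile-time constant, while one 
read from a mutable field cannot:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

final class HandleConstness {
    // static final: the JIT may treat the handle as a constant and
    // inline through invokeExact().
    static final MethodHandle CONSTANT_MH = lookupAbs();

    // Mutable field: the target can change, so the handle is not a
    // constant for the JIT and a generic invocation path is used.
    static MethodHandle mutableMh = CONSTANT_MH;

    private static MethodHandle lookupAbs() {
        try {
            return MethodHandles.lookup().findStatic(Math.class, "abs",
                MethodType.methodType(int.class, int.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    static int viaConstant(int x) {
        try {
            return (int) CONSTANT_MH.invokeExact(x);
        } catch (Throwable e) {
            throw new Error(e);
        }
    }

    static int viaMutable(int x) {
        try {
            return (int) mutableMh.invokeExact(x);
        } catch (Throwable e) {
            throw new Error(e);
        }
    }

    public static void main(String[] args) {
        // Both calls compute the same value; only the compiled code
        // shape is expected to differ.
        System.out.println(viaConstant(-7));
        System.out.println(viaMutable(-7));
    }
}
```

This is why the examples in this mail keep their snippet handles in 
fields that are initialized once and never reassigned.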

   Stand-alone snippet version for move256 (-XX:+PrintCodeSnippets):
     Decoding code snippet "mov256" @ 0x1049bc460
       0x1049bc460: push   %rbp
       0x1049bc461: mov    %rsp,%rbp
       0x1049bc464: vmovdqu (%rdi,%rsi,1),%ymm0
       0x1049bc46a: vmovdqu %ymm0,(%rdx,%rcx,1)
       0x1049bc470: leaveq
       0x1049bc471: retq

   C2 compiled version (w/ snippet inline):
     # {method} {0x115c2f880} 'mov256' '(Ljava/lang/Object;JLjava/lang/Object;J)V'
     # parm0:    rsi:rsi   = 'java/lang/Object'
     # parm1:    rdx:rdx   = long
     # parm2:    rcx:rcx   = 'java/lang/Object'
     # parm3:    r8:r8     = long
     #           [sp+0x20]  (sp of caller)

     0x1051bd560: mov    %eax,-0x16000(%rsp)
     0x1051bd567: push   %rbp
     0x1051bd568: sub    $0x10,%rsp

     0x1051bd56c: mov    %rsi,%rdi
     0x1051bd56f: mov    %rdx,%rsi
     0x1051bd572: mov    %rcx,%rdx
     0x1051bd575: mov    %r8,%rcx

     0x1051bd578: vmovdqu (%rdi,%rsi,1),%ymm0
     0x1051bd57e: vmovdqu %ymm0,(%rdx,%rcx,1)

     0x1051bd584: add    $0x10,%rsp
     0x1051bd588: pop    %rbp
     0x1051bd589: test   %eax,-0x4d3d58f(%rip)

     0x1051bd58f: retq

VECTOR SUPPORT

Initially, my main motivation was to play with vector instructions and 
experiment with how to efficiently pass vector values around in Java & 
generated code.

There are 3 new wrappers: java.lang.Long2, java.lang.Long4, and 
java.lang.Long8, which represent 128-, 256-, and 512-bit vector values 
respectively.

When a j.l.Long2/4/8 argument is passed into the snippet, it is unboxed 
into a vector register (xmm*/ymm*/zmm*). (This means vector arguments 
compete with floating-point arguments for the same registers, so the 
register allocator treats them as floating-point arguments.) If a vector 
value is returned, it is expected in xmm0/ymm0/zmm0 (depending on its 
size). It is automatically copied into a preallocated box, which is 
passed into the snippet as the first Object argument. The box for the 
return value is allocated implicitly on every invocation.

Example: vpaddd instruction (MachineCodeSnippetSamples [1]):

   MethodHandle vpaddMH = CodeSnippet.make("vpaddd",
         MethodType.methodType(Long4.class, Long4.class, Long4.class),
         requires(AVX),
         0xC5, 0xF5, 0xFE, 0xC0); // vpaddd %ymm0, %ymm1, %ymm0

   public static Long4 vadd(Long4 v1, Long4 v2) {
       try {
           return (Long4)vpaddMH.invokeExact(v1, v2);
       } catch (Throwable e) {
           throw new Error(e);
       }
   }
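
For reference, the lane-wise semantics of 256-bit vpaddd (eight 
independent 32-bit additions with wraparound) can be modeled in plain 
Java. This is only a scalar model for checking results. How Long4 
exposes its payload isn't shown in this mail, so the long[4] 
representation and the class name below are assumptions:

```java
// Hypothetical scalar model of vpaddd (an assumption for illustration).
final class VpadddModel {
    // Adds the two 32-bit lanes packed in each long independently.
    // A carry out of the low lane must not leak into the high lane.
    static long addLanes32(long a, long b) {
        long lo = (a + b) & 0xFFFFFFFFL;            // low lane, wraps
        long hi = ((a >>> 32) + (b >>> 32)) << 32;  // high lane, wraps
        return hi | lo;
    }

    // Models a 256-bit vpaddd on values stored as four longs each.
    static long[] vpaddd256(long[] v1, long[] v2) {
        if (v1.length != 4 || v2.length != 4) {
            throw new IllegalArgumentException("expected 4 longs (256 bits)");
        }
        long[] result = new long[4];
        for (int i = 0; i < 4; i++) {
            result[i] = addLanes32(v1[i], v2[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        long[] a = {0x00000001FFFFFFFFL, 1L, 2L, 3L};
        long[] b = {0x0000000100000001L, 1L, 2L, 3L};
        for (long lane : vpaddd256(a, b)) {
            System.out.println(Long.toHexString(lane));
        }
    }
}
```

A plain 64-bit addition would give a different answer for the first 
element, because the overflow of the low lane would carry into the high 
lane.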

Stand-alone snippet version:

       0x1049bc4e0: push   %rbp
       0x1049bc4e1: mov    %rsp,%rbp
       0x1049bc4e4: vmovdqu 0x10(%rsi),%ymm0 ; unbox arguments
       0x1049bc4e9: vmovdqu 0x10(%rdx),%ymm1

       0x1049bc4ee: vpaddd %ymm0,%ymm1,%ymm0 ; snippet

       0x1049bc4f2: mov    %rdi,%rax
       0x1049bc4f5: vmovdqu %ymm0,0x10(%rax) ; box return value
       0x1049bc4fa: leaveq
       0x1049bc4fb: retq

C2 compiled version (w/ snippet inline):

       # {method} {0x115b2c0e8} 'vadd' '(Ljava/lang/Long4;Ljava/lang/Long4;)Ljava/lang/Long4;'
       # parm0:    rsi:rsi   = 'java/lang/Long4'
       # parm1:    rdx:rdx   = 'java/lang/Long4'
       #           [sp+0x30]  (sp of caller)
       [...]
       0x1051c9a73: movabs $0x7c0011dc8,%rsi  ; {metadata('java/lang/Long4')}
       0x1051c9a7d: nop
       0x1051c9a7e: nop
       0x1051c9a7f: nop
       0x1051c9a80: vzeroupper
       0x1051c9a83: callq  0x10516f260        ;   {runtime_call _new_instance_Java}

       0x1051c9a88: mov    %rax,%rbx          ; unbox arguments
       0x1051c9a8b: vmovdqu 0x10(%rbp),%ymm0  ;
       0x1051c9a90: mov    (%rsp),%r10        ;
       0x1051c9a94: vmovdqu 0x10(%r10),%ymm1  ;

       0x1051c9a9a: vpaddd %ymm0,%ymm1,%ymm0  ; snippet

       0x1051c9a9e: vmovdqu %ymm0,0x10(%rbx)  ; box return value
       0x1051c9aa3: mov    %rbx,%rax

       [...]

       0x1051c9ab4: retq

C2 aggressively tries to eliminate excessive boxing-unboxing when 
multiple snippets are used, so most of the allocations are actually 
eliminated and values are passed in vector registers:

   Long4 testVAdd(Long4 v1, Long4 v2, Long4 v3) {
       return vadd(vadd(v1, v2), v3);
   }

       # {method} {0x11532ca28} 'testVAdd' '(Ljava/lang/Long4;Ljava/lang/Long4;Ljava/lang/Long4;)Ljava/lang/Long4;' in 'Main'
       # parm0:    rsi:rsi   = 'java/lang/Long4'
       # parm1:    rdx:rdx   = 'java/lang/Long4'
       # parm2:    rcx:rcx   = 'java/lang/Long4'
       #           [sp+0x40]  (sp of caller)
       [...]
       0x1049d28ec: mov    %rcx,%rbp
       0x1049d28ef: vmovdqu 0x10(%rdx),%ymm1
       0x1049d28f4: vmovdqu 0x10(%rsi),%ymm0
       0x1049d28f9: vpaddd %ymm0,%ymm1,%ymm0  ; snippet #1
       0x1049d28fd: vmovdqu %ymm0,(%rsp)

       0x1049d2902: movabs $0x7c0011dc8,%rsi  ; {metadata('java/lang/Long4')}
       0x1049d290c: vzeroupper
       0x1049d290f: callq  0x10496fae0        ;   {runtime_call _new_instance_Java}

       0x1049d2914: mov    %rax,%rbx
       0x1049d2917: vmovdqu 0x10(%rbp),%ymm1
       0x1049d291c: vmovdqu (%rsp),%ymm0

       0x1049d2921: vpaddd %ymm0,%ymm1,%ymm0  ; snippet #2

       0x1049d2925: vmovdqu %ymm0,0x10(%rbx)

       0x1049d292a: mov    %rbx,%rax

       [...]

       0x1049d293b: retq

For a more complex example, see the vectorized hash code implementation 
(VectorizedHashCode [4]).

IDEAS FOR FURTHER IMPROVEMENTS

Keep in mind that the prototype is raw and, though it passes JPRT, it 
hasn't been tested extensively.

There are some inefficiencies in generated code:

   (1) too many spills of vector values;
This is a direct consequence of the current representation of a code 
snippet as a native call at the IR level. It has an unfortunate effect 
on XMM registers, which aren't preserved across function calls. Of 
course, the CallSnippet node conventions could be adjusted to make some 
vector registers callee-saved.

   (2) C2 doesn't constant fold loads from value boxes;
Constant vector boxes aren't constant folded and stay as embedded oops + 
vector loads on every use. It would be nice to avoid loading values 
from memory and inject them directly into the generated code.

One of the root causes of (1) is that the input registers are fixed, so 
there's contention on them when multiple snippets are in play. Register 
renaming in the snippets would alleviate the problem, but it requires 
the JVM to be able to parse and modify the machine code of the snippet 
when it is inlined.

Also, the ability to describe machine code effects (e.g. memory state 
changes) would allow more efficient code to be generated.

Stay tuned! Thanks!

Best regards,
Vladimir Ivanov