JVM support for arbitrary machine code loading and execution

Fri Dec 4 17:11:05 UTC 2015

Wow, that's awesome.  Could something like this ever be exposed to user
code (Java's version of inline asm)? (ducks under the desk :))

On Fri, Dec 4, 2015 at 12:04 PM, Vladimir Ivanov <
vladimir.x.ivanov at oracle.com> wrote:

> Hi,
>
> FYI I've been exploring machine code snippets recently:
>
>    hs: http://hg.openjdk.java.net/panama/panama/hotspot/rev/cd901fd90596
>   jdk: http://hg.openjdk.java.net/panama/panama/jdk/rev/091cdacf28e5
>
> Samples:
>  [1] MachineCodeSnippetSamples
>      http://hg.openjdk.java.net/panama/panama/jdk/file/091cdacf28e5/test/panama/snippets/MachineCodeSnippetSamples.java
>
>  [2] CPUID
>      http://hg.openjdk.java.net/panama/panama/jdk/file/091cdacf28e5/test/panama/snippets/CPUID.java
>
>  [3] VectorUtils
>      http://hg.openjdk.java.net/panama/panama/jdk/file/091cdacf28e5/test/panama/snippets/VectorUtils.java
>
>  [4] VectorizedHashCode
>      http://hg.openjdk.java.net/panama/panama/jdk/file/091cdacf28e5/test/panama/snippets/VectorizedHashCode.java
>
> The main assumption is that the JVM doesn't know anything about the machine
> code being loaded and uses it "as is".
>
> It provides an easy route to experiment with hardware-specific instruction
> sequences without any JVM modifications. As an alternative, it is possible
> to write the desired code shape in native code and call it using JNI (or
> native method handles in Panama). Unfortunately, that doesn't work well for
> small sequences (up to about a dozen instructions), where invocation and
> argument marshaling overhead dominates the actual work. So far, the only
> other way has been to teach the JVM JIT compilers new instructions and
> introduce JVM intrinsics, which is quite a laborious task.
>
> In order to wire a machine code snippet to Java code, some implicit
> conventions about argument and return value mappings are needed (a calling
> convention). The method type of the corresponding method handle plays a
> central role here and provides a complete description of the wiring.
>
> The current prototype focuses on x64. The conventions are based on the
> System V AMD64 ABI (6 regs for int/pointer arguments + 8 regs for
> floating-point arguments), but with some adjustments (more on that later).
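
A minimal sketch of how a method type could be mapped onto the System V AMD64 argument registers, in plain Java. The class `SysVRegisterMapping` and its `assign` method are hypothetical names made up for illustration, not part of the prototype. The sketch ignores the prototype's adjustments (vector arguments, the return-value box) and simply rejects signatures that would need stack arguments, which is an assumption of mine rather than documented behavior.

```java
import java.lang.invoke.MethodType;
import java.util.ArrayList;
import java.util.List;

final class SysVRegisterMapping {
    // Argument registers defined by the System V AMD64 ABI.
    private static final String[] INT_REGS = {"rdi", "rsi", "rdx", "rcx", "r8", "r9"};
    private static final String[] FP_REGS =
            {"xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7"};

    // Hypothetical helper: returns one register name per parameter, in order.
    static List<String> assign(MethodType type) {
        List<String> regs = new ArrayList<>();
        int nextInt = 0;
        int nextFp = 0;
        for (Class<?> p : type.parameterList()) {
            if (p == float.class || p == double.class) {
                if (nextFp == FP_REGS.length) {
                    throw new IllegalArgumentException("too many floating-point arguments");
                }
                regs.add(FP_REGS[nextFp++]);
            } else {
                // Integers and object references (pointers) share the integer registers.
                if (nextInt == INT_REGS.length) {
                    throw new IllegalArgumentException("too many int/pointer arguments");
                }
                regs.add(INT_REGS[nextInt++]);
            }
        }
        return regs;
    }

    public static void main(String[] args) {
        MethodType t = MethodType.methodType(void.class,
                Object.class, long.class, Object.class, long.class);
        System.out.println(assign(t)); // [rdi, rsi, rdx, rcx]
    }
}
```

The two register counters advance independently, so a signature such as (long, double, long) would map to rdi, xmm0, rsi.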
>
> The implementation heavily relies on native method handles, but also
> introduces specific optimizations in C2 (e.g. snippet inlining).
>
>
> HOW TO USE
>
> A user has to provide (1) a machine code snippet and (2) a method type:
>
>    MethodHandle jdk.internal.panama.CodeSnippet.make(
>         String     name,
>         MethodType type,
>         boolean    isSupported,
>         int...     code)
>
> The framework wraps the machine code and returns a method handle of the
> requested type. The type is used to map Java arguments to registers.
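
The code is passed as `int...` rather than `byte...`; my guess (the post doesn't say) is that this lets values such as 0xC4 be written without casts, since they don't fit a signed Java byte literal. A sketch of how such varargs could be validated and packed into bytes follows. `SnippetBytes.toBytes` is a hypothetical helper, not the prototype's code.

```java
final class SnippetBytes {
    // Hypothetical helper: packs int-encoded machine code bytes into a byte[],
    // rejecting any value outside the unsigned byte range 0x00..0xFF.
    static byte[] toBytes(int... code) {
        byte[] out = new byte[code.length];
        for (int i = 0; i < code.length; i++) {
            if (code[i] < 0 || code[i] > 0xFF) {
                throw new IllegalArgumentException(
                        "not a byte value at index " + i + ": " + code[i]);
            }
            out[i] = (byte) code[i];
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] bytes = toBytes(0xC4, 0xE1, 0x7E, 0x6F, 0x04, 0x37);
        System.out.println(bytes.length); // 6
    }
}
```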
>
> The constructed method handle is fully functional and can be bound to an
> invokedynamic instruction or invoked using MH.invokeExact(), MH.invoke(),
> or MH.invokeWithArguments() methods.
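
For readers less familiar with java.lang.invoke, here is a sketch of the three invocation styles using only the standard API. Since CodeSnippet.make isn't available in a stock JDK, an ordinary static method (`add`, a stand-in I made up) plays the role of the snippet; the invocation rules are the same for any method handle.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

final class InvocationModes {
    // Stand-in for a snippet: a plain Java method of type (long,long)long.
    static long add(long a, long b) {
        return a + b;
    }

    static final MethodHandle ADD_MH;
    static {
        try {
            ADD_MH = MethodHandles.lookup().findStatic(InvocationModes.class, "add",
                    MethodType.methodType(long.class, long.class, long.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // invokeExact: the call site must match (long,long)long exactly.
    static long viaInvokeExact(long a, long b) {
        try {
            return (long) ADD_MH.invokeExact(a, b);
        } catch (Throwable e) {
            throw new Error(e);
        }
    }

    // invoke: arguments are adapted as if by asType(); here int is widened to long.
    static long viaInvoke(int a, int b) {
        try {
            return (long) ADD_MH.invoke(a, b);
        } catch (Throwable e) {
            throw new Error(e);
        }
    }

    // invokeWithArguments: boxed arguments in, boxed result out.
    static Object viaInvokeWithArguments(Object... args) {
        try {
            return ADD_MH.invokeWithArguments(args);
        } catch (Throwable e) {
            throw new Error(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(viaInvokeExact(2L, 3L));         // 5
        System.out.println(viaInvoke(2, 3));                // 5
        System.out.println(viaInvokeWithArguments(2L, 3L)); // 5
    }
}
```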
>
> For convenience, when isSupported=false is passed, the framework
> constructs a method handle which always throws an exception.
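
One way to build an always-throwing handle of an arbitrary type with the standard API is sketched below. This is my guess at the general shape, not the prototype's implementation; the class `UnsupportedSnippet`, the choice of UnsupportedOperationException, and the message text are all assumptions.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

final class UnsupportedSnippet {
    // Hypothetical helper: returns a handle of the requested type that ignores
    // its arguments and always throws UnsupportedOperationException.
    static MethodHandle alwaysThrows(String name, MethodType type) {
        // Handle of type (UnsupportedOperationException)R that throws its argument.
        MethodHandle thrower = MethodHandles.throwException(
                type.returnType(), UnsupportedOperationException.class);
        // Bind one exception instance up front; its stack trace therefore
        // records this construction site, not each later call site.
        MethodHandle bound = MethodHandles.insertArguments(thrower, 0,
                new UnsupportedOperationException("snippet not supported: " + name));
        // Prepend the snippet's parameters so the handle has the requested type.
        return MethodHandles.dropArguments(bound, 0, type.parameterList());
    }

    // Small helper for demonstration: reports what happened on invocation.
    static String tryInvoke(MethodHandle mh, Object... args) {
        try {
            mh.invokeWithArguments(args);
            return "returned";
        } catch (UnsupportedOperationException e) {
            return "unsupported";
        } catch (Throwable t) {
            return "other: " + t;
        }
    }

    public static void main(String[] args) {
        MethodType t = MethodType.methodType(void.class,
                Object.class, long.class, Object.class, long.class);
        MethodHandle mh = alwaysThrows("move256", t);
        System.out.println(tryInvoke(mh, new byte[32], 0L, new byte[32], 0L)); // unsupported
    }
}
```

The benefit of this shape is that callers can create and store the handle unconditionally and only fail when the snippet is actually invoked on unsupported hardware.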
>
> Also, for diagnostic purposes, it's possible to assign a name to the
> snippet.
>
> EXAMPLE
>
> 256-bit memory-to-memory move with double-register addressing mode:
>
>   MethodHandle mov256MH = CodeSnippet.make("move256",
>     MethodType.methodType(void.class,            // return type
>                           Object.class /*rdi*/,  // src
>                           long.class   /*rsi*/,  // offset
>                           Object.class /*rdx*/,  // dst
>                           long.class   /*rcx*/), // offset
>     CPUID.has(AVX),
>     0xC4, 0xE1, 0x7E, 0x6F, 0x04, 0x37,  // vmovdqu ymm0,[rsi+rdi]
>     0xC4, 0xE1, 0x7E, 0x7F, 0x04, 0x0A); // vmovdqu [rdx+rcx],ymm0
>
>
>   static void move256(Object src, long off1, Object dst, long off2) {
>       try {
>           mov256MH.invokeExact(src, off1, dst, off2);
>       } catch (Throwable e) {
>           throw new Error(e);
>       }
>   }
>
>   byte[] src = ...; long off1 = ...;
>   byte[] dst = ...; long off2 = ...;
>   move256(src, off1, dst, off2);
>
> Other examples: MachineCodeSnippetSamples [1], CPUID [2].
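
For testing a snippet like this, a plain-Java reference with the same observable effect is handy. A sketch follows, with two differences that I'm assuming are acceptable for a reference: offsets here are array indices rather than raw offsets from the object address (the snippet addresses memory as base + offset directly), and System.arraycopy performs bounds checks, which the two-instruction snippet does not. `Move256Reference` is a made-up name.

```java
final class Move256Reference {
    static final int VECTOR_BYTES = 32; // 256 bits

    // Copies 32 bytes from src[srcIndex..] to dst[dstIndex..].
    static void move256(byte[] src, int srcIndex, byte[] dst, int dstIndex) {
        System.arraycopy(src, srcIndex, dst, dstIndex, VECTOR_BYTES);
    }

    public static void main(String[] args) {
        byte[] src = new byte[64];
        for (int i = 0; i < src.length; i++) {
            src[i] = (byte) i;
        }
        byte[] dst = new byte[64];
        move256(src, 8, dst, 16);
        System.out.println(dst[16] + " " + dst[47]); // 8 39
    }
}
```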
>
>
> IMPLEMENTATION DETAILS
>
> There are 2 execution modes:
>   (1) non-optimized (called from interpreter/C1/non-constant MH in C2);
>
>   (2) optimized (a snippet is inlined by C2).
>
> The non-optimized case is supported by a stand-alone version generated
> on the fly as a native function.
>
> C2 has a choice: either issue a direct native call or try to inline the
> snippet.
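
Item (1) above mentions a "non-constant MH in C2". As far as I understand HotSpot, a handle read from a static final field is generally treated as a constant by the JIT, while a handle read from a mutable field is not, so only the former is a candidate for inlining. The sketch below shows the two shapes with a plain Java method (`twice`, a stand-in I made up) instead of a snippet. Both calls return the same result; the difference is only in what the compiler is able to do with them.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

final class ConstantHandleDemo {
    // Stand-in for a snippet.
    static int twice(int x) {
        return 2 * x;
    }

    static MethodHandle lookupTwice() {
        try {
            return MethodHandles.lookup().findStatic(ConstantHandleDemo.class, "twice",
                    MethodType.methodType(int.class, int.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // static final: the JIT can usually treat the handle as a constant.
    static final MethodHandle CONST_MH = lookupTwice();

    // Mutable field: the handle may change, so it is not a constant.
    static MethodHandle mutableMH = lookupTwice();

    static int callConstant(int x) {
        try {
            return (int) CONST_MH.invokeExact(x);
        } catch (Throwable e) {
            throw new Error(e);
        }
    }

    static int callMutable(int x) {
        try {
            return (int) mutableMH.invokeExact(x);
        } catch (Throwable e) {
            throw new Error(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(callConstant(21)); // 42
        System.out.println(callMutable(21));  // 42
    }
}
```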
>
>   Stand-alone snippet version for move256 (-XX:+PrintCodeSnippets):
>     Decoding code snippet "mov256" @ 0x1049bc460
>       0x1049bc460: push   %rbp
>       0x1049bc461: mov    %rsp,%rbp
>       0x1049bc464: vmovdqu (%rdi,%rsi,1),%ymm0
>       0x1049bc46a: vmovdqu %ymm0,(%rdx,%rcx,1)
>       0x1049bc470: leaveq
>       0x1049bc471: retq
>
>   C2 compiled version (w/ snippet inline):
>     # {method} {0x115c2f880} 'mov256' '(Ljava/lang/Object;JLjava/lang/Object;J)V'
>     # parm0:    rsi:rsi   = 'java/lang/Object'
>     # parm1:    rdx:rdx   = long
>     # parm2:    rcx:rcx   = 'java/lang/Object'
>     # parm3:    r8:r8     = long
>     #           [sp+0x20]  (sp of caller)
>
>     0x1051bd560: mov    %eax,-0x16000(%rsp)
>     0x1051bd567: push   %rbp
>     0x1051bd568: sub    $0x10,%rsp
>
>     0x1051bd56c: mov    %rsi,%rdi
>     0x1051bd56f: mov    %rdx,%rsi
>     0x1051bd572: mov    %rcx,%rdx
>     0x1051bd575: mov    %r8,%rcx
>
>     0x1051bd578: vmovdqu (%rdi,%rsi,1),%ymm0
>     0x1051bd57e: vmovdqu %ymm0,(%rdx,%rcx,1)
>
>     0x1051bd584: add    $0x10,%rsp
>     0x1051bd588: pop    %rbp
>     0x1051bd589: test   %eax,-0x4d3d58f(%rip)
>
>     0x1051bd58f: retq
>
>
> VECTOR SUPPORT
>
> Initially, my main motivation was to play with vector instructions and
> experiment with how to efficiently pass vector values around in Java and
> generated code.
>
> There are 3 new wrappers: java.lang.Long2, java.lang.Long4, and
> java.lang.Long8, which represent 128-, 256-, and 512-bit vector values
> respectively.
>
> When a j.l.Long2/4/8 argument is passed into the snippet, it is unboxed
> into a vector register (xmm*/ymm*/zmm*). (This means vector arguments clash
> with floating-point arguments, so the register allocator treats them as
> floating-point arguments.) If a vector value is returned, it is expected in
> xmm0/ymm0/zmm0 (depending on its size). It is automatically copied into a
> preallocated box, which is passed into the snippet as the first Object
> argument. The box for the return value is allocated implicitly on every
> invocation.
>
> Example: vpaddd instruction (MachineCodeSnippetSamples [1]):
>
>   MethodHandle vpaddMH = CodeSnippet.make("vpaddd",
>         MethodType.methodType(Long4.class, Long4.class, Long4.class),
>         requires(AVX),
>         0xC5, 0xF5, 0xFE, 0xC0); // vpaddd %ymm0, %ymm1, %ymm0
>
>   public static Long4 vadd(Long4 v1, Long4 v2) {
>       try {
>           return (Long4)vpaddMH.invokeExact(v1, v2);
>       } catch (Throwable e) {
>           throw new Error(e);
>       }
>   }
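
To make the lane semantics concrete: vpaddd on ymm registers adds eight independent 32-bit lanes, so a carry out of one lane never reaches its neighbor. Below is a scalar reference in plain Java. Since the Long4 API isn't shown in the post, the sketch uses a long[4] instead, and `VpadddReference` is a made-up name.

```java
final class VpadddReference {
    // Adds the two 32-bit lanes packed in each long independently,
    // with wraparound inside each lane and no carry between lanes.
    static long addLanes32(long a, long b) {
        int lo = (int) a + (int) b;
        int hi = (int) (a >>> 32) + (int) (b >>> 32);
        return ((long) hi << 32) | (lo & 0xFFFF_FFFFL);
    }

    // 256-bit version: four longs, i.e. eight 32-bit lanes.
    static long[] vpaddd256(long[] v1, long[] v2) {
        if (v1.length != 4 || v2.length != 4) {
            throw new IllegalArgumentException("expected 4 longs per 256-bit vector");
        }
        long[] result = new long[4];
        for (int i = 0; i < 4; i++) {
            result[i] = addLanes32(v1[i], v2[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        long a = 0x0000_0001_FFFF_FFFFL;
        long b = 0x0000_0001_0000_0001L;
        // The low lane wraps to 0 and the high lane becomes 2; a plain 64-bit
        // add would instead carry into the high half and give 0x300000000.
        System.out.println(Long.toHexString(addLanes32(a, b))); // 200000000
    }
}
```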
>
> Stand-alone snippet version:
>
>       0x1049bc4e0: push   %rbp
>       0x1049bc4e1: mov    %rsp,%rbp
>       0x1049bc4e4: vmovdqu 0x10(%rsi),%ymm0 ; unbox arguments
>       0x1049bc4e9: vmovdqu 0x10(%rdx),%ymm1
>
>       0x1049bc4ee: vpaddd %ymm0,%ymm1,%ymm0 ; snippet
>
>       0x1049bc4f2: mov    %rdi,%rax
>       0x1049bc4f5: vmovdqu %ymm0,0x10(%rax) ; box return value
>       0x1049bc4fa: leaveq
>       0x1049bc4fb: retq
>
> C2 compiled version (w/ snippet inline):
>
>       # {method} {0x115b2c0e8} 'vadd' '(Ljava/lang/Long4;Ljava/lang/Long4;)Ljava/lang/Long4;'
>       # parm0:    rsi:rsi   = 'java/lang/Long4'
>       # parm1:    rdx:rdx   = 'java/lang/Long4'
>       #           [sp+0x30]  (sp of caller)
>       [...]
>       0x1051c9a73: movabs $0x7c0011dc8,%rsi  ; {metadata('java/lang/Long4')}
>       0x1051c9a7d: nop
>       0x1051c9a7e: nop
>       0x1051c9a7f: nop
>       0x1051c9a80: vzeroupper
>       0x1051c9a83: callq  0x10516f260        ;   {runtime_call _new_instance_Java}
>
>       0x1051c9a88: mov    %rax,%rbx          ; unbox arguments
>       0x1051c9a8b: vmovdqu 0x10(%rbp),%ymm0  ;
>       0x1051c9a90: mov    (%rsp),%r10        ;
>       0x1051c9a94: vmovdqu 0x10(%r10),%ymm1  ;
>
>       0x1051c9a9a: vpaddd %ymm0,%ymm1,%ymm0  ; snippet
>
>       0x1051c9a9e: vmovdqu %ymm0,0x10(%rbx)  ; box return value
>       0x1051c9aa3: mov    %rbx,%rax
>
>       [...]
>
>       0x1051c9ab4: retq
>
> C2 aggressively tries to eliminate excessive boxing-unboxing when multiple
> snippets are used, so most of the allocations are actually eliminated and
> values are passed in vector registers:
>
>   Long4 testVAdd(Long4 v1, Long4 v2, Long4 v3) {
>       return vadd(vadd(v1, v2), v3);
>   }
>
>       # {method} {0x11532ca28} 'testVAdd' '(Ljava/lang/Long4;Ljava/lang/Long4;Ljava/lang/Long4;)Ljava/lang/Long4;' in 'Main'
>       # parm0:    rsi:rsi   = 'java/lang/Long4'
>       # parm1:    rdx:rdx   = 'java/lang/Long4'
>       # parm2:    rcx:rcx   = 'java/lang/Long4'
>       #           [sp+0x40]  (sp of caller)
>       [...]
>       0x1049d28ec: mov    %rcx,%rbp
>       0x1049d28ef: vmovdqu 0x10(%rdx),%ymm1
>       0x1049d28f4: vmovdqu 0x10(%rsi),%ymm0
>       0x1049d28f9: vpaddd %ymm0,%ymm1,%ymm0  ; snippet #1
>       0x1049d28fd: vmovdqu %ymm0,(%rsp)
>
>       0x1049d2902: movabs $0x7c0011dc8,%rsi  ; {metadata('java/lang/Long4')}
>       0x1049d290c: vzeroupper
>       0x1049d290f: callq  0x10496fae0        ;   {runtime_call _new_instance_Java}
>
>       0x1049d2914: mov    %rax,%rbx
>       0x1049d2917: vmovdqu 0x10(%rbp),%ymm1
>       0x1049d291c: vmovdqu (%rsp),%ymm0
>
>       0x1049d2921: vpaddd %ymm0,%ymm1,%ymm0  ; snippet #2
>
>       0x1049d2925: vmovdqu %ymm0,0x10(%rbx)
>
>       0x1049d292a: mov    %rbx,%rax
>
>       [...]
>
>       0x1049d293b: retq
>
> For a more complex example, see the vectorized hash code implementation
> (VectorizedHashCode [4]).
>
>
> IDEAS FOR FURTHER IMPROVEMENTS
>
> Keep in mind that the prototype is raw and, though it passes JPRT, no
> extensive testing has been done.
>
> There are some inefficiencies in generated code:
>
>   (1) too many spills of vector values;
> This is a direct consequence of the current representation of a code
> snippet as a native call at the IR level. It has an unfortunate effect on
> XMM registers, which aren't preserved across function calls. Of course, the
> CallSnippet node conventions can be adjusted so that some vector registers
> are callee-saved.
>
>   (2) C2 doesn't constant fold loads from value boxes;
> Constant vector boxes aren't constant folded and stay as embedded oops +
> vector loads on every usage. It would be nice to avoid loading values from
> memory and inject them directly into generated code.
>
> One of the root causes of (1) is that the input registers are fixed, so
> there's contention on them when multiple snippets are in play. Register
> renaming in the snippets would alleviate the problem, but requires the JVM
> to be able to parse and modify the machine code of the snippet when it is
> inlined.
>
> Also, the ability to describe machine code effects (e.g. memory state
> change) would allow more efficient code to be generated as well.
>
> Stay tuned! Thanks!
>
> Best regards,
> Vladimir Ivanov
>
>