JVM support for arbitrary machine code loading and execution
Vitaly Davidovich
vitalyd at gmail.com
Fri Dec 4 17:11:05 UTC 2015
Wow, that's awesome. Could something like this ever be exposed to user
code (java's version of inline asm)? (ducks under the desk :))
On Fri, Dec 4, 2015 at 12:04 PM, Vladimir Ivanov <
vladimir.x.ivanov at oracle.com> wrote:
> Hi,
>
> FYI I've been exploring machine code snippets recently:
>
> hs: http://hg.openjdk.java.net/panama/panama/hotspot/rev/cd901fd90596
> jdk: http://hg.openjdk.java.net/panama/panama/jdk/rev/091cdacf28e5
>
> Samples:
> [1] MachineCodeSnippetSamples
> http://hg.openjdk.java.net/panama/panama/jdk/file/091cdacf28e5/test/panama/snippets/MachineCodeSnippetSamples.java
>
> [2] CPUID
> http://hg.openjdk.java.net/panama/panama/jdk/file/091cdacf28e5/test/panama/snippets/CPUID.java
>
> [3] VectorUtils
> http://hg.openjdk.java.net/panama/panama/jdk/file/091cdacf28e5/test/panama/snippets/VectorUtils.java
>
> [4] VectorizedHashCode
> http://hg.openjdk.java.net/panama/panama/jdk/file/091cdacf28e5/test/panama/snippets/VectorizedHashCode.java
>
> The main assumption is that the JVM doesn't know anything about the machine
> code being loaded, but uses it "as is".
>
> It provides an easy route to experiment with hardware-specific instruction
> sequences without any JVM modifications. As an alternative, it is possible
> to write the desired code shape in native code and call it using JNI (or
> native method handles in Panama). Unfortunately, that doesn't work well for
> small sequences (up to about a dozen instructions long), where invocation
> and argument marshaling overhead dominates the actual work. So far, the
> only other way was to teach the JVM JIT compilers new instructions and
> introduce JVM intrinsics, which is quite a laborious task.
>
> In order to wire a machine code snippet to Java code, some implicit
> conventions about argument and return value mappings are needed (a calling
> convention). The method type of the corresponding method handle plays a
> central role here and provides a complete description of the wiring.
>
> Current prototype focuses on x64. The conventions are based on System V
> AMD64 ABI (6 regs for int/pointer arguments + 8 regs for floating-point
> arguments), but with some adjustments (more on it later).
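A minimal sketch of the base mapping described above, assuming plain System V AMD64 ordering and ignoring the adjustments mentioned in the mail (vector arguments, the return-value box). SysVRegisterMap and assign() are hypothetical names used only for illustration and are not part of the prototype.

```java
import java.lang.invoke.MethodType;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: illustrates the argument-to-register mapping only. */
final class SysVRegisterMap {
    // System V AMD64 integer/pointer argument registers, in order.
    private static final String[] INT_REGS = {"rdi", "rsi", "rdx", "rcx", "r8", "r9"};
    // System V AMD64 floating-point argument registers, in order.
    private static final String[] FP_REGS =
        {"xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7"};

    /** Returns one register name per parameter of the given method type. */
    static List<String> assign(MethodType type) {
        List<String> regs = new ArrayList<>();
        int nextInt = 0;
        int nextFp = 0;
        for (Class<?> p : type.parameterList()) {
            if (p == float.class || p == double.class) {
                if (nextFp == FP_REGS.length) {
                    throw new IllegalArgumentException("too many floating-point arguments");
                }
                regs.add(FP_REGS[nextFp++]);
            } else {
                // Everything else (int, long, references, ...) is treated as int/pointer.
                if (nextInt == INT_REGS.length) {
                    throw new IllegalArgumentException("too many int/pointer arguments");
                }
                regs.add(INT_REGS[nextInt++]);
            }
        }
        return regs;
    }

    public static void main(String[] args) {
        MethodType move256 = MethodType.methodType(
            void.class, Object.class, long.class, Object.class, long.class);
        System.out.println(assign(move256)); // prints [rdi, rsi, rdx, rcx]
    }
}
```

The two counters are independent, so an int argument following a double still gets the next free integer register. Arguments that would spill to the stack are simply rejected in this sketch.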
>
> The implementation heavily relies on native method handles, but also
> introduces specific optimizations in C2 (e.g. snippet inlining).
>
>
> HOW TO USE
>
> A user has to provide (1) machine code snippet; and (2) method type:
>
> MethodHandle jdk.internal.panama.CodeSnippet.make(
> String name,
> MethodType type,
> boolean isSupported,
> int... code)
>
> The framework wraps the machine code and returns a method handle of the
> requested type. The type is used to map Java arguments to registers.
>
> The constructed method handle is fully functional and can be bound to an
> invokedynamic instruction or invoked using MH.invokeExact(), MH.invoke(),
> or MH.invokeWithArguments().
>
> For convenience purposes, when isSupported=false is passed, the framework
> constructs a method handle which always throws an exception.
>
> Also, for diagnostic purposes, it's possible to assign a name to the
> snippet.
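The isSupported=false behaviour can be approximated with standard java.lang.invoke combinators. This is only a sketch: the mail doesn't say which exception the prototype throws or how it builds the handle, so UnsupportedOperationException, the class name UnsupportedSnippet, and alwaysThrows() are all assumptions.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

/** Hypothetical stand-in for the "unsupported snippet" case. */
final class UnsupportedSnippet {
    /** Builds a handle of exactly the requested type that throws on every call. */
    static MethodHandle alwaysThrows(String name, MethodType type) {
        // Handle of type (UnsupportedOperationException)R that throws its argument.
        MethodHandle thrower = MethodHandles.throwException(
            type.returnType(), UnsupportedOperationException.class);
        // Bind one pre-created exception instance, leaving a ()R handle.
        MethodHandle bound = MethodHandles.insertArguments(
            thrower, 0, new UnsupportedOperationException("snippet not supported: " + name));
        // Accept and ignore the declared arguments so the type matches exactly.
        return MethodHandles.dropArguments(bound, 0, type.parameterList());
    }

    /** Invokes such a handle with the move256 signature and reports whether it threw. */
    static boolean throwsWhenInvoked() {
        MethodType type = MethodType.methodType(
            void.class, Object.class, long.class, Object.class, long.class);
        MethodHandle mh = alwaysThrows("move256", type);
        try {
            mh.invokeExact((Object) new byte[32], 0L, (Object) new byte[32], 0L);
            return false;
        } catch (UnsupportedOperationException expected) {
            return true;
        } catch (Throwable other) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(throwsWhenInvoked());
    }
}
```

Because the returned handle has the same type as the real one would, call sites using invokeExact() need no changes; the feature check only has to happen once, where the handle is created.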
>
> EXAMPLE
>
> 256-bit memory-to-memory move with double-register addressing mode:
>
> MethodHandle mov256MH = CodeSnippet.make("move256",
> MethodType.methodType(void.class, // return type
> Object.class /*rdi*/, // src
> long.class /*rsi*/, // offset
> Object.class /*rdx*/, // dst
> long.class /*rcx*/), // offset
> CPUID.has(AVX),
> 0xC4, 0xE1, 0x7E, 0x6F, 0x04, 0x37, // vmovdqu ymm0,[rsi+rdi]
> 0xC4, 0xE1, 0x7E, 0x7F, 0x04, 0x0A); // vmovdqu [rdx+rcx],ymm0
>
>
> static void move256(Object src, long off1, Object dst, long off2) {
> try {
> mov256MH.invokeExact(src, off1, dst, off2);
> } catch (Throwable e) {
> throw new Error(e);
> }
> }
>
> byte[] src = ...; long off1 = ...;
> byte[] dst = ...; long off2 = ...;
> move256(src, off1, dst, off2);
>
> Other examples: MachineCodeSnippetSamples [1], CPUID [2].
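On the int... code parameter: Java bytes are signed, so literals such as 0xC4 would need a cast in a byte[] initializer, which is presumably why the opcodes are passed as ints. A minimal sketch of packing them into raw bytes follows. SnippetBytes, pack(), and hex() are hypothetical helpers, and the range check is an assumption, not something the mail describes.

```java
/** Hypothetical helper for turning int-encoded opcodes into raw bytes. */
final class SnippetBytes {
    /** Packs values in the range 0x00..0xFF into a byte array, rejecting anything else. */
    static byte[] pack(int... code) {
        byte[] out = new byte[code.length];
        for (int i = 0; i < code.length; i++) {
            if (code[i] < 0 || code[i] > 0xFF) {
                throw new IllegalArgumentException(
                    "not a byte value at index " + i + ": " + code[i]);
            }
            out[i] = (byte) code[i];
        }
        return out;
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // First instruction of the move256 example: vmovdqu ymm0,[rsi+rdi]
        byte[] load = pack(0xC4, 0xE1, 0x7E, 0x6F, 0x04, 0x37);
        System.out.println(hex(load)); // prints C4 E1 7E 6F 04 37
    }
}
```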
>
>
> IMPLEMENTATION DETAILS
>
> There are 2 execution modes:
> (1) non-optimized (called from interpreter/C1/non-constant MH in C2);
>
> (2) optimized (a snippet is inlined by C2).
>
> Non-optimized case is supported by a stand-alone version generated
> on-the-fly as a native function.
>
> C2 has a choice either to issue a direct native call or try to inline the
> snippet.
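In practice, "non-constant MH" usually comes down to where the handle is stored. HotSpot generally treats a static final MethodHandle as a constant and can inline through invokeExact(), while a handle read from an ordinary instance field is not known at compile time. The sketch below uses a plain handle to Math.addExact as a stand-in for a snippet handle; ConstantHandleDemo and its members are hypothetical names.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

/** Hypothetical demo: same target, held as a constant and as a non-constant. */
final class ConstantHandleDemo {
    // static final: the JIT can treat the handle as a compile-time constant.
    private static final MethodHandle CONSTANT_MH = lookupAdd();
    // Mutable instance field: the handle is not a constant for the JIT.
    private MethodHandle nonConstantMH = lookupAdd();

    private static MethodHandle lookupAdd() {
        try {
            return MethodHandles.lookup().findStatic(
                Math.class, "addExact",
                MethodType.methodType(long.class, long.class, long.class));
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    static long addViaConstant(long a, long b) {
        try {
            return (long) CONSTANT_MH.invokeExact(a, b);
        } catch (Throwable t) {
            throw new Error(t);
        }
    }

    long addViaField(long a, long b) {
        try {
            return (long) nonConstantMH.invokeExact(a, b);
        } catch (Throwable t) {
            throw new Error(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(addViaConstant(2, 3));
        System.out.println(new ConstantHandleDemo().addViaField(2, 3));
    }
}
```

Both paths compute the same result; the difference is only in what the compiler is able to inline.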
>
> Stand-alone snippet version for move256 (-XX:+PrintCodeSnippets):
> Decoding code snippet "mov256" @ 0x1049bc460
> 0x1049bc460: push %rbp
> 0x1049bc461: mov %rsp,%rbp
> 0x1049bc464: vmovdqu (%rdi,%rsi,1),%ymm0
> 0x1049bc46a: vmovdqu %ymm0,(%rdx,%rcx,1)
> 0x1049bc470: leaveq
> 0x1049bc471: retq
>
> C2 compiled version (w/ snippet inline):
> # {method} {0x115c2f880} 'mov256'
> '(Ljava/lang/Object;JLjava/lang/Object;J)V'
> # parm0: rsi:rsi = 'java/lang/Object'
> # parm1: rdx:rdx = long
> # parm2: rcx:rcx = 'java/lang/Object'
> # parm3: r8:r8 = long
> # [sp+0x20] (sp of caller)
>
> 0x1051bd560: mov %eax,-0x16000(%rsp)
> 0x1051bd567: push %rbp
> 0x1051bd568: sub $0x10,%rsp
>
> 0x1051bd56c: mov %rsi,%rdi
> 0x1051bd56f: mov %rdx,%rsi
> 0x1051bd572: mov %rcx,%rdx
> 0x1051bd575: mov %r8,%rcx
>
> 0x1051bd578: vmovdqu (%rdi,%rsi,1),%ymm0
> 0x1051bd57e: vmovdqu %ymm0,(%rdx,%rcx,1)
>
> 0x1051bd584: add $0x10,%rsp
> 0x1051bd588: pop %rbp
> 0x1051bd589: test %eax,-0x4d3d58f(%rip)
>
> 0x1051bd58f: retq
>
>
> VECTOR SUPPORT
>
> Initially, my main motivation was to play with vector instructions and
> experiment with how to efficiently pass vector values around in Java &
> generated code.
>
> There are 3 new wrappers: java.lang.Long2, java.lang.Long4, and
> java.lang.Long8, which represent 128-, 256-, and 512-bit vector values
> respectively.
>
> When a j.l.Long2/4/8 argument is passed into the snippet, it is unboxed
> into a vector register (xmm*/ymm*/zmm*). (This means they clash with
> floating-point arguments, so the register allocator treats them as
> floating-point arguments.) If a vector value is returned, it is expected
> in xmm0/ymm0/zmm0 (depending on its size). It is automatically copied into
> a preallocated box, which is passed as the first Object argument into the
> snippet. The box for the return value is allocated implicitly on every
> invocation.
>
> Example: vpaddd instruction (MachineCodeSnippetSamples [1]):
>
> MethodHandle vpaddMH = CodeSnippet.make("vpaddd",
> MethodType.methodType(Long4.class, Long4.class, Long4.class),
> requires(AVX),
> 0xC5, 0xF5, 0xFE, 0xC0); // vpaddd %ymm0, %ymm1, %ymm0
>
> public static Long4 vadd(Long4 v1, Long4 v2) {
> try {
> return (Long4)vpaddMH.invokeExact(v1, v2);
> } catch (Throwable e) {
> throw new Error(e);
> }
> }
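For checking results, it helps to have a scalar reference for what vpaddd computes on a 256-bit value: eight independent 32-bit additions, with no carry between lanes. The sketch below models the payload as a long[4]. That layout is an assumption, since the mail doesn't show Long4's fields, and Vpaddd256Reference is a hypothetical name.

```java
/** Hypothetical scalar reference for 256-bit vpaddd, modelled on long[4]. */
final class Vpaddd256Reference {
    /** Adds two 64-bit words as two independent 32-bit lanes (no cross-lane carry). */
    static long addLanes32(long a, long b) {
        // Low lane: keep only the low 32 bits of the sum, dropping any carry.
        long lo = (a + b) & 0xFFFFFFFFL;
        // High lane: add the upper halves; the shift discards overflow past bit 63.
        long hi = ((a >>> 32) + (b >>> 32)) << 32;
        return hi | lo;
    }

    /** Lane-wise 32-bit add over four 64-bit words (eight lanes in total). */
    static long[] vpaddd(long[] v1, long[] v2) {
        if (v1.length != 4 || v2.length != 4) {
            throw new IllegalArgumentException("expected 4 longs (256 bits) per operand");
        }
        long[] result = new long[4];
        for (int i = 0; i < 4; i++) {
            result[i] = addLanes32(v1[i], v2[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        // Low lane overflows and wraps to 0; a plain 64-bit add would carry into the high lane.
        long r = addLanes32(0x00000001_FFFFFFFFL, 0x00000001_00000001L);
        System.out.println(Long.toHexString(r)); // prints 200000000
    }
}
```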
>
> Stand-alone snippet version:
>
> 0x1049bc4e0: push %rbp
> 0x1049bc4e1: mov %rsp,%rbp
> 0x1049bc4e4: vmovdqu 0x10(%rsi),%ymm0 ; unbox arguments
> 0x1049bc4e9: vmovdqu 0x10(%rdx),%ymm1
>
> 0x1049bc4ee: vpaddd %ymm0,%ymm1,%ymm0 ; snippet
>
> 0x1049bc4f2: mov %rdi,%rax
> 0x1049bc4f5: vmovdqu %ymm0,0x10(%rax) ; box return value
> 0x1049bc4fa: leaveq
> 0x1049bc4fb: retq
>
> C2 compiled version (w/ snippet inline):
>
> # {method} {0x115b2c0e8} 'vadd'
> '(Ljava/lang/Long4;Ljava/lang/Long4;)Ljava/lang/Long4;'
> # parm0: rsi:rsi = 'java/lang/Long4'
> # parm1: rdx:rdx = 'java/lang/Long4'
> # [sp+0x30] (sp of caller)
> [...]
> 0x1051c9a73: movabs $0x7c0011dc8,%rsi ;
> {metadata('java/lang/Long4')}
> 0x1051c9a7d: nop
> 0x1051c9a7e: nop
> 0x1051c9a7f: nop
> 0x1051c9a80: vzeroupper
> 0x1051c9a83: callq 0x10516f260 ; {runtime_call
> _new_instance_Java}
>
> 0x1051c9a88: mov %rax,%rbx ; unbox arguments
> 0x1051c9a8b: vmovdqu 0x10(%rbp),%ymm0 ;
> 0x1051c9a90: mov (%rsp),%r10 ;
> 0x1051c9a94: vmovdqu 0x10(%r10),%ymm1 ;
>
> 0x1051c9a9a: vpaddd %ymm0,%ymm1,%ymm0 ; snippet
>
> 0x1051c9a9e: vmovdqu %ymm0,0x10(%rbx) ; box return value
> 0x1051c9aa3: mov %rbx,%rax
>
> [...]
>
> 0x1051c9ab4: retq
>
> C2 aggressively tries to eliminate excessive boxing-unboxing when multiple
> snippets are used, so most of the allocations are actually eliminated and
> values are passed in vector registers:
>
> Long4 testVAdd(Long4 v1, Long4 v2, Long4 v3) {
> return vadd(vadd(v1, v2), v3);
> }
>
> # {method} {0x11532ca28} 'testVAdd'
> '(Ljava/lang/Long4;Ljava/lang/Long4;Ljava/lang/Long4;)Ljava/lang/Long4;' in
> 'Main'
> # parm0: rsi:rsi = 'java/lang/Long4'
> # parm1: rdx:rdx = 'java/lang/Long4'
> # parm2: rcx:rcx = 'java/lang/Long4'
> # [sp+0x40] (sp of caller)
> [...]
> 0x1049d28ec: mov %rcx,%rbp
> 0x1049d28ef: vmovdqu 0x10(%rdx),%ymm1
> 0x1049d28f4: vmovdqu 0x10(%rsi),%ymm0
> 0x1049d28f9: vpaddd %ymm0,%ymm1,%ymm0 ; snippet #1
> 0x1049d28fd: vmovdqu %ymm0,(%rsp)
>
> 0x1049d2902: movabs $0x7c0011dc8,%rsi ;
> {metadata('java/lang/Long4')}
> 0x1049d290c: vzeroupper
> 0x1049d290f: callq 0x10496fae0 ; {runtime_call
> _new_instance_Java}
>
> 0x1049d2914: mov %rax,%rbx
> 0x1049d2917: vmovdqu 0x10(%rbp),%ymm1
> 0x1049d291c: vmovdqu (%rsp),%ymm0
>
> 0x1049d2921: vpaddd %ymm0,%ymm1,%ymm0 ; snippet #2
>
> 0x1049d2925: vmovdqu %ymm0,0x10(%rbx)
>
> 0x1049d292a: mov %rbx,%rax
>
> [...]
>
> 0x1049d293b: retq
>
> For a more complex example see vectorized hash implementation
> (VectorizedHashCode [4]).
>
>
> IDEAS FOR FURTHER IMPROVEMENTS
>
> Keep in mind that the prototype is raw and, though it passes JPRT, no
> extensive testing has been done.
>
> There are some inefficiencies in generated code:
>
> (1) too many spills of vector values;
> It is a direct consequence of the current representation of a code snippet
> as a native call at the IR level. It has an unfortunate effect on XMM
> registers, which aren't preserved across function calls. Of course,
> CallSnippet node conventions could be adjusted so that some vector
> registers are callee-saved.
>
> (2) C2 doesn't constant fold loads from value boxes;
> Constant vector boxes aren't constant folded and stay as embedded oops +
> vector loads on every usage. It would be nice to avoid loading values from
> memory and inject them directly into generated code.
>
> One of the root causes of (1) is that the input registers are fixed, so
> there's contention on them when multiple snippets are in play. Register
> renaming in the snippets would alleviate the problem, but requires the JVM
> to be able to parse and modify the machine code of a snippet when it is
> inlined.
>
> Also, the ability to describe machine code effects (e.g. memory state
> change) would allow more efficient code to be generated as well.
>
> Stay tuned! Thanks!
>
> Best regards,
> Vladimir Ivanov
>
>