JVM support for arbitrary machine code loading and execution
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Fri Dec 4 17:04:22 UTC 2015
Hi,
FYI I've been exploring machine code snippets recently:
hs: http://hg.openjdk.java.net/panama/panama/hotspot/rev/cd901fd90596
jdk: http://hg.openjdk.java.net/panama/panama/jdk/rev/091cdacf28e5
Samples:
[1] MachineCodeSnippetSamples
http://hg.openjdk.java.net/panama/panama/jdk/file/091cdacf28e5/test/panama/snippets/MachineCodeSnippetSamples.java
[2] CPUID
http://hg.openjdk.java.net/panama/panama/jdk/file/091cdacf28e5/test/panama/snippets/CPUID.java
[3] VectorUtils
http://hg.openjdk.java.net/panama/panama/jdk/file/091cdacf28e5/test/panama/snippets/VectorUtils.java
[4] VectorizedHashCode
http://hg.openjdk.java.net/panama/panama/jdk/file/091cdacf28e5/test/panama/snippets/VectorizedHashCode.java
The main assumption is that the JVM doesn't know anything about the
machine code being loaded, but uses it "as is".
It provides an easy route to experiment with hardware-specific
instruction sequences without any JVM modifications. As an alternative,
it is possible to write the desired code shape in native code and call
it using JNI (or native method handles in Panama). Unfortunately, that
doesn't work well for small sequences (up to about a dozen
instructions), where invocation and argument marshaling overhead
dominates the actual work. So far, the only other way has been to teach
the JVM JIT compilers new instructions and introduce JVM intrinsics,
which is quite a laborious task.
In order to wire a machine code snippet to Java code, some implicit
conventions about argument and return value mappings are needed
(a calling convention). The method type of the corresponding method
handle plays the central role here and provides a complete description
of the wiring.
The current prototype focuses on x64. The conventions are based on the
System V AMD64 ABI (6 regs for int/pointer arguments + 8 regs for
floating-point arguments), but with some adjustments (more on that
later).
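To make the mapping concrete, here is a minimal sketch of how a method type could be turned into a register assignment under the plain System V AMD64 rules. The class SnippetRegisters and its assign method are hypothetical and not part of the prototype, and the prototype's own adjustments (vector boxes, the return-value box) are deliberately not modeled.

```java
import java.lang.invoke.MethodType;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the prototype: plain System V AMD64
// argument classification, without the prototype's adjustments.
final class SnippetRegisters {
    private static final String[] INT_REGS =
        {"rdi", "rsi", "rdx", "rcx", "r8", "r9"};
    private static final String[] FP_REGS =
        {"xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7"};

    // Returns one register name per parameter, in declaration order.
    static List<String> assign(MethodType type) {
        List<String> regs = new ArrayList<>();
        int nextInt = 0;
        int nextFp = 0;
        for (Class<?> p : type.parameterList()) {
            boolean fp = (p == float.class || p == double.class);
            String[] bank = fp ? FP_REGS : INT_REGS;
            int idx = fp ? nextFp++ : nextInt++;
            if (idx >= bank.length) {
                // Stack-passed arguments are out of scope for this sketch.
                throw new IllegalArgumentException(
                    "out of registers for " + p.getName());
            }
            regs.add(bank[idx]);
        }
        return regs;
    }

    public static void main(String[] args) {
        MethodType t = MethodType.methodType(void.class,
            Object.class, long.class, Object.class, long.class);
        System.out.println(assign(t)); // prints [rdi, rsi, rdx, rcx]
    }
}
```

Integer and floating-point arguments are counted independently, so a (long, double, float) signature maps to rdi, xmm0, xmm1.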
The implementation heavily relies on native method handles, but also
introduces specific optimizations in C2 (e.g. snippet inlining).
HOW TO USE
A user has to provide (1) machine code snippet; and (2) method type:
    MethodHandle jdk.internal.panama.CodeSnippet.make(
        String name,
        MethodType type,
        boolean isSupported,
        int... code)
The framework wraps the machine code and returns a method handle of the
requested type. The type is used to map Java arguments to registers.
The constructed method handle is fully functional and can be bound to
an invokedynamic instruction or invoked using the MH.invokeExact(),
MH.invoke(), or MH.invokeWithArguments() methods.
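The three invocation styles differ in how strictly the call site has to match the handle's type. The sketch below uses a handle to Math.addExact(long, long) as a stand-in, since CodeSnippet only exists in the Panama prototype; the class name InvocationStyles is hypothetical. A snippet handle of type (long,long)long would be called the same way.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

final class InvocationStyles {
    // Stand-in for a snippet handle of type (long,long)long.
    static final MethodHandle ADD;
    static {
        try {
            ADD = MethodHandles.lookup().findStatic(Math.class, "addExact",
                MethodType.methodType(long.class, long.class, long.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // invokeExact: the call site must match (long,long)long exactly,
    // including the cast on the return value.
    static long exact(long a, long b) {
        try {
            return (long) ADD.invokeExact(a, b);
        } catch (Throwable e) {
            throw new Error(e);
        }
    }

    // invoke: arguments are adapted as if by asType (here int -> long).
    static long generic(int a, int b) {
        try {
            return (long) ADD.invoke(a, b);
        } catch (Throwable e) {
            throw new Error(e);
        }
    }

    // invokeWithArguments: fully boxed, convenient for reflective callers.
    static Object withArguments(Object... args) {
        try {
            return ADD.invokeWithArguments(args);
        } catch (Throwable e) {
            throw new Error(e);
        }
    }
}
```

For short snippets invokeExact is the natural choice, because it involves no argument adaptation at the call site.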
For convenience, when isSupported=false is passed, the framework
constructs a method handle which always throws an exception. Also, for
diagnostic purposes, it's possible to assign a name to the snippet.
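A handle with this "always throws" behavior can be built from standard java.lang.invoke combinators. The following is only a sketch of the idea: the helper name unsupported, the choice of UnsupportedOperationException, and the message text are assumptions, not what the prototype actually does.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

// Hypothetical helper: the exception type and message are assumptions.
final class UnsupportedSnippet {
    static MethodHandle unsupported(String name, MethodType type) {
        try {
            // (String)UnsupportedOperationException, then bind the message.
            MethodHandle ctor = MethodHandles.lookup().findConstructor(
                UnsupportedOperationException.class,
                MethodType.methodType(void.class, String.class));
            MethodHandle newEx = MethodHandles.insertArguments(
                ctor, 0, "snippet not supported: " + name);

            // (UnsupportedOperationException)R, always throws its argument.
            MethodHandle thrower = MethodHandles.throwException(
                type.returnType(), UnsupportedOperationException.class);

            // ()R: creates a fresh exception on every call and throws it.
            MethodHandle throwNew =
                MethodHandles.collectArguments(thrower, 0, newEx);

            // Accept and ignore the snippet's declared parameters.
            return MethodHandles.dropArguments(
                throwNew, 0, type.parameterList());
        } catch (ReflectiveOperationException e) {
            throw new Error(e);
        }
    }
}
```

The returned handle has exactly the requested type, so call sites compile and link the same way whether or not the hardware feature is present.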
EXAMPLE
256-bit memory-to-memory move with double-register addressing mode:
    MethodHandle mov256MH = CodeSnippet.make("move256",
        MethodType.methodType(void.class,          // return type
                              Object.class /*rdi*/, // src
                              long.class /*rsi*/,   // offset
                              Object.class /*rdx*/, // dst
                              long.class /*rcx*/),  // offset
        CPUID.has(AVX),
        0xC4, 0xE1, 0x7E, 0x6F, 0x04, 0x37,  // vmovdqu ymm0,[rsi+rdi]
        0xC4, 0xE1, 0x7E, 0x7F, 0x04, 0x0A); // vmovdqu [rdx+rcx],ymm0

    static void move256(Object src, long off1, Object dst, long off2) {
        try {
            mov256MH.invokeExact(src, off1, dst, off2);
        } catch (Throwable e) {
            throw new Error(e);
        }
    }

    byte[] src = ...; long off1 = ...;
    byte[] dst = ...; long off2 = ...;
    move256(src, off1, dst, off2);
Other examples: MachineCodeSnippetSamples [1], CPUID [2].
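The snippet bytes are passed as int... rather than byte..., presumably so that values above 0x7F (such as 0xC4) can be written without casts. A small sketch of the narrowing step such an API needs is shown below; SnippetBytes, pack, and hex are hypothetical names and are not taken from the prototype.

```java
// Hypothetical helper, not part of the prototype.
final class SnippetBytes {
    // Narrows each int to a byte, rejecting values outside 0x00..0xFF.
    static byte[] pack(int... code) {
        byte[] out = new byte[code.length];
        for (int i = 0; i < code.length; i++) {
            if (code[i] < 0 || code[i] > 0xFF) {
                throw new IllegalArgumentException(
                    "not a byte at index " + i + ": " + code[i]);
            }
            out[i] = (byte) code[i];
        }
        return out;
    }

    // Formats bytes as space-separated upper-case hex, for diagnostics.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // First instruction of the move256 example above.
        byte[] code = pack(0xC4, 0xE1, 0x7E, 0x6F, 0x04, 0x37);
        System.out.println(hex(code)); // prints C4 E1 7E 6F 04 37
    }
}
```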
IMPLEMENTATION DETAILS
There are 2 execution modes:
(1) non-optimized (called from interpreter/C1/non-constant MH in C2);
(2) optimized (a snippet is inlined by C2).
Non-optimized case is supported by a stand-alone version generated
on-the-fly as a native function.
C2 has a choice either to issue a direct native call or try to inline
the snippet.
Stand-alone snippet version for move256 (-XX:+PrintCodeSnippets):
Decoding code snippet "mov256" @ 0x1049bc460
0x1049bc460: push %rbp
0x1049bc461: mov %rsp,%rbp
0x1049bc464: vmovdqu (%rdi,%rsi,1),%ymm0
0x1049bc46a: vmovdqu %ymm0,(%rdx,%rcx,1)
0x1049bc470: leaveq
0x1049bc471: retq
C2 compiled version (w/ snippet inline):
# {method} {0x115c2f880} 'mov256'
'(Ljava/lang/Object;JLjava/lang/Object;J)V'
# parm0: rsi:rsi = 'java/lang/Object'
# parm1: rdx:rdx = long
# parm2: rcx:rcx = 'java/lang/Object'
# parm3: r8:r8 = long
# [sp+0x20] (sp of caller)
0x1051bd560: mov %eax,-0x16000(%rsp)
0x1051bd567: push %rbp
0x1051bd568: sub $0x10,%rsp
0x1051bd56c: mov %rsi,%rdi
0x1051bd56f: mov %rdx,%rsi
0x1051bd572: mov %rcx,%rdx
0x1051bd575: mov %r8,%rcx
0x1051bd578: vmovdqu (%rdi,%rsi,1),%ymm0
0x1051bd57e: vmovdqu %ymm0,(%rdx,%rcx,1)
0x1051bd584: add $0x10,%rsp
0x1051bd588: pop %rbp
0x1051bd589: test %eax,-0x4d3d58f(%rip)
0x1051bd58f: retq
VECTOR SUPPORT
Initially, my main motivation was to play with vector instructions and
experiment with how to efficiently pass vector values around in Java and
generated code.
There are 3 new wrappers: java.lang.Long2, java.lang.Long4, and
java.lang.Long8, which represent 128-, 256-, and 512-bit vector values
respectively.
When a j.l.Long2/4/8 argument is passed into the snippet, it is unboxed
into a vector register (xmm*/ymm*/zmm*). (This means vector arguments
compete with floating-point arguments for the same registers, so the
register allocator treats them as floating-point arguments.) If a vector
value is returned, it is expected in xmm0/ymm0/zmm0 (depending on its
size). It is automatically copied into a preallocated box, which is
passed as the first Object argument into the snippet. The box for the
return value is allocated implicitly on every invocation.
Example: vpaddd instruction (MachineCodeSnippetSamples [1]):
    MethodHandle vpaddMH = CodeSnippet.make("vpaddd",
        MethodType.methodType(Long4.class, Long4.class, Long4.class),
        requires(AVX),
        0xC5, 0xF5, 0xFE, 0xC0); // vpaddd %ymm0, %ymm1, %ymm0

    public static Long4 vadd(Long4 v1, Long4 v2) {
        try {
            return (Long4) vpaddMH.invokeExact(v1, v2);
        } catch (Throwable e) {
            throw new Error(e);
        }
    }
Stand-alone snippet version:
0x1049bc4e0: push %rbp
0x1049bc4e1: mov %rsp,%rbp
0x1049bc4e4: vmovdqu 0x10(%rsi),%ymm0 ; unbox arguments
0x1049bc4e9: vmovdqu 0x10(%rdx),%ymm1
0x1049bc4ee: vpaddd %ymm0,%ymm1,%ymm0 ; snippet
0x1049bc4f2: mov %rdi,%rax
0x1049bc4f5: vmovdqu %ymm0,0x10(%rax) ; box return value
0x1049bc4fa: leaveq
0x1049bc4fb: retq
C2 compiled version (w/ snippet inline):
# {method} {0x115b2c0e8} 'vadd'
'(Ljava/lang/Long4;Ljava/lang/Long4;)Ljava/lang/Long4;'
# parm0: rsi:rsi = 'java/lang/Long4'
# parm1: rdx:rdx = 'java/lang/Long4'
# [sp+0x30] (sp of caller)
[...]
0x1051c9a73: movabs $0x7c0011dc8,%rsi ;
{metadata('java/lang/Long4')}
0x1051c9a7d: nop
0x1051c9a7e: nop
0x1051c9a7f: nop
0x1051c9a80: vzeroupper
0x1051c9a83: callq 0x10516f260 ; {runtime_call
_new_instance_Java}
0x1051c9a88: mov %rax,%rbx ; unbox arguments
0x1051c9a8b: vmovdqu 0x10(%rbp),%ymm0 ;
0x1051c9a90: mov (%rsp),%r10 ;
0x1051c9a94: vmovdqu 0x10(%r10),%ymm1 ;
0x1051c9a9a: vpaddd %ymm0,%ymm1,%ymm0 ; snippet
0x1051c9a9e: vmovdqu %ymm0,0x10(%rbx) ; box return value
0x1051c9aa3: mov %rbx,%rax
[...]
0x1051c9ab4: retq
C2 aggressively tries to eliminate excessive boxing-unboxing when
multiple snippets are used, so most of the allocations are actually
eliminated and values are passed in vector registers:
    Long4 testVAdd(Long4 v1, Long4 v2, Long4 v3) {
        return vadd(vadd(v1, v2), v3);
    }
# {method} {0x11532ca28} 'testVAdd'
'(Ljava/lang/Long4;Ljava/lang/Long4;Ljava/lang/Long4;)Ljava/lang/Long4;'
in 'Main'
# parm0: rsi:rsi = 'java/lang/Long4'
# parm1: rdx:rdx = 'java/lang/Long4'
# parm2: rcx:rcx = 'java/lang/Long4'
# [sp+0x40] (sp of caller)
[...]
0x1049d28ec: mov %rcx,%rbp
0x1049d28ef: vmovdqu 0x10(%rdx),%ymm1
0x1049d28f4: vmovdqu 0x10(%rsi),%ymm0
0x1049d28f9: vpaddd %ymm0,%ymm1,%ymm0 ; snippet #1
0x1049d28fd: vmovdqu %ymm0,(%rsp)
0x1049d2902: movabs $0x7c0011dc8,%rsi ;
{metadata('java/lang/Long4')}
0x1049d290c: vzeroupper
0x1049d290f: callq 0x10496fae0 ; {runtime_call
_new_instance_Java}
0x1049d2914: mov %rax,%rbx
0x1049d2917: vmovdqu 0x10(%rbp),%ymm1
0x1049d291c: vmovdqu (%rsp),%ymm0
0x1049d2921: vpaddd %ymm0,%ymm1,%ymm0 ; snippet #2
0x1049d2925: vmovdqu %ymm0,0x10(%rbx)
0x1049d292a: mov %rbx,%rax
[...]
0x1049d293b: retq
For a more complex example see vectorized hash implementation
(VectorizedHashCode [4]).
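As a cross-check for the vpaddd examples in this section, a scalar reference in plain Java can be handy when validating a snippet's results. The sketch below models a 256-bit value as four longs and adds eight 32-bit lanes with wraparound, which is what vpaddd computes. The class Long4Ref and its field names are hypothetical; the actual layout of java.lang.Long4 in the prototype is not assumed here.

```java
// Hypothetical scalar model of a 256-bit vector: four 64-bit words,
// each holding two 32-bit lanes. Not the prototype's Long4 class.
final class Long4Ref {
    final long l1, l2, l3, l4;

    Long4Ref(long l1, long l2, long l3, long l4) {
        this.l1 = l1;
        this.l2 = l2;
        this.l3 = l3;
        this.l4 = l4;
    }

    // Adds the two 32-bit lanes of a and b independently; a carry out
    // of the low lane must not leak into the high lane.
    static long addLanes32(long a, long b) {
        long lo = (a + b) & 0xFFFFFFFFL;
        long hi = ((a >>> 32) + (b >>> 32)) << 32;
        return hi | lo;
    }

    // Scalar equivalent of the vadd snippet wrapper.
    static Long4Ref vadd(Long4Ref a, Long4Ref b) {
        return new Long4Ref(addLanes32(a.l1, b.l1), addLanes32(a.l2, b.l2),
                            addLanes32(a.l3, b.l3), addLanes32(a.l4, b.l4));
    }

    // Scalar equivalent of testVAdd: two chained additions.
    static Long4Ref testVAdd(Long4Ref v1, Long4Ref v2, Long4Ref v3) {
        return vadd(vadd(v1, v2), v3);
    }
}
```

The lane-wise form matters: adding 0x00000001FFFFFFFF and 0x0000000100000001 as 32-bit lanes gives 0x0000000200000000, whereas a plain 64-bit add would give 0x0000000300000000.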
IDEAS FOR FURTHER IMPROVEMENTS
Keep in mind that the prototype is raw and, though it passes JPRT, no
extensive testing has been done.
There are some inefficiencies in generated code:
(1) too many spills of vector values;
It is a direct consequence of the current representation of a code
snippet as a native call at the IR level. This has an unfortunate effect
on XMM registers, which aren't preserved across function calls. Of
course, CallSnippet node conventions can be adjusted so that some vector
registers become callee-saved.
(2) C2 doesn't constant fold loads from value boxes;
Constant vector boxes aren't constant folded and stay as embedded oops +
vector loads on every usage. It would be nice to avoid loading values
from memory and inject them directly into generated code.
One of the root causes of (1) is that the input registers are fixed, so
there's contention on them when multiple snippets are in play. Register
renaming in the snippets would alleviate the problem, but it requires
the JVM to be able to parse and modify the machine code of the snippet
when it is inlined.
Also, the ability to describe machine code effects (e.g. memory state
changes) would allow more efficient code to be generated.
Stay tuned! Thanks!
Best regards,
Vladimir Ivanov