implementation of Vector API and associated lambdas
Graves, Ian L
ian.l.graves at intel.com
Fri Feb 5 01:06:07 UTC 2016
Minor follow-on: is the Patchable Code Snippet code available anywhere we can get to it yet?
Thanks!
Ian
-----Original Message-----
From: Vladimir Ivanov [mailto:vladimir.x.ivanov at oracle.com]
Sent: Wednesday, February 03, 2016 6:24 AM
To: Graves, Ian L <ian.l.graves at intel.com>; panama-dev at openjdk.java.net
Subject: Re: implementation of Vector API and associated lambdas
Ian,
>> So, the question is how to map operations on elements to platform-specific instructions. VectorOp seems a viable solution to reliably detect the intended behavior and use appropriate instruction from the available repertoire.
>
> Just so we're clear, you're talking about VectorOp as it appears here[1]? John mentioned strategies that included what's in VectorOp at the face-to-face a couple of weeks ago. I think a symbolic representation could get us where we want to be for now.
Yes, we are talking about the same VectorOp class.
>> But I haven't thought much about how to implement it. Either introduce special API entry points which consume VectorOp.Op (simplest?) or produce specialized lambdas for VectorOp.Op on per-vector species basis (more optimizable?).
>
> Having an interpreter over VectorOp that can produce specialized lambdas would be really cool. We could potentially suss out some graceful degradation in cases where certain vector widths are not supported in this phase (or fall back to pure Java where we need to). Just a thought.
It looks promising, and it would be great to experiment with that. So far I don't see any problems, but we should ensure it plays well with inlining.
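To make the "interpreter over VectorOp that produces specialized lambdas" idea concrete, here is a minimal plain-Java sketch. The Op enum and specialize method are hypothetical stand-ins, not the real VectorOp API; in the prototype, specialize would bind a platform code snippet rather than a scalar lambda, and unsupported ops or widths would fall back to pure Java.

```java
import java.util.function.LongBinaryOperator;

// Hypothetical sketch (not the real VectorOp API): a symbolic op is
// interpreted once into a specialized lambda. The real prototype would
// bind a platform-specific code snippet here instead of a scalar lambda.
public class OpInterpreter {
    enum Op { ADD, XOR }

    // Produce a specialized operator for the requested symbolic op.
    static LongBinaryOperator specialize(Op op) {
        switch (op) {
            case ADD: return (a, b) -> a + b;   // would map to e.g. paddq
            case XOR: return (a, b) -> a ^ b;   // would map to e.g. pxor
            default:  throw new AssertionError(op);
        }
    }

    public static long apply(Op op, long a, long b) {
        return specialize(op).applyAsLong(a, b);
    }
}
```

The point of interest for inlining is that specialize is evaluated once per op, so the returned lambda (or bound snippet) is a constant the JIT can inline through.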
>> Unfortunately, there's no better way to represent vectors at the bytecode level than Long2/4/8 value-based wrappers. The JVM simply doesn't have int128_t/int256_t/int512_t-equivalent primitive support. So, boxing (in some quantity) is inevitable for now. But C2 support for eliminating redundant boxing/unboxing should alleviate most of the performance problems. And the larger the Vector API pipelines are, the less boxing/unboxing should happen in the optimized code.
>
> This is true. Boxing overhead is going to be present, and that's not an issue per se; my question concerns more of a pathological case. Say you have the following code snippet (from VectorUtils):
>
> static MethodHandle make_m128_pxor() {
>     // MT (Object,L2,L2)L2
>     return CodeSnippet.make("mm128_pxor",
>             MT_L2_BINARY, requires(SSE2),
>             0x66, 0x0F, 0xEF, 0xC1); // pxor xmm0,xmm1
> }
>
> static final MethodHandle MHm128_pxor = make_m128_pxor();
>
> public static Long2 xor(Long2 v1, Long2 v2) {
>     .....
>     return (Long2)MHm128_pxor.invokeExact(v1, v2);
>     .....
> }
>
> This generates a snippet like so:
>
> Decoding code snippet "mm128_pxor" @ 0x00007f34a11148e0
> 0x00007f34a11148e0: push %rbp
> 0x00007f34a11148e1: mov %rsp,%rbp
> 0x00007f34a11148e4: vmovdqu 0x10(%rsi),%xmm0
> 0x00007f34a11148e9: vmovdqu 0x10(%rdx),%xmm1
> 0x00007f34a11148ee: pxor %xmm1,%xmm0
> 0x00007f34a11148f2: mov %rdi,%rax
> 0x00007f34a11148f5: vmovdqu %xmm0,0x10(%rax)
> 0x00007f34a11148fa: leaveq
> 0x00007f34a11148fb: retq
> 0x00007f34a11148fc: hlt
> 0x00007f34a11148fd: hlt
> 0x00007f34a11148fe: hlt
> 0x00007f34a11148ff: hlt
To be clear, this is the stand-alone version, which is used from the interpreter and from non-optimized generated code. The snippet itself is just the "pxor %xmm1,%xmm0" instruction; everything else is code to comply with the calling conventions. It is called through MH.linkToNative and wrapped into a NativeMethodHandle.
> Say the programmer decides to compose calls to the handle:
>
> xor(xor(xor(xor(xor(some_long4)))));
>
> Ideal inlining would only un/box once at the boundaries of the snippet (barring spillage). The most non-ideal case is when all of the boxing lives on through the inlining. My question is: how much fixup/boxing removal does C2 already do, and how much does it need to do to approach the ideal?
C2 already produces decent code for such a code shape [1], given that the xor calls are inlined. I did some work to extend box/unbox elimination for value wrappers as part of the original code snippet prototype [3].
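For illustration, the nested-xor shape in question can be sketched in plain Java. MiniLong2 below is a made-up stand-in for java.lang.Long2 (the real class holds 128 bits); the source allocates one wrapper per xor call, and the ideal optimization keeps only the outermost allocation alive.

```java
// Illustrative only: MiniLong2 is a hypothetical stand-in for
// java.lang.Long2. Each xor call boxes its result; in the nested
// composition only the last box escapes, so ideally box elimination
// removes all the intermediate allocations.
public final class MiniLong2 {
    public final long lo, hi;
    public MiniLong2(long lo, long hi) { this.lo = lo; this.hi = hi; }

    public static MiniLong2 xor(MiniLong2 v1, MiniLong2 v2) {
        return new MiniLong2(v1.lo ^ v2.lo, v1.hi ^ v2.hi); // boxes the result
    }

    public static MiniLong2 nestedXor(MiniLong2 a, MiniLong2 b) {
        // Four intermediate boxes in the source; only the outermost escapes.
        return xor(xor(xor(xor(a, b), b), b), b);
    }
}
```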
[1]
http://cr.openjdk.java.net/~vlivanov/panama/code_snippets/vpxor/testNestedXor.txt
000 # stack bang (688 bytes)
pushq rbp # Save rbp
subq rsp, #48 # Create frame
00c movq RBX, RDX # spill
00f movdqu XMM1,[RDX + #16 (8-bit)] ! load vector (16 bytes) !
Field: java/lang/Long2.l1
014 movdqu XMM0,[RSI + #16 (8-bit)] ! load vector (16 bytes) !
Field: java/lang/Long2.l1
019 code_snippet snippet 4
01d movdqu [rsp + 0],XMM0 # spill
022 movdqu XMM0,[RBX + #16 (8-bit)] ! load vector (16 bytes) !
Field: java/lang/Long2.l1
027 movdqu XMM1,[rsp + 0] # spill
02c code_snippet snippet 4
030 movdqu [rsp + 16],XMM0 # spill
036 movdqu XMM0,[rsp + 0] # spill
03b movdqu XMM1,[rsp + 16] # spill
041 code_snippet snippet 4
045 movdqu [rsp + 0],XMM0 # spill
04a movdqu XMM0,[rsp + 16] # spill
050 movdqu XMM1,[rsp + 0] # spill
055 code_snippet snippet 4
059 movdqu [rsp + 16],XMM0 # spill
05f movq RSI, precise klass java/lang/Long2:
0x00007fe194916f50:Constant:exact * # ptr
nop # 2 bytes pad for loops and calls
06b call,static wrapper for: _new_instance_Java
070 movq RBX, RAX # spill
073 # checkcastPP of RBX
073 movdqu XMM0,[rsp + 0] # spill
078 movdqu XMM1,[rsp + 16] # spill
07e code_snippet snippet 4
082 movdqu [RBX + #16 (8-bit)],XMM0 ! store vector (16 bytes) !
Field: java/lang/Long2.l1
087 movq RAX, RBX # spill
08a addq rsp, 48 # Destroy frame
popq rbp
testl rax, [rip + #offset_to_poll_page] # Safepoint: poll for GC
095 ret
As you can see, only one box is allocated, for the final result. But there are many spills, which are caused by the fixed calling conventions for the snippet.
If a register-allocator-friendly snippet is used [2], most of the spills are gone.
[2]
http://cr.openjdk.java.net/~vlivanov/panama/code_snippets/vpxor/testNestedXorRA.txt
000 # stack bang (688 bytes)
pushq rbp # Save rbp
subq rsp, #48 # Create frame
00c movq RBX, RDX # spill
00f movdqu XMM0,[RDX + #16 (8-bit)] ! load vector (16 bytes) !
Field: java/lang/Long2.l1
014 movdqu XMM1,[RSI + #16 (8-bit)] ! load vector (16 bytes) !
Field: java/lang/Long2.l1
019 code_snippet snippet 4
01d movdqu XMM1,[RBX + #16 (8-bit)] ! load vector (16 bytes) !
Field: java/lang/Long2.l1
022 code_snippet snippet 4
026 code_snippet snippet 4
02a code_snippet snippet 4
02e movdqu [rsp + 16],XMM1 # spill
034 movdqu [rsp + 0],XMM0 # spill
039 movq RSI, precise klass java/lang/Long2:
0x00007fdb2a8aa750:Constant:exact * # ptr
043 call,static wrapper for: _new_instance_Java
048 movq RBX, RAX # spill
04b # checkcastPP of RBX
04b movdqu XMM0,[rsp + 0] # spill
050 movdqu XMM1,[rsp + 16] # spill
056 code_snippet snippet 4
05a movdqu [RBX + #16 (8-bit)],XMM0 ! store vector (16 bytes) !
Field: java/lang/Long2.l1
05f movq RAX, RBX # spill
062 addq rsp, 48 # Destroy frame
popq rbp
testl rax, [rip + #offset_to_poll_page] # Safepoint: poll for GC
06d ret
There are still some inefficiencies left.
For example, the spills around the runtime call for new-instance allocation. It would be better to move the allocation either up to the very beginning or down past the last snippet, where all the intermediate values are already dead. In that respect, any call is problematic. Safepoints aren't that bad (for non-escaping values EA is able to eliminate the allocation, and the box can be materialized during deoptimization), but there are still cases when safepoints hinder box elimination.
Also, vector box elimination doesn't play well with loops. For example, VectorizedHashCode::vectorized_hash [4] has a loop with a vector accumulator (Long4 acc). C2 eliminates all boxing inside VHC::vhash8, so you might expect the accumulator not to be boxed either, but it is, and the boxing happens on every iteration.
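The problematic loop shape can be sketched as follows. MiniVec is a hypothetical scalar stand-in for Long4; the point is that the boxed accumulator is re-allocated on every back-edge, which is exactly the allocation C2 currently fails to eliminate.

```java
// Sketch of the problematic shape: a loop whose accumulator is a boxed
// vector value. MiniVec is a hypothetical stand-in for Long4; each
// iteration allocates a fresh box for acc, and that allocation is not
// hoisted or eliminated.
public final class LoopBoxing {
    static final class MiniVec {
        final long v;
        MiniVec(long v) { this.v = v; }
    }

    static MiniVec step(MiniVec acc, long x) {
        return new MiniVec(acc.v * 31 + x);  // fresh box every iteration
    }

    public static long hash(long[] data) {
        MiniVec acc = new MiniVec(0);
        for (long x : data) {
            acc = step(acc, x);   // box lives across the loop back-edge
        }
        return acc.v;
    }
}
```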
>> Regarding Stream API, it doesn't look too attractive except for interoperability. Stream fusion (perform the pipeline in a single pass) doesn't help vectors: it is better to apply the pipeline step-by-step on the whole vector than splitting the vector into pieces.
>> What does look viable is to add Vector API-directed optimizations to Stream API implementation - take {Int,Long,Double}Stream + VectorOp-based operations and using vector-aware spliterator translate a pipeline on primitives into a pipeline on vectors.
>
> I agree. I brought up the Stream API because I think we could consider the composition of vector operations to be similar to something like stream fusion. When composing operations together, we're doing a kind of fusion (combining snippets together inside an un/boxing boundary), but we have to pay attention to register allocation, or at least do some bookkeeping for our data flow. It's an approach that could be an alternative to, or in support of, a beefed-up fixup pass in C2. I'm not sure whether one way is more or less desirable.
As you can see from the examples, C2 is already powerful enough to produce pretty dense code for nested vector operations. I don't see any need for additional work to fuse the operations at the library level. Except inlining, of course.
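As an aside, the "translate a pipeline on primitives into a pipeline on vectors" idea mentioned above could be sketched in plain Java like this. The 8-element inner loop is a scalar stand-in for what a vector-aware spliterator would hand to a single vector operation, with a scalar loop for the tail; no real Stream or vector API is used here.

```java
// Hedged sketch: process the bulk of an int[] in 8-element chunks (a
// stand-in for one vector op per chunk) and fall back to scalars for
// the remainder, as a vector-aware spliterator might.
public final class ChunkedSum {
    public static long sum(int[] a) {
        long total = 0;
        int i = 0;
        // "Vectorized" main loop: one chunk per iteration.
        for (; i + 8 <= a.length; i += 8) {
            long lane = 0;
            for (int j = 0; j < 8; j++) {
                lane += a[i + j];   // stands in for a vector add
            }
            total += lane;
        }
        // Scalar tail for the remaining 0..7 elements.
        for (; i < a.length; i++) {
            total += a[i];
        }
        return total;
    }
}
```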
Best regards,
Vladimir Ivanov
[3]
http://mail.openjdk.java.net/pipermail/panama-dev/2015-December/000225.html
[4]
http://hg.openjdk.java.net/panama/panama/jdk/file/79d0c936193b/test/panama/snippets/VectorizedHashCode.java#l59
static int vectorized_hash(byte[] buf, int off, int len) {
    Long4 acc = Long4.ZERO;
    for (; len >= 8; off += 8, len -= 8) {
        acc = vhash8(acc, U.getLong(buf, Unsafe.ARRAY_BYTE_BASE_OFFSET + off));
    }
    ...
070 B4: # B5 <- B6 top-of-loop Freq: 3.19755
070 movl R11, RBP # spill
070
073 B5: # B32 B6 <- B3 B4 Loop: B5-B4 inner Freq: 3.99749
073 movl RBP, R11 # spill
076 subl R11, [RSP + #12 (32-bit)] # int
07b movl [rsp + #8], R11 # spill
080 movq R10, java/lang/Long4:exact * # ptr
08a vmovdqu XMM1,[R10 + #16 (8-bit)] ! load vector (32 bytes) !
Field: java/lang/Long4.l1
090 vmovdqu XMM0,[RBX + #16 (8-bit)] ! load vector (32 bytes) !
Field: java/lang/Long4.l1
095 code_snippet snippet 5
09a vmovdqu [rsp + 32],XMM0 # spill
0a0 movq R10, java/lang/Long2:exact * # ptr
0aa movdqu XMM0,[R10 + #16 (8-bit)] ! load vector (16 bytes) !
Field: java/lang/Long2.l1
0b0 movl R11, [rsp + #8] # spill
0b5 addl R11, #16 # int
0b9 movslq R10, R11 # i2l
0bc movq R11, [rsp + #0] # spill
0c0 movq RDX, [R11 + R10] # long
0c4 code_snippet snippet 7
0cb movdqu [rsp + 80],XMM0 # spill
0d1 movq R10, java/lang/Long2:exact * # ptr
0db movdqu XMM1,[R10 + #16 (8-bit)] ! load vector (16 bytes) !
Field: java/lang/Long2.l1
0e1 movdqu XMM0,[rsp + 80] # spill
0e7 code_snippet snippet 5
0ec movdqu [rsp + 64],XMM0 # spill
0f2 movq R10, java/lang/Long2:exact * # ptr
0fc movdqu XMM1,[R10 + #16 (8-bit)] ! load vector (16 bytes) !
Field: java/lang/Long2.l1
102 movdqu XMM0,[rsp + 80] # spill
108 code_snippet snippet 5
10d movdqu [rsp + 80],XMM0 # spill
113 movq R10, java/lang/Long4:exact * # ptr
11d vmovdqu XMM0,[R10 + #16 (8-bit)] ! load vector (32 bytes) !
Field: java/lang/Long4.l1
123 movdqu XMM1,[rsp + 64] # spill
129 code_snippet snippet 6
12f movdqu XMM1,[rsp + 80] # spill
135 code_snippet snippet 6
13b movq R10, java/lang/Long4:exact * # ptr
145 vmovdqu XMM1,[R10 + #16 (8-bit)] ! load vector (32 bytes) !
Field: java/lang/Long4.l1
14b code_snippet snippet 5
150 vmovdqu [rsp + 64],XMM0 # spill
156 movq RSI, precise klass java/lang/Long4:
0x00007ffb228919f0:Constant:exact * # ptr
160 call,static wrapper for: _new_instance_Java
168 B6: # B4 B7 <- B5 Freq: 3.99741
168 # checkcastPP of RAX
168 movq RBX, RAX # spill
16b vmovdqu XMM0,[rsp + 32] # spill
171 vmovdqu XMM1,[rsp + 64] # spill
177 code_snippet snippet 4
17b vmovdqu [RBX + #16 (8-bit)],XMM0 ! store vector (32 bytes) !
Field: java/lang/Long4.l1
180 movl R10, [rsp + #12] # spill
185 addl R10, #-8 # int
189 movl [rsp + #12], R10 # spill
18e cmpl R10, #7
192 jg B4 # loop end P=0.799904 C=3350.000000
...