implementation of Vector API and associated lambdas

Vladimir Ivanov vladimir.x.ivanov at oracle.com
Wed Feb 3 14:24:12 UTC 2016


Ian,

>> So, the question is how to map operations on elements to platform-specific instructions. VectorOp seems a viable solution to reliably detect the intended behavior and use the appropriate instruction from the available repertoire.
>
> Just so we're clear, you're talking about VectorOp as it appears here[1]?  John mentioned strategies that included what's in VectorOp at the face-to-face a couple of weeks ago.  I think a symbolic representation could get us where we want to be for now.
Yes, we are talking about the same VectorOp class.

>> But I haven't thought much about how to implement it. Either introduce special API entry points which consume VectorOp.Op (simplest?) or produce specialized lambdas for VectorOp.Op on per-vector species basis (more optimizable?).
>
> Having an interpreter over VectorOp that can produce specialized lambdas would be really cool.  We could potentially suss out some graceful degradation in cases where certain vector widths are not supported in this phase (or fall back to pure Java where we need to).  Just a thought.
That looks promising; it would be worth experimenting with. So far I 
don't see any problems, but we should make sure it plays well with 
inlining.
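To make the two options concrete, here is a minimal sketch (everything except 
Long2 and the VectorUtils.xor helper quoted below is hypothetical and not part 
of the prototype): an entry point that consumes the operation descriptor 
directly versus a factory that interprets the descriptor once and returns a 
specialized lambda.

    import java.util.function.BinaryOperator;

    // Hypothetical sketch only; the real VectorOp class, its Op descriptors and
    // the per-species handling in the prototype may look quite different.
    enum Op { XOR, ADD }

    final class VectorOpsSketch {
        // Option 1: an API entry point that consumes the operation descriptor.
        static Long2 apply(Op op, Long2 a, Long2 b) {
            switch (op) {
                case XOR: return VectorUtils.xor(a, b); // backed by the pxor snippet quoted below
                default:  throw new UnsupportedOperationException(op.toString());
            }
        }

        // Option 2: interpret the descriptor once and hand back a specialized
        // lambda for the species, which the JIT can then inline per call site.
        static BinaryOperator<Long2> specialize(Op op) {
            switch (op) {
                case XOR: return VectorUtils::xor;
                default:  throw new UnsupportedOperationException(op.toString());
            }
        }
    }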

>> Unfortunately, there's no better way to represent vectors on bytecode level than Long2/4/8 value-based wrappers. JVM simply doesn't have int128_t/int256_t/int512_t-equivalent primitives support. So, boxing (in some quantity) is inevitable for now. But, C2 support for redundant boxing/unboxing should alleviate most of the performance problems. And the larger Vector API pipelines are, the less boxing/unboxing should happen in the optimized code.
>
> This is true.  Boxing overhead is going to be present and that's not an issue per se; my question follows from more of a pathological case, with an example.  Say you have the code snippet (from VectorUtils):
>
>      static MethodHandle make_m128_pxor() {
>          // MT (Object,L2,L2)L2
>          return CodeSnippet.make("mm128_pxor",
>                  MT_L2_BINARY, requires(SSE2),
>                  0x66, 0x0F, 0xEF, 0xC1); // pxor xmm0,xmm1
>      }
>
>      static final MethodHandle MHm128_pxor = make_m128_pxor();
>
>      public static Long2 xor(Long2 v1, Long2 v2) {
>              .....
>              return (Long2)MHm128_pxor.invokeExact(v1, v2);
>              .....
>      }
>
> Generates a snippet like so:
>
> Decoding code snippet "mm128_pxor" @ 0x00007f34a11148e0
>    0x00007f34a11148e0: push   %rbp
>    0x00007f34a11148e1: mov    %rsp,%rbp
>    0x00007f34a11148e4: vmovdqu 0x10(%rsi),%xmm0
>    0x00007f34a11148e9: vmovdqu 0x10(%rdx),%xmm1
>    0x00007f34a11148ee: pxor   %xmm1,%xmm0
>    0x00007f34a11148f2: mov    %rdi,%rax
>    0x00007f34a11148f5: vmovdqu %xmm0,0x10(%rax)
>    0x00007f34a11148fa: leaveq
>    0x00007f34a11148fb: retq
>    0x00007f34a11148fc: hlt
>    0x00007f34a11148fd: hlt
>    0x00007f34a11148fe: hlt
>    0x00007f34a11148ff: hlt
To be clear, this is the stand-alone version, which is used from the 
interpreter and from non-optimized generated code. The snippet itself is 
just the "pxor %xmm1,%xmm0" instruction; all the rest is code to comply 
with the calling conventions. It is called through MH.linkToNative and 
wrapped into a NativeMethodHandle.

> Say the programmer decides to compose calls to the handle:
>
> xor(xor(xor(xor(xor(some_long4)))));
>
> Ideal inlining would only un/box once on the boundaries of the snippet (barring spillage).  The most non-ideal case is when all of the boxing lives on through the inlining.  My question is how much fixup/boxing removal does C2 already do, and how much does it need to do to approach the ideal?
C2 already produces decent code for such a code shape [1], provided the xor 
calls are inlined. I did some work to extend box/unbox elimination for value 
wrappers as part of the original code snippet prototype [3].
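For reference, the shape compiled in [1] below is roughly a chain of nested 
xor calls; a sketch (assuming the Long2 wrapper and the VectorUtils.xor helper 
quoted above; the actual test may differ) looks like:

    // Sketch of the nested-xor shape; ideally only the final result is boxed.
    static Long2 testNestedXor(Long2 a, Long2 b, Long2 c, Long2 d, Long2 e, Long2 f) {
        return VectorUtils.xor(
                   VectorUtils.xor(
                       VectorUtils.xor(
                           VectorUtils.xor(
                               VectorUtils.xor(a, b), c), d), e), f);
    }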

[1] http://cr.openjdk.java.net/~vlivanov/panama/code_snippets/vpxor/testNestedXor.txt

   000     # stack bang (688 bytes)
     pushq   rbp # Save rbp
     subq    rsp, #48  # Create frame

   00c     movq    RBX, RDX  # spill
   00f     movdqu  XMM1,[RDX + #16 (8-bit)]  ! load vector (16 bytes) ! Field: java/lang/Long2.l1
   014     movdqu  XMM0,[RSI + #16 (8-bit)]  ! load vector (16 bytes) ! Field: java/lang/Long2.l1
   019     code_snippet snippet 4

   01d     movdqu  [rsp + 0],XMM0  # spill
   022     movdqu  XMM0,[RBX + #16 (8-bit)]  ! load vector (16 bytes) ! Field: java/lang/Long2.l1
   027     movdqu  XMM1,[rsp + 0]  # spill
   02c     code_snippet snippet 4

   030     movdqu  [rsp + 16],XMM0 # spill
   036     movdqu  XMM0,[rsp + 0]  # spill
   03b     movdqu  XMM1,[rsp + 16] # spill
   041     code_snippet snippet 4

   045     movdqu  [rsp + 0],XMM0  # spill
   04a     movdqu  XMM0,[rsp + 16] # spill
   050     movdqu  XMM1,[rsp + 0]  # spill
   055     code_snippet snippet 4

   059     movdqu  [rsp + 16],XMM0 # spill
   05f     movq    RSI, precise klass java/lang/Long2: 0x00007fe194916f50:Constant:exact * # ptr
           nop   # 2 bytes pad for loops and calls
   06b     call,static  wrapper for: _new_instance_Java

   070     movq    RBX, RAX  # spill
   073     # checkcastPP of RBX
   073     movdqu  XMM0,[rsp + 0]  # spill
   078     movdqu  XMM1,[rsp + 16] # spill
   07e     code_snippet snippet 4

   082     movdqu  [RBX + #16 (8-bit)],XMM0  ! store vector (16 bytes) ! Field: java/lang/Long2.l1
   087     movq    RAX, RBX  # spill
   08a     addq    rsp, 48 # Destroy frame
     popq   rbp
     testl  rax, [rip + #offset_to_poll_page]  # Safepoint: poll for GC

   095     ret

As you can see, only one box is allocated, for the final result. But there 
are many spills, caused by the fixed calling conventions of the snippet.

If a register allocator-friendly snippet is used [2], most of the spills 
are gone.

[2] http://cr.openjdk.java.net/~vlivanov/panama/code_snippets/vpxor/testNestedXorRA.txt

   000     # stack bang (688 bytes)
     pushq   rbp # Save rbp
     subq    rsp, #48  # Create frame

   00c     movq    RBX, RDX  # spill
   00f     movdqu  XMM0,[RDX + #16 (8-bit)]  ! load vector (16 bytes) ! Field: java/lang/Long2.l1
   014     movdqu  XMM1,[RSI + #16 (8-bit)]  ! load vector (16 bytes) ! Field: java/lang/Long2.l1
   019     code_snippet snippet 4

   01d     movdqu  XMM1,[RBX + #16 (8-bit)]  ! load vector (16 bytes) ! Field: java/lang/Long2.l1
   022     code_snippet snippet 4

   026     code_snippet snippet 4

   02a     code_snippet snippet 4

   02e     movdqu  [rsp + 16],XMM1 # spill
   034     movdqu  [rsp + 0],XMM0  # spill
   039     movq    RSI, precise klass java/lang/Long2: 0x00007fdb2a8aa750:Constant:exact * # ptr
   043     call,static  wrapper for: _new_instance_Java

   048     movq    RBX, RAX  # spill
   04b     # checkcastPP of RBX
   04b     movdqu  XMM0,[rsp + 0]  # spill
   050     movdqu  XMM1,[rsp + 16] # spill
   056     code_snippet snippet 4

   05a     movdqu  [RBX + #16 (8-bit)],XMM0  ! store vector (16 bytes) ! Field: java/lang/Long2.l1
   05f     movq    RAX, RBX  # spill
   062     addq    rsp, 48 # Destroy frame
     popq   rbp
     testl  rax, [rip + #offset_to_poll_page]  # Safepoint: poll for GC

   06d     ret

There are still some inefficiencies left.

For example, there are spills around the runtime call for the new instance 
allocation. It would be better to move the allocation either up to the very 
beginning or down after the last snippet, where all the intermediate values 
are already dead. In that respect, any call is problematic. Safepoints 
aren't that bad (for non-escaping values EA is able to eliminate the 
allocation and the box can be allocated during deoptimization), but there 
are still cases where safepoints hinder box elimination.

Also, vector box elimination doesn't play well with loops. For example, 
VectorizedHashCode::vectorized_hash [4] has a loop with a vector 
accumulator (Long4 acc). C2 eliminates all boxing inside VHC::vhash8, so 
you might expect the accumulator not to be boxed either, but it is, and 
the boxing happens on every iteration.

>> Regarding the Stream API, it doesn't look too attractive except for interoperability. Stream fusion (performing the pipeline in a single pass) doesn't help vectors: it is better to apply the pipeline step-by-step on the whole vector than to split the vector into pieces.
>> What does look viable is to add Vector API-directed optimizations to the Stream API implementation - take {Int,Long,Double}Stream + VectorOp-based operations and, using a vector-aware spliterator, translate a pipeline on primitives into a pipeline on vectors.
>
> I agree.  I brought up the Stream API because I think that we could consider the composition of Vector operations similar to something like Stream fusion.  When composing operations together, we're doing a kind of fusion (combining snippets together inside of an un/boxing), but we have to give attention to register allocation, or at least some bookkeeping for our data flow.  It's an approach that could be an alternative to, or in support of, a beefed-up fixup pass in C2.  I'm not sure if one way is more or less desirable.
As you can see from the examples, C2 is already powerful enough to 
produce pretty dense code for nested vector operations. I don't see any 
need for additional work to fuse the operations at the library level, 
except for inlining, of course.
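
As for the vector-aware translation mentioned in the quoted text, one possible 
reading is simply to process the primitives in vector-sized chunks instead of 
fusing per element. A purely illustrative sketch (all names are hypothetical; 
nothing like this exists in the prototype):

    import java.util.function.IntUnaryOperator;

    // Hypothetical sketch: apply one pipeline step to a whole chunk of
    // primitives at a time, with a scalar loop for the tail elements.
    final class VectorPipelineSketch {
        interface ChunkOp {                      // hypothetical vector counterpart of a scalar op
            void apply(int[] data, int offset, int length);
        }

        static void mapInPlace(int[] data, ChunkOp vectorOp, IntUnaryOperator scalarOp) {
            final int CHUNK = 8;                 // e.g. eight ints per 256-bit vector
            int i = 0;
            for (; i + CHUNK <= data.length; i += CHUNK) {
                vectorOp.apply(data, i, CHUNK);  // one step on a whole vector
            }
            for (; i < data.length; i++) {       // scalar tail
                data[i] = scalarOp.applyAsInt(data[i]);
            }
        }
    }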

Best regards,
Vladimir Ivanov

[3] http://mail.openjdk.java.net/pipermail/panama-dev/2015-December/000225.html

[4] http://hg.openjdk.java.net/panama/panama/jdk/file/79d0c936193b/test/panama/snippets/VectorizedHashCode.java#l59

   static int vectorized_hash(byte[] buf, int off, int len) {
       Long4 acc = Long4.ZERO;
       for (; len >= 8; off += 8, len -= 8) {
           acc = vhash8(acc, U.getLong(buf, Unsafe.ARRAY_BYTE_BASE_OFFSET + off));
       }

...
070   B4: # B5 <- B6  top-of-loop Freq: 3.19755
070     movl    R11, RBP  # spill
070
073   B5: # B32 B6 <- B3 B4   Loop: B5-B4 inner  Freq: 3.99749
073     movl    RBP, R11  # spill
076     subl    R11, [RSP + #12 (32-bit)] # int
07b     movl    [rsp + #8], R11 # spill
080     movq    R10, java/lang/Long4:exact *  # ptr
08a     vmovdqu XMM1,[R10 + #16 (8-bit)]  ! load vector (32 bytes) ! Field: java/lang/Long4.l1
090     vmovdqu XMM0,[RBX + #16 (8-bit)]  ! load vector (32 bytes) ! Field: java/lang/Long4.l1
095     code_snippet snippet 5

09a     vmovdqu [rsp + 32],XMM0 # spill
0a0     movq    R10, java/lang/Long2:exact *  # ptr
0aa     movdqu  XMM0,[R10 + #16 (8-bit)]  ! load vector (16 bytes) ! Field: java/lang/Long2.l1
0b0     movl    R11, [rsp + #8] # spill
0b5     addl    R11, #16  # int
0b9     movslq  R10, R11  # i2l
0bc     movq    R11, [rsp + #0] # spill
0c0     movq    RDX, [R11 + R10]  # long
0c4     code_snippet snippet 7

0cb     movdqu  [rsp + 80],XMM0 # spill
0d1     movq    R10, java/lang/Long2:exact *  # ptr
0db     movdqu  XMM1,[R10 + #16 (8-bit)]  ! load vector (16 bytes) ! Field: java/lang/Long2.l1
0e1     movdqu  XMM0,[rsp + 80] # spill
0e7     code_snippet snippet 5

0ec     movdqu  [rsp + 64],XMM0 # spill
0f2     movq    R10, java/lang/Long2:exact *  # ptr
0fc     movdqu  XMM1,[R10 + #16 (8-bit)]  ! load vector (16 bytes) ! Field: java/lang/Long2.l1
102     movdqu  XMM0,[rsp + 80] # spill
108     code_snippet snippet 5

10d     movdqu  [rsp + 80],XMM0 # spill
113     movq    R10, java/lang/Long4:exact *  # ptr
11d     vmovdqu XMM0,[R10 + #16 (8-bit)]  ! load vector (32 bytes) ! Field: java/lang/Long4.l1
123     movdqu  XMM1,[rsp + 64] # spill
129     code_snippet snippet 6

12f     movdqu  XMM1,[rsp + 80] # spill
135     code_snippet snippet 6

13b     movq    R10, java/lang/Long4:exact *  # ptr
145     vmovdqu XMM1,[R10 + #16 (8-bit)]  ! load vector (32 bytes) ! Field: java/lang/Long4.l1
14b     code_snippet snippet 5

150     vmovdqu [rsp + 64],XMM0 # spill
156     movq    RSI, precise klass java/lang/Long4: 0x00007ffb228919f0:Constant:exact * # ptr
160     call,static  wrapper for: _new_instance_Java

168   B6: # B4 B7 <- B5  Freq: 3.99741
168     # checkcastPP of RAX
168     movq    RBX, RAX  # spill
16b     vmovdqu XMM0,[rsp + 32] # spill
171     vmovdqu XMM1,[rsp + 64] # spill
177     code_snippet snippet 4

17b     vmovdqu [RBX + #16 (8-bit)],XMM0  ! store vector (32 bytes) ! Field: java/lang/Long4.l1
180     movl    R10, [rsp + #12]  # spill
185     addl    R10, #-8  # int
189     movl    [rsp + #12], R10  # spill
18e     cmpl    R10, #7
192     jg     B4 # loop end  P=0.799904 C=3350.000000
...

