Some New Vector API Code + Code Snippets

Tue Jun 7 10:25:41 UTC 2016

Ian,

> We've been working more on the Vector API concrete classes and structuring these classes to better support specialization (à la Stream) and enable us to do loads and stores in arrays of primitive types.  Additionally, we've been working on better structuring our Code Snippets and adding support for modRM and SIB encoding that supports memory accesses.  Vladimir Ivanov has been incredibly helpful getting us to where we are with this now!
>
> Where before we structured our Vector inheritance as Vector > Concrete Instance (Element x Shape/Size), we now go Vector > Elemental Superclass > Sized Concrete Class.  For example, for our 256-bit float implementation, the structure is Vector > FloatVector > Float256Vector.  Methods supporting reads/writes into float[] arrays appear in FloatVector as abstract methods, and are implemented fully in Float256Vector.  Right now the elemental classes are pretty slim.  Most of the methods still reside in the Vector class.  In the case of streams, it seems that the superclass is the lightweight one, and the intermediate specialized classes are more heavyweight (BaseStream vs Stream or IntStream etc.).  It might make sense to draw down the Vector superclass and bulk out the specialized classes here so we can elide most or all of our boxed primitives from the design.  I think there is definitely design work to be done on structuring these classes.  Would love to get some opinions on that.
>
> I've included some tests that drive the classes as they are in the webrev.  Vladimir suggested that we use the double register approach (Object + long offset) for invoking vmovups/vmovdqu instructions.  The tests work for small examples, but there seem to be persistent VM-crashing bugs that happen when they get inlined by C2.  I haven't been able to identify what causes them yet, but I do know that if you disable C2, the crashing doesn't occur.  The examples include commented-out loops that you can uncomment to replicate the behavior.
No surprise: only C2 specializes and embeds snippets into generated code.
C1 and the interpreter call stand-alone versions.

After fixing a couple of bugs [1] with double/long arguments, I hit a 
SEGV in the snippet code:

$ java ... -XX:+PrintCodeSnippets -Dpanama.CodeSnippet.DEBUG=true \
    -XX:CompileCommand=print,com.oracle.vector.Float256Vector::intoArray \
    AddArrays
...
Decoding specialized code snippet "mm256_vmovups_store":
   Context:  401    b        com.oracle.vector.Float256Vector::intoArray (24 bytes)
   Registers: (r11,r10,xmm0)BAD
   Code: c4 81 7c 11 04 5b
   0x00000001188890f0: vmovups %ymm0,(%r11,%r11,2)
...
Compiled method (c2)   32823  401 com.oracle.vector.Float256Vector::intoArray (24 bytes)
...
  main code      [0x0000000105bbca40,0x0000000105bbcae0] = 160
...
   0x0000000105bbcaab: movslq %eax,%r10
   0x0000000105bbcaae: mov    (%rsp),%r11

  ;; snippet "mm256_vmovups_store" {

   0x0000000105bbcab2: vmovups %ymm0,(%r11,%r11,2)

  ;; } snippet "mm256_vmovups_store"
...
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x000000010688a2b2, pid=28380, tid=6147
#
# JRE version: Java(TM) SE Runtime Environment (9.0) (slowdebug build 9-internal+0-2016-05-26-154016.vlivanov.panama)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (slowdebug 9-internal+0-2016-05-26-154016.vlivanov.panama, mixed mode, compressed oops, g1 gc, bsd-amd64)
# Problematic frame:
# J 411 C2 com.oracle.vector.Float256Vector.intoArray([FI)V (24 bytes) @ 0x000000010688a2b2 [0x000000010688a240+0x0000000000000072]


With a trivial workaround [2] it works fine:

Decoding specialized code snippet "mm256_vmovups_store":
   Context:  443 %  b        AddArrays::addArrays @ 19 (72 bytes)
   Registers: (r11,r10,xmm0)BAD
   Code: c4 81 7c 11 04 13
   0x00007fc9cc00e1a0: vmovups %ymm0,(%r11,%r10,1)

After that, all your tests pass.
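
The two "Code:" lines above differ only in the last byte, the SIB byte (5b in the crashing run, 13 in the working one). A minimal sketch of why, assuming the standard x86-64 register numbers r10 = 10 and r11 = 11; the class and method names here are made up for illustration and are not part of PatchableVecUtils:

```java
public class SibByteDemo {
    // x86-64 hardware register numbers (assumed inputs for this sketch).
    static final int R10 = 10;
    static final int R11 = 11;

    // Unmasked packing: bit 3 of a register number spills into the
    // neighbouring SIB field (index bit 3 lands in the scale field).
    static int sibUnmasked(int scale, int index, int base) {
        return (scale << 6) | (index << 3) | base;
    }

    // Masked packing: only the low 3 bits of each register number go into
    // the SIB byte; bit 3 is carried by the REX/VEX prefix instead.
    static int sibMasked(int scale, int index, int base) {
        return (scale << 6) | ((index & 0x7) << 3) | (base & 0x7);
    }

    public static void main(String[] args) {
        int scale = 0; // scale field 0 means a factor of 1
        // base = r11, index = r10, as in (%r11,%r10,1)
        System.out.printf("unmasked: %02x%n", sibUnmasked(scale, R10, R11)); // 5b
        System.out.printf("masked:   %02x%n", sibMasked(scale, R10, R11));   // 13
    }
}
```

Decoding 5b as scale=01, index=011, base=011 and adding the prefix extension bits gives index r11, base r11, factor 2, which is the (%r11,%r11,2) operand in the crashing disassembly. Decoding 13 gives index r10, base r11, factor 1.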

> One thing we've been particularly interested in is the memory overhead of this API.  In a discussion with John, Paul, and Mikael we had a month or so ago, one idea that came up was that we could lean pretty heavily on escape analysis to head off the object allocation issue.  I've been able to do some quick sampling of the heap using jmap to get some histograms, and it seems like the Vector API makes little if any appearance on-heap.  That is to say that escape analysis seems to be working on that part of the stack.  The escape analysis pass seems to have trouble reasoning about the Long2/4/8 data types when they appear in loops, however.  In other cases it does work.  We've noticed significant memory usage when putting Vector operations in tight loops like one might find in a libblas sgemm implementation.  A cursory analysis suggests that a lot of Long2/4/8 objects are making it to the heap that don't need to be going there.  All things considered, though, escape analysis is working very well!

Long2/Long4/Long8 are intended to be value classes (explored in Project 
Valhalla [3]). EA does a decent job, but it still has to be conservative 
about box identity for escaping objects. It should be fixed with 
"heisenboxes" for value classes, but they are not supported yet.

Moreover, the EA implementation in C2 itself has some weak points. Whereas
Partial EA (implemented in Graal) can push allocations into slow paths
and eliminate boxing/unboxing in hot code, C2 mostly gives up when
there's a chance the object escapes.

Also, the current implementation is flow-insensitive, so it bails out
when there is a possibility an address points to multiple objects [4]
[5]. That's the reason boxes are not eliminated in loops. I noticed that
when working on the VectorizedHashCode sample [6]: though the loop
accumulator doesn't escape, it is boxed/unboxed on every loop iteration.
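
A rough sketch of the code shape in question. Box4 is a hypothetical stand-in for a Long4-style immutable box, not the actual Panama class, and whether a given JVM allocates here depends on its version and flags. The point is only that at the loop head `acc` may refer to either the initial box or the one produced by the previous iteration, i.e. more than one allocation site:

```java
public class BoxedAccumulatorDemo {
    // Hypothetical immutable 4-lane box, standing in for a Long4-like type.
    static final class Box4 {
        final long l0, l1, l2, l3;

        Box4(long l0, long l1, long l2, long l3) {
            this.l0 = l0; this.l1 = l1; this.l2 = l2; this.l3 = l3;
        }

        // Lane-wise add; every call produces a fresh box.
        Box4 add(Box4 o) {
            return new Box4(l0 + o.l0, l1 + o.l1, l2 + o.l2, l3 + o.l3);
        }
    }

    static long sumLanes(long[] data) {
        Box4 acc = new Box4(0, 0, 0, 0);          // allocation site A
        for (int i = 0; i + 4 <= data.length; i += 4) {
            Box4 v = new Box4(data[i], data[i + 1], data[i + 2], data[i + 3]);
            // On entry to each iteration acc is either A or a box created by
            // add() in the previous iteration, so it never escapes the method
            // but still merges two possible objects at the loop head.
            acc = acc.add(v);
        }
        return acc.l0 + acc.l1 + acc.l2 + acc.l3;
    }

    public static void main(String[] args) {
        long[] data = new long[16];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        System.out.println(sumLanes(data)); // 0 + 1 + ... + 15 = 120
    }
}
```

Running a loop like this with a heap histogram (as done with jmap in the quoted message) is one way to check whether the boxes are actually being allocated on a particular build.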

I'll spend some time looking at low-hanging fruit in C2 EA, but value
classes are a much better fit for vector values. So, let's keep an eye on
what happens in Project Valhalla.

>
> The HOF components of the Vector API are still implemented in scalar form in this code.  Right now I'm focused on getting the concrete methods working well before putting more cycles into the higher order bits.
>
> Let me know your thoughts.
>
> The webrev is located here:  http://cr.openjdk.java.net/~vdeshpande/Panama_Collaboration/webrev.01/

Looks awesome! I'll wait until you fix a bug in register encoding and 
push it into the repository.

Best regards,
Vladimir Ivanov

[1] http://hg.openjdk.java.net/panama/panama/hotspot/rev/380860fc02cf

[2]

diff --git a/test/panama/vector-api-patchable/src/main/java/com/oracle/vector/PatchableVecUtils.java b/test/panama/vector-api-patchable/src/main/java/com/oracle/vector/PatchableVecUtils.java
--- a/test/panama/vector-api-patchable/src/main/java/com/oracle/vector/PatchableVecUtils.java
+++ b/test/panama/vector-api-patchable/src/main/java/com/oracle/vector/PatchableVecUtils.java
@@ -141,7 +141,7 @@
        assert (scale & 0x3) == scale;
        assert (index.encoding() & 0x7) == index.encoding();
        assert (base.encoding() & 0x7) == base.encoding();
-      return (scale << 6) | (index.encoding() << 3) | base.encoding();
+      return (scale << 6) | ((index.encoding() & 0x7) << 3) | (base.encoding() & 0x7);
      }

[3] http://openjdk.java.net/projects/valhalla/

[4] http://hg.openjdk.java.net/jdk9/hs-comp/hotspot/file/4fce6a99a262/src/share/vm/opto/escape.cpp#l1733

[5] https://bugs.openjdk.java.net/browse/JDK-6853701

[6] http://hg.openjdk.java.net/panama/panama/jdk/file/c5a104d33632/test/panama/snippets/VectorizedHashCode.java#l64