[vector] Vector API -- alignment with value types

Viswanathan, Sandhya sandhya.viswanathan at intel.com
Fri Feb 1 00:00:18 UTC 2019


Hi Brian,



I applied your two patches and ran some experiments, and the results are encouraging.



I started with the small test kernel below:



    static final FloatVector.FloatSpecies SPECIES = FloatVector.species(Shape.S_256_BIT);
    static float[] a = new float[SIZE];
    static float[] b = new float[SIZE];
    static float[] c = new float[SIZE];

    static void workload() {
        for (int i = 0; i < a.length; i += SPECIES.length()) {
            FloatVector av = FloatVector.fromArray(SPECIES, a, i);
            FloatVector bv = FloatVector.fromArray(SPECIES, b, i);
            av.add(bv).intoArray(c, i);
        }
    }
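
For checking the kernel's output, a plain scalar loop computing the same c[i] = a[i] + b[i] can serve as a reference. The sketch below is only an illustration: the class and method names are hypothetical and not part of the patch. It also makes explicit that the kernel steps by SPECIES.length() with no tail loop, so it presumes SIZE is a multiple of the lane count.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the patch: scalar reference for the kernel.
final class ScalarAddReference {
    // Element-wise sum, the result the vector kernel is expected to produce.
    static float[] add(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        float[] c = new float[a.length];
        for (int i = 0; i < a.length; i++) {
            c[i] = a[i] + b[i];
        }
        return c;
    }

    // True when a loop stepping by `lanes` with no tail loop stays in bounds
    // and touches every element.
    static boolean coversWithoutTail(int size, int lanes) {
        return size % lanes == 0;
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f, 4f, 5f, 6f, 7f, 8f};
        float[] b = {8f, 7f, 6f, 5f, 4f, 3f, 2f, 1f};
        System.out.println(Arrays.toString(add(a, b)));
        System.out.println(coversWithoutTail(a.length, 8));
    }
}
```

Comparing the array c written by workload() against ScalarAddReference.add(a, b) is one way to confirm that the refactored factory methods did not change results.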


Your patch only had the fromByteArray flavor of the factory method in the XXXVector classes.

To compile and run the kernel above, I needed to add a fromArray factory method to the FloatVector class.

I added it to FloatVector.java along the same lines as the fromByteArray method:

    @ForceInline
    @SuppressWarnings("unchecked")
    public static FloatVector fromArray(FloatSpecies species, float[] a, int ix) {
        Objects.requireNonNull(a);
        ix = VectorIntrinsics.checkIndex(ix, a.length, species.length());
        return VectorIntrinsics.load((Class<FloatVector>) species.boxType(), float.class, species.length(),
                                     a, (((long) ix) << ARRAY_SHIFT) + Unsafe.ARRAY_FLOAT_BASE_OFFSET,
                                     a, ix,
                                     (c, idx) -> species.op(n -> c[idx + n]));
    }
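
To make the two non-obvious arguments concrete, here is a small scalar model of what fromArray passes to the intrinsic: the byte offset of element ix, and the lane-by-lane fallback expressed by the lambda. This is a sketch under assumptions, not the VectorIntrinsics implementation: the class name is hypothetical, ARRAY_SHIFT is taken to be 2 (log2 of 4 bytes per float), and the base offset of 16 used in main is an assumed value chosen to match the 0x10 displacement in the disassembly.

```java
import java.util.Arrays;
import java.util.Objects;

// Hypothetical model, not the VectorIntrinsics implementation.
final class FromArrayModel {
    // Assumed: log2(Float.BYTES), mirroring the ARRAY_SHIFT constant used above.
    static final int ARRAY_SHIFT = 2;

    // Byte offset of element ix from the start of the array object.
    // baseOffset stands in for Unsafe.ARRAY_FLOAT_BASE_OFFSET.
    static long byteOffset(int ix, long baseOffset) {
        return (((long) ix) << ARRAY_SHIFT) + baseOffset;
    }

    // Scalar equivalent of the fallback lambda: lane n receives a[ix + n].
    static float[] loadLanes(float[] a, int ix, int lanes) {
        Objects.requireNonNull(a);
        Objects.checkFromIndexSize(ix, lanes, a.length); // bounds check on [ix, ix + lanes)
        float[] v = new float[lanes];
        for (int n = 0; n < lanes; n++) {
            v[n] = a[ix + n];
        }
        return v;
    }

    public static void main(String[] args) {
        float[] a = {0f, 1f, 2f, 3f, 4f, 5f, 6f, 7f, 8f, 9f};
        System.out.println(byteOffset(8, 16L));                 // offset of a[8] with the assumed base
        System.out.println(Arrays.toString(loadLanes(a, 2, 8))); // lanes a[2]..a[9]
    }
}
```

When the intrinsic is applied, the JIT replaces the lane-by-lane path with a single vector load from the computed address; the lambda is only the fallback.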



With this change, I ran the test kernel as follows:

$JAVA_HOME/bin/java --add-modules=jdk.incubator.vector -XX:CompileCommand=print,mytest.workload -XX:CompileCommand=dontinline,mytest.workload -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0 -XX:-UseSuperWord -XX:+PrintInlining mytest >& out



All the load/store and binaryOp intrinsics are still applied:

$ grep VectorIntrinsics:: out | grep intrinsic
                              @ 47   jdk.incubator.vector.VectorIntrinsics::load (12 bytes)   failed to inline (intrinsic)
                                @ 26   jdk.incubator.vector.VectorIntrinsics::binaryOp (12 bytes)   (intrinsic)
                              @ 43   jdk.incubator.vector.VectorIntrinsics::store (14 bytes)   (intrinsic)
                                @ 47   jdk.incubator.vector.VectorIntrinsics::load (12 bytes)   (intrinsic)
                                @ 47   jdk.incubator.vector.VectorIntrinsics::load (12 bytes)   (intrinsic)
                                  @ 26   jdk.incubator.vector.VectorIntrinsics::binaryOp (12 bytes)   (intrinsic)
                                @ 43   jdk.incubator.vector.VectorIntrinsics::store (14 bytes)   (intrinsic)
                                @ 47   jdk.incubator.vector.VectorIntrinsics::load (12 bytes)   (intrinsic)
                                @ 47   jdk.incubator.vector.VectorIntrinsics::load (12 bytes)   (intrinsic)
                                  @ 26   jdk.incubator.vector.VectorIntrinsics::binaryOp (12 bytes)   (intrinsic)
                                @ 43   jdk.incubator.vector.VectorIntrinsics::store (14 bytes)   (intrinsic)



The code generated for the hot loop is as efficient as before:

;; B7: #       B7 B8 <- B6 B7  Loop: B7-B7 inner main of N69 Freq: 128.015
  0x00007f98500eba50: vmovdqu 0x10(%r11,%rdi,4),%ymm0
  0x00007f98500eba57: vaddps 0x10(%r10,%rdi,4),%ymm0,%ymm0
  0x00007f98500eba5e: vmovdqu %ymm0,0x10(%r8,%rdi,4)
  0x00007f98500eba65: vmovdqu 0x30(%r11,%rdi,4),%ymm0
  0x00007f98500eba6c: vaddps 0x30(%r10,%rdi,4),%ymm0,%ymm0
  0x00007f98500eba73: vmovdqu %ymm0,0x30(%r8,%rdi,4)
  0x00007f98500eba7a: vmovdqu 0x50(%r11,%rdi,4),%ymm0
  0x00007f98500eba81: vaddps 0x50(%r10,%rdi,4),%ymm0,%ymm0
  0x00007f98500eba88: vmovdqu %ymm0,0x50(%r8,%rdi,4)
  0x00007f98500eba8f: vmovdqu 0x70(%r11,%rdi,4),%ymm0
  0x00007f98500eba96: vaddps 0x70(%r10,%rdi,4),%ymm0,%ymm0
  0x00007f98500eba9d: vmovdqu %ymm0,0x70(%r8,%rdi,4)
  0x00007f98500ebaa4: vmovdqu 0x90(%r11,%rdi,4),%ymm0
  0x00007f98500ebaae: vaddps 0x90(%r10,%rdi,4),%ymm0,%ymm0
  0x00007f98500ebab8: vmovdqu %ymm0,0x90(%r8,%rdi,4)
  0x00007f98500ebac2: vmovdqu 0xb0(%r11,%rdi,4),%ymm0
  0x00007f98500ebacc: vaddps 0xb0(%r10,%rdi,4),%ymm0,%ymm0
  0x00007f98500ebad6: vmovdqu %ymm0,0xb0(%r8,%rdi,4)
  0x00007f98500ebae0: vmovdqu 0xd0(%r11,%rdi,4),%ymm0
  0x00007f98500ebaea: vaddps 0xd0(%r10,%rdi,4),%ymm0,%ymm0
  0x00007f98500ebaf4: vmovdqu %ymm0,0xd0(%r8,%rdi,4)
  0x00007f98500ebafe: vmovdqu 0xf0(%r11,%rdi,4),%ymm0
  0x00007f98500ebb08: vaddps 0xf0(%r10,%rdi,4),%ymm0,%ymm0
  0x00007f98500ebb12: vmovdqu %ymm0,0xf0(%r8,%rdi,4)
  0x00007f98500ebb1c: add    $0x40,%edi
  0x00007f98500ebb1f: cmp    %ebx,%edi
  0x00007f98500ebb21: jl     0x00007f98500eba50
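
The shape of this loop can be cross-checked with a little arithmetic: each ymm register holds eight floats, the displacements advance by 0x20 bytes per load/add/store group, and the eight groups together account for the 0x40 added to the index register. The snippet below only restates that arithmetic; the class name is hypothetical and the inputs are read off the listing.

```java
// Arithmetic check of the loop shape; the inputs are read off the listing above.
final class UnrollCheck {
    // Number of lanes in one vector register.
    static int lanes(int vectorBits, int elementBits) {
        return vectorBits / elementBits;
    }

    // Array elements consumed by one trip through the unrolled loop body.
    static int elementsPerIteration(int unrolledBodies, int lanes) {
        return unrolledBodies * lanes;
    }

    public static void main(String[] args) {
        int lanes = lanes(256, Float.SIZE);                 // 256-bit ymm register, 32-bit floats
        int bytesPerVector = lanes * Float.BYTES;           // stride between displacements
        int perIteration = elementsPerIteration(8, lanes);  // eight load/add/store groups
        System.out.printf("lanes=%d stride=0x%x step=0x%x%n", lanes, bytesPerVector, perIteration);
    }
}
```

Running it prints lanes=8 stride=0x20 step=0x40, which agrees with the displacements (0x10, 0x30, ... 0xf0) and with the add $0x40,%edi at the bottom of the loop.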



I plan to try more complex cases tomorrow.



Best Regards,

Sandhya



-----Original Message-----
From: panama-dev [mailto:panama-dev-bounces at openjdk.java.net] On Behalf Of Brian Goetz
Sent: Wednesday, January 30, 2019 1:54 PM
To: panama-dev at openjdk.java.net
Subject: Re: [vector] Vector API -- alignment with value types



I've tried to attach the patches-in-progress for these.  There are two: the speciesFactories.patch I sent recently, and zero.patch.  These only include changes to Vector and the templates; you'll have to re-run the gen-* scripts to get changes to the rest.  They apply to the last version for which the tests pass.



Perhaps someone could look at the generated code from typical operations and see how badly this approach perturbs it?







On 1/30/2019 3:58 PM, Brian Goetz wrote:

>

>

>> Part I

>> ------

>>

>> Here's an idea for simplifying Species, which is: let's drive Species

>> down to be a simple class that is really just a constant holder for

>> (element type, shape), and move all the behavior to static methods on

>> XxxVector.

>

> I've started prototyping this, in a rather slash-and-burn manner. I am

> about halfway through.  So far, it's working, and the set of changes I had

> to make to client code is very small (almost all transforming

> species.doSomething(args) to XxxVector.doSomething(species, args)).

>

> The question, of course, is whether the intrinsification will all

> survive the transformation.  The most common case is that I have

> transformed vector intrinsic calls from

>

>     VectorIntrinsics.blah(Int128Vector.class, ...)

>

> to

>

>     VectorIntrinsics.blah(species.boxType(), ...)

>

> The basic assumption is that, under the same conditions that we get

> inlining now, we'll know the species is a constant, and boxType() will

> just inline to a concrete box type anyway.  (The goal is to get

> species to be values, but in the meantime, they can be enum

> constants.)  This should work, but we'll likely have to do some JIT

> work to get back to where we see all the inlining and

> intrinsification.  (Much of this would come free in Valhalla.)

>

> There are a few cases where we can't just do the above, and have to do

> a switch in the lifted method, such as:

>

> public static Shuffle<Byte> shuffle(ByteSpecies species, IntUnaryOperator f) {
>     if (species.boxType() == ByteMaxVector.class)
>         return new ByteMaxVector.ByteMaxShuffle(f);
>     switch (species.bitSize()) {
>         case 64: return new Byte64Vector.Byte64Shuffle(f);
>         case 128: return new Byte128Vector.Byte128Shuffle(f);
>         case 256: return new Byte256Vector.Byte256Shuffle(f);
>         case 512: return new Byte512Vector.Byte512Shuffle(f);
>         default: throw new IllegalArgumentException(Integer.toString(species.bitSize()));
>     }
> }

>

>

> Because again, species is a constant, this should also just inline

> down to the right thing.  So far, other than reshape/rebracket, I

> haven't seen anything requiring more complicated transformations.

>

> The only code I found so far that tries to be agnostic to both shape and

> size is VectorResizeTest; there are strategies for making this

> work without the combinatorial automated specialization, so I don't

> see this as a big impediment.

>

> Where this leads to is:

>  - Vector.Species becomes an interface with a handful of methods

> (bitSize, etc), and quite possibly one that no one uses directly;

>  - IntSpecies and friends become enums, with enum constants for I64,

> I128, etc (or, values, with static constants for the values);

>  - The specialized classes for XxxNnVector.XxxNnSpecies _go away_;

>  - Users need not learn about species at all, but if they do care,

> they are just simple data constants that get fed through the API.

>

> I'm not done (and am traveling the next two weeks), but I think I've

> made progress validating the API transformation.  The real question,

> then, is when do we do this.  I think it would be best to do before

> previewing, simply because it is such an intrusive refactoring. But,

> we'd have to evaluate whether we can mitigate the impact in time.

>

>
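
The direction described in the quoted message (species as simple constants, with behavior moved to static methods that take the species as an argument) can be modeled in a few lines of plain Java. This is a sketch under assumptions, not code from the patch: every name in it is hypothetical, only two shapes are modeled, and the shuffle classes are simplified stand-ins for the generated Byte64Vector.Byte64Shuffle-style classes.

```java
import java.util.function.IntUnaryOperator;

// Hypothetical model of the proposed shape of the API; not code from the patch.
final class SpeciesSketch {
    // Species reduced to a constant holder for (element type, shape).
    enum ByteSpecies {
        S64(64), S128(128);

        private final int bitSize;

        ByteSpecies(int bitSize) {
            this.bitSize = bitSize;
        }

        int bitSize() {
            return bitSize;
        }

        int length() {
            return bitSize / Byte.SIZE;
        }
    }

    // Simplified stand-ins for the per-shape shuffle classes.
    abstract static class BaseShuffle {
        final int[] lanes;

        BaseShuffle(int length, IntUnaryOperator f) {
            lanes = new int[length];
            for (int i = 0; i < length; i++) {
                lanes[i] = f.applyAsInt(i);
            }
        }
    }

    static final class Shuffle64 extends BaseShuffle {
        Shuffle64(IntUnaryOperator f) {
            super(ByteSpecies.S64.length(), f);
        }
    }

    static final class Shuffle128 extends BaseShuffle {
        Shuffle128(IntUnaryOperator f) {
            super(ByteSpecies.S128.length(), f);
        }
    }

    // Behavior lives in a static method; the species is just an argument.
    // With a constant species, a JIT can in principle fold the switch to one arm.
    static BaseShuffle shuffle(ByteSpecies species, IntUnaryOperator f) {
        switch (species.bitSize()) {
            case 64: return new Shuffle64(f);
            case 128: return new Shuffle128(f);
            default: throw new IllegalArgumentException(Integer.toString(species.bitSize()));
        }
    }

    public static void main(String[] args) {
        BaseShuffle s = shuffle(ByteSpecies.S128, i -> 15 - i); // reverse the 16 lanes
        System.out.println(s.getClass().getSimpleName() + " with " + s.lanes.length + " lanes");
    }
}
```

Client code changes from species.shuffle(f) to SpeciesSketch.shuffle(species, f), which is the same mechanical rewrite the message describes for the real API.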



