On generalizing vector intrinsics

Thu Nov 16 18:04:19 UTC 2017

> Ok. Of the other options 2) seems more appealing, but 1) is more natural to the coder. Perhaps let's wait and see how this sits when considering the other operations.

FTR there's no difference for C2 between all 3 options. For the 
following toy example, it always produces the same code:

static int workload256(int i) {
   Vector.Species<Integer,Shapes.S256Bit> species = ...
   return species.broadcast(i).sumAll();
}

# replicate8I
vmovd  %ebp,%xmm0
vpshufd $0x0,%xmm0,%xmm0
vinserti128 $0x1,%xmm0,%ymm0,%ymm0
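
For reference, here is a scalar model of what the toy workload computes, assuming 8 int lanes for the 256-bit shape (which is what the replicate8I node name suggests). ScalarWorkload and LANES are names made up for this sketch; it does not use the Vector API at all:

```java
import java.util.Arrays;

// Scalar model of workload256: broadcast i into 8 int lanes, then sum the lanes.
// ScalarWorkload and LANES are illustrative names, not part of the patch.
final class ScalarWorkload {
    static final int LANES = 256 / Integer.SIZE; // 8 lanes of 32 bits each

    static int workload256(int i) {
        int[] lanes = new int[LANES];
        Arrays.fill(lanes, i);        // broadcast
        int sum = 0;
        for (int lane : lanes) {      // sumAll
            sum += lane;              // plain int addition, wraps on overflow
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(workload256(5)); // prints 40
    }
}
```

In other words, the result is just 8 * i in wrapping int arithmetic, whichever broadcast variant is used to get there.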

JDK patch w/ broadcast implementation:

http://cr.openjdk.java.net/~vlivanov/panama/vector.generalized_intrinsics.broadcast

(Lots of code duplication. Deliberately kept intrinsic variants separate.)
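
To illustrate why the variants are interchangeable from the caller's point of view, here is a rough plain-Java sketch of the three broadcast entry points (type-specific, boxed, coerced to long) with only a Java fallback and no intrinsic. All names below (BroadcastVariants, fill, viaBoxed, etc.) are hypothetical stand-ins and an int[] plays the role of the vector; this is not the code from the patch:

```java
import java.util.Arrays;
import java.util.function.Function;
import java.util.function.IntFunction;
import java.util.function.LongFunction;

// Plain-Java stand-ins for three ways to pass the broadcast constant.
// Illustrative only: an int[] of 8 lanes stands in for a 256-bit int vector.
final class BroadcastVariants {
    static final int LANES = 8;

    // Type-specific: one entry point per element type.
    static int[] broadcastInt(int con, IntFunction<int[]> defaultImpl) {
        return defaultImpl.apply(con);
    }

    // Boxed: the constant travels as a Number and is unboxed by the implementation.
    static int[] broadcastBoxed(Number con, Function<Number, int[]> defaultImpl) {
        return defaultImpl.apply(con);
    }

    // Coerced: the constant travels as 64 bits and is narrowed by the implementation.
    static int[] broadcastCoerced(long bits, LongFunction<int[]> defaultImpl) {
        return defaultImpl.apply(bits);
    }

    // Shared fallback: replicate e into every lane.
    static int[] fill(int e) {
        int[] lanes = new int[LANES];
        Arrays.fill(lanes, e);
        return lanes;
    }

    static int[] viaSpecific(int e) {
        return broadcastInt(e, BroadcastVariants::fill);
    }

    static int[] viaBoxed(int e) {
        return broadcastBoxed(Integer.valueOf(e), c -> fill(c.intValue()));
    }

    static int[] viaCoerced(int e) {
        // int -> long widening is lossless, so the (int) cast recovers e exactly.
        return broadcastCoerced((long) e, l -> fill((int) l));
    }
}
```

All three return the same lanes for any int input. The open question is only how much work the JIT has to do to see through the box or the widening.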

> It would be a nice stretch goal to ensure that a scalar loop is as fast as the vector encoded loop when no intrinsics are applied or supported (the graceful degradation goal). It may be unrealistic but it still might influence what we do.

Good point. My understanding is that no matter what we do in default
implementations, the prerequisite for competitive performance is C2
being able to "scalarize away" vector boxes and work directly on their
components. Basically, it's the same problem we fight with in the
"intrinsified" case. I believe vector support in the JVM can be
generalized to cover the non-intrinsified case as well, but there are
some additional complications (e.g., memory layout of vectors w.r.t.
alias analysis in C2).
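
To make the "scalarize away" point a bit more concrete, here is a minimal sketch of a boxed vector with a lambda-based default implementation, next to the scalar loop it would have to compete with. Box4, its bOp (simplified here to take no lane index), and ScalarizeSketch are made-up names for illustration. Whether C2 actually removes the temporaries depends on inlining and escape analysis; nothing in the code guarantees it:

```java
import java.util.function.IntBinaryOperator;

// Illustrative only: a tiny immutable 4-lane int "vector" box.
final class Box4 {
    final int a, b, c, d;

    Box4(int a, int b, int c, int d) {
        this.a = a;
        this.b = b;
        this.c = c;
        this.d = d;
    }

    // Default (non-intrinsic) lane-wise implementation driven by a lambda.
    Box4 bOp(Box4 o, IntBinaryOperator f) {
        return new Box4(f.applyAsInt(a, o.a), f.applyAsInt(b, o.b),
                        f.applyAsInt(c, o.c), f.applyAsInt(d, o.d));
    }

    Box4 add(Box4 o) {
        return bOp(o, (x, y) -> x + y);
    }

    int sumAll() {
        return a + b + c + d;
    }
}

final class ScalarizeSketch {
    // Boxed loop: creates fresh Box4 instances every iteration unless the JIT
    // manages to replace them with their four int components.
    // Assumes xs.length is a multiple of 4.
    static int boxedSum(int[] xs) {
        Box4 acc = new Box4(0, 0, 0, 0);
        for (int i = 0; i < xs.length; i += 4) {
            acc = acc.add(new Box4(xs[i], xs[i + 1], xs[i + 2], xs[i + 3]));
        }
        return acc.sumAll();
    }

    // Scalar loop computing the same value with no boxes at all.
    static int scalarSum(int[] xs) {
        int sum = 0;
        for (int x : xs) {
            sum += x;
        }
        return sum;
    }
}
```

Both methods return the same value; the graceful degradation goal amounts to boxedSum compiling down to something close to scalarSum when no intrinsic is available.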

Best regards,
Vladimir Ivanov

>>
>> There are other options:
>>
>> (1) box the value and rely on EA
>>
>>     static
>>     <V extends Vector<?,?>>
>>     V broadcastBoxed(Class<?> vectorClass, int elementType, int bitSize,
>>                      Number con, /*Vector.Mask<E,S> m,*/
>>                      Function<Number,V> defaultImpl) {
>>         return defaultImpl.apply(con);
>>     }
>>
>> final class Int256Vector extends IntVector<Shapes.S256Bit> {
>> ...
>>   public Int256Vector broadcast(int e) {
>>       return VectorIntrinsics.broadcastBoxed(
>>                 Int256Vector.class, VECTOR_ELEM_INT, 256,
>>                 (Number)e,
>>                 (c -> SPECIES.op(k -> (int)c)));
>>   }
>>
>> The primitive value is boxed first, and the intrinsic should then unbox it itself. Once that's done and the temporary box isn't used anymore, EA should be able to eliminate it.
>>
>> (2) coerce all values to 64 bits (e.g., to long)
>>
>>     static
>>     <V extends Vector<?,?>>
>>     V broadcastCoerced(Class<?> c, int elem, int size,
>>                                  long bits,
>>                                  LongFunction<V> defaultImpl) {
>>         return defaultImpl.apply(bits);
>>     }
>>   final class Double256Vector extends DoubleVector<Shapes.S256Bit> {
>> ...
>>     public Double256Vector broadcast(double e) {
>>         return VectorIntrinsics.broadcastCoerced(
>>                   Double256Vector.class, VECTOR_ELEM_DOUBLE, 256,
>>                   Double.doubleToLongBits(e),
>>                   (l -> SPECIES.op(i -> Double.longBitsToDouble(l))));
>>     }
>>
>> No boxing, but C2 should be able to eliminate redundant primitive conversions (like, double -> long -> double) to produce the same code as for broadcast(..., double d, ...).
>>
>> Best regards,
>> Vladimir Ivanov
>>
>>>>   @ForceInline
>>>>   public Double128Vector add(Vector<Double,Shapes.S128Bit> o) {
>>>>       return (Double128Vector) VectorIntrinsics.binary(
>>>>               VECTOR_OP_ADD, Double128Vector.class, VECTOR_ELEM_DOUBLE, 128,
>>>>               this, (Double128Vector)o,
>>>>               (v1, v2) -> ((Double128Vector)v1).bOp(v2, (i, a, b) -> (a + b)));
>>>>   }
>>>>
>>>>
>>>> Indeed, C2 can eliminate the indirection, but the main benefit would be to optimize the implementation for the particular vector shape and use it (Int256Vector.add vs IntVector.bOp()).
>>>>
>>>> Best regards,
>>>> Vladimir Ivanov
>>>>
>>>>>     Vector binaryOp(int opr, int elem, int size, Vector v1, Vector v2, VectorBinaryOp javaOp)
>>>>>    (Int256Vector) VectorIntrinsics.binaryOp(VECTOR_OP_ADD,
>>>>>                                              VECTOR_ELEM_INT,
>>>>>                                              256,
>>>>>                                              this,
>>>>>                                              (Int256Vector)o,
>>>>>                                              this::_add);
>>>>
>>>>
>>>>
>>>>>> On 14 Nov 2017, at 14:57, Vladimir Ivanov <vladimir.x.ivanov at oracle.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> FYI I did a quick experiment with more generic vector intrinsics and wanted to share the first results. The motivation was to explore possible reduction in the number of intrinsics needed to support Vector API in the JVM.
>>>>>>
>>>>>> The first candidates were binary arithmetic operations on vectors:
>>>>>>
>>>>>> http://cr.openjdk.java.net/~vlivanov/panama/vector.generalized_intrinsics/
>>>>>>
>>>>>> Overall, it looks promising.
>>>>>>
>>>>>> It can be easily extended to masked variants (e.g., by adding an additional argument for the mask and passing null for the all-ones mask) and operation variants (e.g., saturated add).
>>>>>>
>>>>>> Dispatching is simple on JVM side. One question is how to represent vector shape (element type + vector size). There are different options:
>>>>>>
>>>>>>   (1) pass parameters explicitly (prototyped):
>>>>>>
>>>>>>     Vector binaryOp(int opr, int elem, int size, Vector v1, Vector v2)
>>>>>>
>>>>>>     (Int256Vector) VectorIntrinsics.binaryOp(VECTOR_OP_ADD,
>>>>>>                                              VECTOR_ELEM_INT,
>>>>>>                                              256,
>>>>>>                                              this,
>>>>>>                                              (Int256Vector)o);
>>>>>>
>>>>>>
>>>>>>   (2) pass concrete vector class and extract vector shape info from it
>>>>>>
>>>>>>     Vector binaryOp(int opr, Class vector_box, Vector v1, Vector v2)
>>>>>>
>>>>>>
>>>>>>   (3) pass shape implicitly: extract the shape from the class of the first vector (always exact class):
>>>>>>
>>>>>> final class Int256Vector extends IntVector<Shapes.S256Bit> {
>>>>>> ...
>>>>>>     @Override
>>>>>>     @ForceInline
>>>>>>     public Int256Vector add(Vector<Integer,Shapes.S256Bit> v) {
>>>>>>         return (Int256Vector) VectorIntrinsics.binaryOp(VECTOR_OP_ADD,
>>>>>>                                                          VECTOR_ELEM_INT,
>>>>>>                                                          256,
>>>>>>                                                          this,
>>>>>>                                                          (Int256Vector)v);
>>>>>>     }
>>>>>>
>>>>>>
>>>>>> Some considerations:
>>>>>>
>>>>>> #1: explicit, and the info is trivial to extract on the C2 side. The downside is that it requires additional non-trivial steps to find the exact vector box class (see get_exact_klass_for_vector_box() and ctx->find_klass(vector_klass_name) there): the fact that vector classes aren't part of java.base but of jdk.incubator.vector complicates class lookup a bit.
>>>>>>
>>>>>> #2: vector shape is clearly documented in the code, but requires some additional steps (or hard-coded info in the JVM) to extract different pieces.
>>>>>>
>>>>>> #3: relies on implicit convention that JIT knows exact type for "this": that's the case if the usage is in concrete vector class and first vector argument is "this".
>>>>>>
>>>>>> Personally, I'm in favor of #2, but other options look attractive as well.
>>>>>>
>>>>>> Best regards,
>>>>>> Vladimir Ivanov
>