[vector-api][X86] Intrinsic creation for missing cases of iotaShuffle API

Wed Nov 13 19:36:56 UTC 2019

On Nov 13, 2019, at 7:53 AM, Bhateja, Jatin <jatin.bhateja at intel.com> wrote:
> 
> Hi All,
> 
> Please find below a patch which intrinsifies following missing cases of iotaShuffle :-
> 
>  1.  scale > 1.
>  2.  wrap = false.
> 
> http://cr.openjdk.java.net/~jbhateja/vectorIntrinsics/IotaShuffle/webrev.00/

Thank you Jatin for fixing some of my “FIXME” comments!

In AbstractSpecies.java there seems to be a funny 2-space indentation.
Please align it to the ambient 4-space convention.

In loopTransform.cpp I would encourage you to make your change more incremental.
Your change re-orders the existing predicate terms while inserting the new term at the
beginning.  It is more conventional to add new logic at the end, while retaining the flow
of the old logic.  I think breaking the lines *is* a good idea, so something like this is good:

-      if (opc == Op_StoreP || opc == Op_StoreN || opc == Op_StoreNKlass || opc == Op_StoreCM) {
+-      if (opc == Op_StoreVector || opc == Op_StoreP || 
++      if (opc == Op_StoreP || 
+           opc == Op_StoreN || opc == Op_StoreNKlass ||
+-          opc == Op_StoreCM) {
++          opc == Op_StoreCM || opc == Op_StoreVector) {

My biggest comment (which is still pretty minor) is about VectorIntrinsics.java.

The handling of arguments in the intrinsic is irregular and I think this is the right moment
to make it align better to the other intrinsics.  Specifically, the arguments to the lambda
should precede the lambda, in the same order as they are passed to the lambda.

This means that the lambda should take the following form:

     interface ShuffleIotaOperation<E> {

-        VectorShuffle<E> apply(int step, int length);
+-        VectorShuffle<E> apply(int start, int step, int length);
++        VectorShuffle<E> apply(int length, int start, int step, int wrap);

     }

I don’t know why length and start got swapped in the first place; now is the
time to fix it.  Also, wrap should be present to the lambda.  I suggest making
wrap a true boolean; there’s no reason to lower it to an int, and that loses
information about the range of the value.

You can still use two lambdas if you want for the fallback, one for each value
of wrap.  But I suspect using one would be cleaner.

(I’ll defer to Vladimir on this; he may have more information about what’s right here.)

The above adjustments might make it convenient to simplify the AbstractVector
API to have just one method with a boolean rather than two methods.

A couple comments about library_call.cpp:  There are two places where mod_val
is computed as num_elem-1 and then applied with Op_AndI.  The intrinsic must
check is_power_of_2(num_elem) and fail the intrinsic if that happens.  This
may happen with ARM SVE—in theory.

Also, if step is the constant 1 then there is no need to perform the Op_MulI or
to build the VectorBlendNode which handles overflow at the end; all indexes
will be in range.  This case should be covered even though the Java code
attempts to cover it also.  The optimizer might discover a constant value of
step=1 that the Java code misses.

There might be a couple kinds of technical debt here:  First, we want to
strength-reduce Op_MulI to Op_LShiftI (or nop) when possible.  Does
this happen downstream of the library_call.cpp code?  Second, we
want to keep track of what happens when the multiplication overflows
because step is large, and/or the lane size is small.  That behavior has
to be specified and followed both in Java code and the intrinsic—and
tested.  (Also, the VectorBlendNode should nop itself away if the step
turns out to be constant 1 after parsing.  Not sure how hard this is.)

Reviewed.

— John