[vector-api][X86] Intrinsic creation for missing cases of iotaShuffle API

Wed Nov 13 22:00:13 UTC 2019

Hi John,

I like your suggestion to check if the step is constant and has a value 1 in library_call.cpp before generating the MulI node. Likewise, the AddI node generation can be avoided if the start is 0.

The Blend node generation can only be removed if the start also is 0 along with step being 1 at parsing time. If start is non zero, then index can go out of range. Similarly, the do_wrap path can be optimized for start=0, step=1. 

I don’t think any strength reduction is happening today on vector nodes after generation. 

Best Regards,
Sandhya

-----Original Message-----
From: panama-dev <panama-dev-bounces at openjdk.java.net> On Behalf Of John Rose
Sent: Wednesday, November 13, 2019 11:37 AM
To: Bhateja, Jatin <jatin.bhateja at intel.com>
Cc: panama-dev at openjdk.java.net
Subject: Re: [vector-api][X86] Intrinsic creation for missing cases of iotaShuffle API

On Nov 13, 2019, at 7:53 AM, Bhateja, Jatin <jatin.bhateja at intel.com> wrote:
> 
> Hi All,
> 
> Please find below a patch which intrinsifies following missing cases 
> of iotaShuffle :-
> 
>  1.  scale > 1.
>  2.  wrap = false.
> 
> http://cr.openjdk.java.net/~jbhateja/vectorIntrinsics/IotaShuffle/webr
> ev.00/

Thank you Jatin for fixing some of my “FIXME” comments!

In AbstractSpecies.java there seems to be a funny 2-space indentation.
Please align it to the ambient 4-space convention.

In loopTransform.cpp I would encourage you to make your change more incremental.
Your change re-orders the existing predicate terms while inserting the new term at the beginning.  It is more conventional to add new logic at the end, while retaining the flow of the old logic.  I think breaking the lines *is* a good idea, so something like this is good:

-      if (opc == Op_StoreP || opc == Op_StoreN || opc == Op_StoreNKlass || opc == Op_StoreCM) {
+-      if (opc == Op_StoreVector || opc == Op_StoreP || 
++      if (opc == Op_StoreP ||
+           opc == Op_StoreN || opc == Op_StoreNKlass ||
+-          opc == Op_StoreCM) {
++          opc == Op_StoreCM || opc == Op_StoreVector) {

My biggest comment (which is still pretty minor) is about VectorIntrinsics.java.

The handling of arguments in the intrinsic is irregular and I think this is the right moment to make it align better to the other intrinsics.  Specifically, the arguments to the lambda should precede the lambda, in the same order as they are passed to the lambda.

This means that the lambda should take the following form:

     interface ShuffleIotaOperation<E> {

-        VectorShuffle<E> apply(int step, int length);
+-        VectorShuffle<E> apply(int start, int step, int length);
++        VectorShuffle<E> apply(int length, int start, int step, int 
++ wrap);

     }

I don’t know why length and start got swapped in the first place; now is the time to fix it.  Also, wrap should be present to the lambda.  I suggest making wrap a true boolean; there’s no reason to lower it to an int, and that loses information about the range of the value.

You can still use two lambdas if you want for the fallback, one for each value of wrap.  But I suspect using one would be cleaner.

(I’ll defer to Vladimir on this; he may have more information about what’s right here.)

The above adjustments might make it convenient to simplify the AbstractVector API to have just one method with a boolean rather than two methods.

A couple comments about library_call.cpp:  There are two places where mod_val is computed as num_elem-1 and then applied with Op_AndI.  The intrinsic must check is_power_of_2(num_elem) and fail the intrinsic if that happens.  This may happen with ARM SVE—in theory.

Also, if step is the constant 1 then there is no need to perform the Op_MulI or to build the VectorBlendNode which handles overflow at the end; all indexes will be in range.  This case should be covered even though the Java code attempts to cover it also.  The optimizer might discover a constant value of
step=1 that the Java code misses.

There might be a couple kinds of technical debt here:  First, we want to strength-reduce Op_MulI to Op_LShiftI (or nop) when possible.  Does this happen downstream of the library_call.cpp code?  Second, we want to keep track of what happens when the multiplication overflows because step is large, and/or the lane size is small.  That behavior has to be specified and followed both in Java code and the intrinsic—and tested.  (Also, the VectorBlendNode should nop itself away if the step turns out to be constant 1 after parsing.  Not sure how hard this is.)

Reviewed.

— John