[vector-api][X86] Intrinsic creation for missing cases of iotaShuffle API

Viswanathan, Sandhya sandhya.viswanathan at intel.com
Sun Nov 24 03:13:20 UTC 2019


Hi Jatin,

Thanks for incorporating the suggestions.  Couple of improvements:
1) In library_call.cpp, if num_elem is not power of 2, return false (no intrinsic).
2) In library_call.cpp, lines 7151 to 7166 only needs to be done if (step != 1 || start != 0) or if either of them is not a constant.

Best Regards,
Sandhya

-----Original Message-----
From: Bhateja, Jatin <jatin.bhateja at intel.com> 
Sent: Friday, November 22, 2019 4:20 AM
To: John Rose <john.r.rose at oracle.com>; Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
Cc: panama-dev at openjdk.java.net
Subject: RE: [vector-api][X86] Intrinsic creation for missing cases of iotaShuffle API

Hi  John, Sandhya,

Please find below an updated patch which does suggested strength reductions.

http://cr.openjdk.java.net/~jbhateja/vectorIntrinsics/IotaShuffle/webrev.01/

My understanding is that vector-API is a base level API and will be used intelligently by the users, thus apart from macro APIs like iotaShuffle , addIndex (may be few others) compiler based strength reduction may not be needed. 

For auto-vectorization flow, scalar graph which is fed to SuperWord is already strength reduced by construction since Node*::Identity routines does the necessary job.

Best Regards,
Jatin

> -----Original Message-----
> From: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
> Sent: Thursday, November 14, 2019 3:30 AM
> To: John Rose <john.r.rose at oracle.com>; Bhateja, Jatin 
> <jatin.bhateja at intel.com>
> Cc: panama-dev at openjdk.java.net
> Subject: RE: [vector-api][X86] Intrinsic creation for missing cases of 
> iotaShuffle API
> 
> Hi John,
> 
> I like your suggestion to check if the step is constant and has a 
> value 1 in library_call.cpp before generating the MulI node. Likewise, 
> the AddI node generation can be avoided if the start is 0.
> 
> The Blend node generation can only be removed if the start also is 0 
> along with step being 1 at parsing time. If start is non zero, then 
> index can go out of range. Similarly, the do_wrap path can be optimized for start=0, step=1.
> 
> I don’t think any strength reduction is happening today on vector 
> nodes after generation.
> 
> Best Regards,
> Sandhya
> 
> -----Original Message-----
> From: panama-dev <panama-dev-bounces at openjdk.java.net> On Behalf Of 
> John Rose
> Sent: Wednesday, November 13, 2019 11:37 AM
> To: Bhateja, Jatin <jatin.bhateja at intel.com>
> Cc: panama-dev at openjdk.java.net
> Subject: Re: [vector-api][X86] Intrinsic creation for missing cases of 
> iotaShuffle API
> 
> On Nov 13, 2019, at 7:53 AM, Bhateja, Jatin <jatin.bhateja at intel.com> wrote:
> >
> > Hi All,
> >
> > Please find below a patch which intrinsifies following missing cases 
> > of iotaShuffle :-
> >
> >  1.  scale > 1.
> >  2.  wrap = false.
> >
> > http://cr.openjdk.java.net/~jbhateja/vectorIntrinsics/IotaShuffle/we
> > br
> > ev.00/
> 
> Thank you Jatin for fixing some of my “FIXME” comments!
> 
> In AbstractSpecies.java there seems to be a funny 2-space indentation.
> Please align it to the ambient 4-space convention.
> 
> In loopTransform.cpp I would encourage you to make your change more 
> incremental.
> Your change re-orders the existing predicate terms while inserting the 
> new term at the beginning.  It is more conventional to add new logic 
> at the end, while retaining the flow of the old logic.  I think 
> breaking the lines *is* a good idea, so something like this is good:
> 
> -      if (opc == Op_StoreP || opc == Op_StoreN || opc == Op_StoreNKlass ||
> opc == Op_StoreCM) {
> +-      if (opc == Op_StoreVector || opc == Op_StoreP ||
> ++      if (opc == Op_StoreP ||
> +           opc == Op_StoreN || opc == Op_StoreNKlass ||
> +-          opc == Op_StoreCM) {
> ++          opc == Op_StoreCM || opc == Op_StoreVector) {
> 
> My biggest comment (which is still pretty minor) is about VectorIntrinsics.java.
> 
> The handling of arguments in the intrinsic is irregular and I think 
> this is the right moment to make it align better to the other 
> intrinsics.  Specifically, the arguments to the lambda should precede 
> the lambda, in the same order as they are passed to the lambda.
> 
> This means that the lambda should take the following form:
> 
>      interface ShuffleIotaOperation<E> {
> 
> -        VectorShuffle<E> apply(int step, int length);
> +-        VectorShuffle<E> apply(int start, int step, int length);
> ++        VectorShuffle<E> apply(int length, int start, int step, int 
> ++ wrap);
> 
>      }
> 
> I don’t know why length and start got swapped in the first place; now 
> is the time to fix it.  Also, wrap should be present to the lambda.  I 
> suggest making wrap a true boolean; there’s no reason to lower it to 
> an int, and that loses information about the range of the value.
> 
> You can still use two lambdas if you want for the fallback, one for 
> each value of wrap.  But I suspect using one would be cleaner.
> 
> (I’ll defer to Vladimir on this; he may have more information about 
> what’s right here.)
> 
> The above adjustments might make it convenient to simplify the 
> AbstractVector API to have just one method with a boolean rather than 
> two methods.
> 
> A couple comments about library_call.cpp:  There are two places where 
> mod_val is computed as num_elem-1 and then applied with Op_AndI.  The 
> intrinsic must check is_power_of_2(num_elem) and fail the intrinsic if 
> that happens.  This may happen with ARM SVE—in theory.
> 
> Also, if step is the constant 1 then there is no need to perform the 
> Op_MulI or to build the VectorBlendNode which handles overflow at the 
> end; all indexes will be in range.  This case should be covered even 
> though the Java code attempts to cover it also.  The optimizer might 
> discover a constant value of
> step=1 that the Java code misses.
> 
> There might be a couple kinds of technical debt here:  First, we want 
> to strength-reduce Op_MulI to Op_LShiftI (or nop) when possible.  Does 
> this happen downstream of the library_call.cpp code?  Second, we want 
> to keep track of what happens when the multiplication overflows 
> because step is large, and/or the lane size is small.  That behavior 
> has to be specified and followed both in Java code and the 
> intrinsic—and tested.  (Also, the VectorBlendNode should nop itself 
> away if the step turns out to be constant 1 after parsing.  Not sure 
> how hard this is.)
> 
> Reviewed.
> 
> — John




More information about the panama-dev mailing list