[vector-api][X86] Intrinsic creation for missing cases of iotaShuffle API

Tue Dec 3 04:44:40 UTC 2019

On Nov 22, 2019, at 4:20 AM, Bhateja, Jatin <jatin.bhateja at intel.com> wrote:
> 
> Hi  John, Sandhya,
> 
> Please find below an updated patch which does suggested strength reductions.

Thanks for putting this in, and also addressing Sandhya’s comments.
It looks good.

> http://cr.openjdk.java.net/~jbhateja/vectorIntrinsics/IotaShuffle/webrev.01/
> 
> My understanding is that vector-API is a base level API and will be used intelligently by the users,
> thus apart from macro APIs like iotaShuffle , addIndex (may be few others) compiler based
> strength reduction may not be needed. 
> 
> For auto-vectorization flow, scalar graph which is fed to SuperWord is already strength reduced
> by construction since Node*::Identity routines does the necessary job.

That’s fine as long as all the inputs are cleaned up through the scalar
optimizations first, or (as you say above) we can assume that if inputs
are initially vectors (via the Vector API) the users have been smart
about hand-optimizing the scalar aspects of the operations.

But I think we will need to revisit this at issue at some point, because
it’s usually wrong (at least in Java) to depend on upstream optimizations.
So many aspects of Java code are dynamic that there is often extra
optimization opportunity late in the pipeline (i.e., when vectors
are being optimized), that only became visible after earlier low-level
optimizations like constant folding.

Maybe if we move Vector API use cases into the back end of higher-level
frameworks, like expression tree compilers, we can defer late optimization
of vector lane operations, under the assumption that the framework,
like the “intelligent user”, will somehow squeeze most of the optimization
opportunities out.  But I’ve never seen this sort of assumption pan out
in practice.  One way things can go wrong is when two layers of the
system are trying to optimize, and the earlier layer, to get a few wins,
will obscure the code so that latter layer can’t see what it’s doing anymore.

So I would prefer, in the long run, to have the latter parts of the JIT perform
as many optimizations as possible, iteratively to a fixpoint.  I think this is
feasible (not necessarily easy) with vectors and their lanewise scalar operations.

A good example where early optimization falters is when control flow obscures
the value of a parameter K which eventually (later on, perhaps after more inlining)
turns out to be a constant.  In that case the constant may allow strength reduction
(such K=0 or K=1 or K=2^L for multiplication).  But if the constant is already
broadcast to vector lanes, and multiply instruction selection is firm, there’s
no path to strength reduce to a constant (K=0) or copy (K=1) or shift (K=2^L).

I hope we can get to a place where to do robust type narrowing (all the way
down to constants) on all lanes of every vector, just as we do today for scalars.
It will take several steps to get there but it’s desirable and doable IMO.

Thanks for doing this work on iota, and for the interesting discussion.

— John

> Best Regards,
> Jatin
> 
>> -----Original Message-----
>> From: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
>> Sent: Thursday, November 14, 2019 3:30 AM
>> To: John Rose <john.r.rose at oracle.com>; Bhateja, Jatin
>> <jatin.bhateja at intel.com>
>> Cc: panama-dev at openjdk.java.net
>> Subject: RE: [vector-api][X86] Intrinsic creation for missing cases of iotaShuffle
>> API
>> 
>> Hi John,
>> 
>> I like your suggestion to check if the step is constant and has a value 1 in
>> library_call.cpp before generating the MulI node. Likewise, the AddI node
>> generation can be avoided if the start is 0.
>> 
>> The Blend node generation can only be removed if the start also is 0 along
>> with step being 1 at parsing time. If start is non zero, then index can go out of
>> range. Similarly, the do_wrap path can be optimized for start=0, step=1.
>> 
>> I don’t think any strength reduction is happening today on vector nodes after
>> generation.
>> 
>> Best Regards,
>> Sandhya
>> 
>> -----Original Message-----
>> From: panama-dev <panama-dev-bounces at openjdk.java.net> On Behalf Of
>> John Rose
>> Sent: Wednesday, November 13, 2019 11:37 AM
>> To: Bhateja, Jatin <jatin.bhateja at intel.com>
>> Cc: panama-dev at openjdk.java.net
>> Subject: Re: [vector-api][X86] Intrinsic creation for missing cases of iotaShuffle
>> API
>> 
>> On Nov 13, 2019, at 7:53 AM, Bhateja, Jatin <jatin.bhateja at intel.com> wrote:
>>> 
>>> Hi All,
>>> 
>>> Please find below a patch which intrinsifies following missing cases
>>> of iotaShuffle :-
>>> 
>>> 1.  scale > 1.
>>> 2.  wrap = false.
>>> 
>>> http://cr.openjdk.java.net/~jbhateja/vectorIntrinsics/IotaShuffle/webr
>>> ev.00/
>> 
>> Thank you Jatin for fixing some of my “FIXME” comments!
>> 
>> In AbstractSpecies.java there seems to be a funny 2-space indentation.
>> Please align it to the ambient 4-space convention.
>> 
>> In loopTransform.cpp I would encourage you to make your change more
>> incremental.
>> Your change re-orders the existing predicate terms while inserting the new
>> term at the beginning.  It is more conventional to add new logic at the end,
>> while retaining the flow of the old logic.  I think breaking the lines *is* a good
>> idea, so something like this is good:
>> 
>> -      if (opc == Op_StoreP || opc == Op_StoreN || opc == Op_StoreNKlass ||
>> opc == Op_StoreCM) {
>> +-      if (opc == Op_StoreVector || opc == Op_StoreP ||
>> ++      if (opc == Op_StoreP ||
>> +           opc == Op_StoreN || opc == Op_StoreNKlass ||
>> +-          opc == Op_StoreCM) {
>> ++          opc == Op_StoreCM || opc == Op_StoreVector) {
>> 
>> My biggest comment (which is still pretty minor) is about VectorIntrinsics.java.
>> 
>> The handling of arguments in the intrinsic is irregular and I think this is the
>> right moment to make it align better to the other intrinsics.  Specifically, the
>> arguments to the lambda should precede the lambda, in the same order as
>> they are passed to the lambda.
>> 
>> This means that the lambda should take the following form:
>> 
>>     interface ShuffleIotaOperation<E> {
>> 
>> -        VectorShuffle<E> apply(int step, int length);
>> +-        VectorShuffle<E> apply(int start, int step, int length);
>> ++        VectorShuffle<E> apply(int length, int start, int step, int
>> ++ wrap);
>> 
>>     }
>> 
>> I don’t know why length and start got swapped in the first place; now is the
>> time to fix it.  Also, wrap should be present to the lambda.  I suggest making
>> wrap a true boolean; there’s no reason to lower it to an int, and that loses
>> information about the range of the value.
>> 
>> You can still use two lambdas if you want for the fallback, one for each value
>> of wrap.  But I suspect using one would be cleaner.
>> 
>> (I’ll defer to Vladimir on this; he may have more information about what’s
>> right here.)
>> 
>> The above adjustments might make it convenient to simplify the
>> AbstractVector API to have just one method with a boolean rather than two
>> methods.
>> 
>> A couple comments about library_call.cpp:  There are two places where
>> mod_val is computed as num_elem-1 and then applied with Op_AndI.  The
>> intrinsic must check is_power_of_2(num_elem) and fail the intrinsic if that
>> happens.  This may happen with ARM SVE—in theory.
>> 
>> Also, if step is the constant 1 then there is no need to perform the Op_MulI or
>> to build the VectorBlendNode which handles overflow at the end; all indexes
>> will be in range.  This case should be covered even though the Java code
>> attempts to cover it also.  The optimizer might discover a constant value of
>> step=1 that the Java code misses.
>> 
>> There might be a couple kinds of technical debt here:  First, we want to
>> strength-reduce Op_MulI to Op_LShiftI (or nop) when possible.  Does this
>> happen downstream of the library_call.cpp code?  Second, we want to keep
>> track of what happens when the multiplication overflows because step is
>> large, and/or the lane size is small.  That behavior has to be specified and
>> followed both in Java code and the intrinsic—and tested.  (Also, the
>> VectorBlendNode should nop itself away if the step turns out to be constant 1
>> after parsing.  Not sure how hard this is.)
>> 
>> Reviewed.
>> 
>> — John
>