VectorAPI: SubAll intrinsics for byte, short, float and double

Tue May 1 16:29:52 UTC 2018

> On May 1, 2018, at 4:33 AM, Halimi, Jean-Philippe <jean-philippe.halimi at intel.com> wrote:
> 
> Hi Paul,
> 
> From my understanding, it is possible to reduce subAll() into addAll().neg() in the case of integral types. Horizontal arithmetic is allowed in this case because the order of operations does not matter. In fact, the add reduction we transform subAll into does exactly that.

Excellent!

The second patch (webrev_subAll_BS_v1.0) looks good.
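
To make the integral argument concrete, here is a minimal scalar sketch (plain Java arrays standing in for vector lanes, not the actual intrinsic). It assumes subAll is defined as 0 - a - b - c - ..., which this thread does not spell out; the class and method names are hypothetical. Because byte arithmetic wraps modulo 2^8, the sequential subtraction and the negated add reduction agree even on overflow:

```java
final class IntegralSubAllSketch {
    // Assumed definition: start from zero and subtract each lane in order.
    // The (byte) cast reproduces the wrap-around of lane-wise byte arithmetic.
    public static byte sequentialSubAll(byte[] lanes) {
        byte acc = 0;
        for (byte x : lanes) {
            acc = (byte) (acc - x);
        }
        return acc;
    }

    // The same result expressed as an add reduction followed by a negation.
    public static byte negatedAddAll(byte[] lanes) {
        byte acc = 0;
        for (byte x : lanes) {
            acc = (byte) (acc + x);
        }
        return (byte) -acc;
    }

    public static void main(String[] args) {
        byte[] lanes = {100, 100, 100, -128}; // the running sum overflows a byte
        System.out.println(sequentialSubAll(lanes));
        System.out.println(negatedAddAll(lanes));
    }
}
```

Both methods yield the same byte for any input, since addition modulo 2^8 is associative and commutative; that is the property FP addition lacks.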

> For FP, you are right saying we need to keep the order because of the limited precision.

Right, but we also have some wiggle room to change that order, trading predictable results (compared to the scalar operation) for speed. It all depends on how we specify the behavior and whether we require some configuration for loosening or controlling precision.

> In this case, I am not aware of a data parallel approach we could use to speed up the computation.
> 

e.g.

  v1 = a b c d

  v2 = shuffle v1 = c d c d

  v3 = v1 + v2 = (a + c) (b + d) … … 

  v4 = shuffle v3 = (b + d) (b + d) … …

  v5 = v4 + v3 = ((a + c) + (b + d)) … … … 

For FP we may need to retain sequential and data parallel approaches.
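
As a rough illustration of why both may be needed, here is a scalar emulation of the two orders (plain arrays instead of vector registers; the class and method names are hypothetical, and the lane count is assumed to be a power of two). The pairwise version mirrors the shuffle-and-add steps above, folding the upper half of the lanes onto the lower half:

```java
final class ReductionOrderSketch {
    // Sequential order: ((a + b) + c) + d, the same as a scalar loop.
    public static float sequentialSum(float[] lanes) {
        float acc = 0.0f;
        for (float x : lanes) {
            acc += x;
        }
        return acc;
    }

    // Data parallel order: each step adds the upper half onto the lower half,
    // emulating "shuffle, then lane-wise add". For four lanes this computes
    // (a + c) + (b + d). Assumes lanes.length is a power of two.
    public static float pairwiseSum(float[] lanes) {
        float[] v = lanes.clone();
        for (int half = v.length / 2; half >= 1; half /= 2) {
            for (int i = 0; i < half; i++) {
                v[i] = v[i] + v[i + half];
            }
        }
        return v[0];
    }

    public static void main(String[] args) {
        float[] lanes = {1e8f, 1.0f, -1e8f, 1.0f};
        System.out.println(sequentialSum(lanes)); // 1.0: the first 1.0f is lost when added to 1e8f
        System.out.println(pairwiseSum(lanes));   // 2.0: the large values cancel before the small ones are added
    }
}
```

With this input the two orders disagree (here the pairwise result happens to be the exact one), which is the kind of divergence from the scalar operation we would have to specify.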

There is probably a rich vein of academic literature on this topic, perhaps including vectorized Kahan summation (something that, if important, we should IMHO add as a separate operation returning two values: the sum and the sum with compensation).
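
For reference, a minimal scalar sketch of Kahan (compensated) summation. This is not a vectorized version and not a proposed API; the class name, method names, and the two-element return shape (sum plus the final compensation term) are assumptions for illustration only:

```java
final class KahanSketch {
    // Returns {sum, compensation}. The compensation term tracks the
    // low-order bits that were rounded away by the previous additions.
    public static float[] kahanSum(float[] values) {
        float sum = 0.0f;
        float c = 0.0f;
        for (float x : values) {
            float y = x - c;    // apply the correction carried from the last step
            float t = sum + y;  // low-order bits of y may be lost here
            c = (t - sum) - y;  // recover what was lost (with the sign flipped)
            sum = t;
        }
        return new float[] {sum, c};
    }

    // Plain sequential sum, for comparison.
    public static float naiveSum(float[] values) {
        float sum = 0.0f;
        for (float x : values) {
            sum += x;
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] values = {1e8f, 1f, 1f, 1f, 1f, 1f, 1f, 1f, 1f};
        System.out.println(naiveSum(values));    // each 1f is rounded away against 1e8f
        System.out.println(kahanSum(values)[0]); // the compensation recovers them
    }
}
```

A vectorized variant would presumably keep one (sum, compensation) pair per lane and combine the lanes at the end, but that design is not settled here.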

> Let me know if I missed your point.
> 

You got it.

Thanks,
Paul.

> Thanks,
> 
> Jp
> 
> -----Original Message-----
> From: Paul Sandoz [mailto:paul.sandoz at oracle.com] 
> Sent: Monday, April 30, 2018 4:51 PM
> To: Halimi, Jean-Philippe <jean-philippe.halimi at intel.com>
> Cc: panama-dev at openjdk.java.net
> Subject: Re: VectorAPI: SubAll intrinsics for byte, short, float and double
> 
> Hi Jp,
> 
> Looks ok. Can we derive subAll from addAll().neg()? The additional negation might be an acceptable cost, but I am uncertain of the FP behavior.
> 
> IIUC, for reductive addition or subtraction, the accumulated value is kept in the first lane of the destination register and the src lane element to subtract is shuffled down for each iteration. In effect it preserves the sequential order, but I wonder if there are faster data parallel approaches if we are relaxed about rounding producing different results?
> 
> Thanks,
> Paul.
> 
>> On Apr 30, 2018, at 10:10 AM, Halimi, Jean-Philippe <jean-philippe.halimi at intel.com> wrote:
>> 
>> Hi all,
>> 
>> 
>> 
>> I would like to share a patch adding support for the subAll intrinsic for byte, short, long, float and double types in VectorAPI.
>> 
>> 
>> 
>> Could you please review the two following patches?
>> 
>> http://cr.openjdk.java.net/~jphalimi/webrev_subAll_FP_v1.1/
>> 
>> http://cr.openjdk.java.net/~jphalimi/webrev_subAll_BS_v1.0/
>> 
>> 
>> 
>> Thank you,
>> 
>> 
>> 
>> Jp
>> 
>