[vector] RFR: Support non power-of-two and 2048-bit vector length for gather load/scatter store

Thu Mar 5 07:07:37 UTC 2020

Hi Paul

Thanks for your review. I have updated the patch. Please check it.
http://cr.openjdk.java.net/~yzhang/vectorapi/vectorapi.2048npow/webrev.01/

Regards
Yang

-----Original Message-----
From: Paul Sandoz <paul.sandoz at oracle.com> 
Sent: Wednesday, March 4, 2020 10:14 AM
To: Yang Zhang <Yang.Zhang at arm.com>
Cc: panama-dev at openjdk.java.net
Subject: Re: [vector] RFR: Support non power-of-two and 2048-bit vector length for gather load/scatter store

Hi,

Thank you for applying some consistency to these operations.

VectorShape
—

Can you add a @throws to the JavaDoc of forIndexBitSize and also forBitSize?

ByteVector & ShortVector
—

Also change the mask-accepting gather implementation to use the internal mask-accepting stOp. That implementation currently fails if any of the mask bits are not set.

X-Vector.java.template
—

4389 #if[longOrDouble]
4390 
4391         // Mask for gather load
4392         @ForceInline
4393         public final VectorMask<Integer> gatherMask() {
4394             if ((Class<?>) vectorType() == $Type$MaxVector.class) {
4395                 return IntMaxVector.IntMaxMask.GATHER_MASK;
4396             }
4397             throw new AssertionError();
4398         }
4399 #end[longOrDouble]

Hmm... I am unsure about this, since it’s a general method (it should be package-private) but it only works when the vector type is the max vector.

Is there a better way to surface this, since presumably it is called only when the vector type is the max vector? Maybe be more explicit in the case of "isp.laneCount() != vsp.laneCount()"?

X-VectorBits.java
—

 699 #if[intAndMax]
 700         static final IntMaxMask GATHER_MASK = new IntMaxMask(maskLowerHalf());
 701 #end[intAndMax]

LOWER_HALF_TRUE_MASK is a wordier but more accurate description.
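
To illustrate what "lower half true" means here, a minimal plain-Java sketch follows. It models the mask as a boolean array; the class and method names are hypothetical, and it is not the patch's maskLowerHalf() nor the Vector API's mask type.

```java
import java.util.Arrays;

// Hypothetical illustration only: models a lane mask as a boolean[].
final class LowerHalfMaskSketch {

    // Returns a mask of laneCount lanes in which lanes [0, laneCount/2) are set.
    static boolean[] lowerHalfTrue(int laneCount) {
        boolean[] bits = new boolean[laneCount];
        Arrays.fill(bits, 0, laneCount / 2, true);
        return bits;
    }

    public static void main(String[] args) {
        // A 2048-bit int vector has 64 lanes; only the lower 32 are set.
        boolean[] mask = lowerHalfTrue(64);
        System.out.println("lane 31 set: " + mask[31]);
        System.out.println("lane 32 set: " + mask[32]);
    }
}
```

With 64 int lanes, only lanes 0..31 are set, which matches the 32 lanes of a long or double vector of the same bit size.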

Paul.

> On Feb 13, 2020, at 10:32 PM, Yang Zhang <Yang.Zhang at arm.com> wrote:
> 
> Hi,
> 
> I'm adding support for non-power-of-two and 2048-bit vector lengths for gather load/scatter store.
> Could you please help to review it?
> 
> Webrev: 
> http://cr.openjdk.java.net/~yzhang/vectorapi/vectorapi.2048npow/webrev.00/index.html
> No new failures with a full jtreg. 
> 
> In this patch, I made the following changes.
> 1. For gather load/scatter store, an int array is used for the index map, so a new index shape calculation function is added.
> For AArch64 SVE, the maximum index bit size is (2048/elementSize) * 32, and the index bit size grows in increments of (128/elementSize) * 32.
> 
> 2. Use a gather mask to control index vector loading for long/double gather load/scatter store.
> When the vector length is 2048 bits or a non-power-of-two, e.g. on SVE, there are
> index out of bounds failures in the long/double gather load test cases.
> Take 2048 bits as an example: in a long gather load (fromArray), the indexShape
> of the long species is S_MAX_BIT, and the lane count of the long vector is 32.
> When converting the long species to an int species, the indexShape of the int species
> is still S_MAX_BIT, but the lane count of the int vector is 64. So when
> loading the index vector (IntVector), unnecessary index data is loaded.
> If the current vector is the tail, an out of bounds failure happens.
> 
> This failure occurs only on SVE. On x86 there is no such failure because:
> i)  Byte/Short gather loads aren't intrinsified.
> ii) For x86 AVX512, the indexShape (int index map, 8 elements) of
> long512/double512 (8 elements) is initialized as S_256_BIT. For SVE with a
> 512-bit vector length, indexShape is initialized as S_256_BIT too. But for SVE
> with 2048-bit and non-power-of-two vector lengths, the failures above occur.
> 
> 3. Gather load and scatter store are a pair of similar operations, so one solution should be applied to both.
> The original Java implementations of gather load and scatter store are different:
> 
> Vector           Gather load                 Scatter store
> Int or float     With intrinsification       With intrinsification
> Long or double   With intrinsification       With intrinsification
>                  Get indexShape directly     Get indexShape indirectly
>                  Normal index loading        Special controlled index loading
> Byte or short    Without intrinsification    With intrinsification, no instruction support on x86/arm
> 
> Based on the above, I use one simple implementation for both operations:
>
> Vector           Gather load/scatter store
> Int or float     With intrinsification
> Long or double   With intrinsification
>                  Get indexShape directly
>                  Special controlled index loading
> Byte or short    Without intrinsification
>
> If there are any problems, please let me know.
> 
> 4. Some assertions that the vector length is a power of two are removed.
> 5. Comments are added for gather load intrinsification.
> 
> Regards,
> Yang
> 
>
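
The index shape arithmetic and the tail out-of-bounds case described in items 1 and 2 of the quoted message can be worked through with a small plain-Java sketch. It assumes a 32-bit int index per lane, as the message states; the class, the method names and the 96-entry index map are hypothetical and are not taken from the patch.

```java
// Hypothetical illustration only: plain arithmetic, no Vector API calls.
final class GatherIndexSketch {

    // The index map is an int array, so each lane needs one 32-bit index.
    static final int INDEX_ELEMENT_BITS = 32;

    // Number of lanes in a vector of vectorBits holding elements of elementBits.
    static int laneCount(int vectorBits, int elementBits) {
        return vectorBits / elementBits;
    }

    // Bits of index data needed: one int index per lane.
    static int indexBitSize(int vectorBits, int elementBits) {
        return laneCount(vectorBits, elementBits) * INDEX_ELEMENT_BITS;
    }

    // True if reading `lanes` ints starting at mapOffset stays inside the index map.
    static boolean indexLoadInBounds(int mapLength, int mapOffset, int lanes) {
        return mapOffset >= 0 && mapOffset + lanes <= mapLength;
    }

    public static void main(String[] args) {
        int vectorBits = 2048;
        int longLanes = laneCount(vectorBits, 64); // lanes of the long vector
        int intLanes = laneCount(vectorBits, 32);  // lanes of a same-shape int vector

        int mapLength = 96;                        // hypothetical index map length
        int tailOffset = mapLength - longLanes;    // last full long vector's indices

        System.out.println("long lanes: " + longLanes + ", int lanes: " + intLanes);
        System.out.println("index bits needed: " + indexBitSize(vectorBits, 64));
        System.out.println("full int vector load in bounds: "
                + indexLoadInBounds(mapLength, tailOffset, intLanes));
        System.out.println("load limited to long lane count in bounds: "
                + indexLoadInBounds(mapLength, tailOffset, longLanes));
    }
}
```

In this sketch a 2048-bit long vector needs only 1024 bits of index data, so loading a full 2048-bit int vector at the tail reads past the end of the index map, while a load limited to the long lane count stays in bounds.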