[vectorIntrinsics] Issue to VectorAPI "selectFrom" for byte type with Arm SVE 2048-bits

Wed Nov 4 03:27:37 UTC 2020

Hi John,

Thanks for looking at the index related issues.

> The more permanent fix is to widen the shuffle lanes to the next size up that is big enough to index the data lanes, which for the foreseeable future is 16 bits (from 8), and that can be signed until we get vectors of size 2^16.  So, signed indexes are just fine, and they are sometimes useful (as explained in the docs) to express exceptional conditions and/or to select from a second data input.

We currently hit similar issues for the byte type when using the following APIs:
 - VectorMask.fromArray
 - VectorShuffle usages for "rearrange/slice/unslice"
 - VectorMask.indexInRange
 - selectFrom

Regarding the first three APIs, we can fix the issues by widening the shuffle lanes to the next size up. This is possible by modifying the Java implementation code for those APIs.
In fact, we have already made several patches for them. One of them (see https://mail.openjdk.java.net/pipermail/panama-dev/2020-August/010384.html) is still waiting for review.

However, since the "this" vector for "selectFrom" comes from the user, the user might have to do the widening themselves if the API requires a byte index vector for byte data. That is why I wondered earlier whether we could change the API itself.

> The most robust way, IMO, to widen shuffle lanes double size when necessary is to allow *synthetic* double-size vector shapes.  For example, allow a 1024-bit shape on AVX-512, or a 512-bit shape on AVX-2, either one consisting of a pair of native registers of the large size.  Even in AVX-512, the 512-bit shape which consists of two 256-bit vectors should *not* be conflated with the native 512-bit shape, but rather be its own option.
>
> These multi-vector synthetic shapes amount to user selectable unrolling of loops, in addition to helping us out of tight spots when a lane width needs to be doubled.
>
> The C bindings for SVE have synthetic vector types for 2x, 3x, and 4x (IIRC).  I think this would be a fine thing to do in our Vector API as well.  I suggest names of the form S_{N}_BIT_X{M}, where {M} ranges in 2..4 or 2..5.  (Yes, odd sizes are probably helpful here.  It’s another use case for synthetics, when you have 3-tuples interleaved in memory but want to vectorize.)

Allowing "double-size vector shapes" is a good idea. Is there any plan to support this in the Vector API in the future? Thanks!

-----Original Message-----
From: John Rose <john.r.rose at oracle.com> 
Sent: Wednesday, November 4, 2020 8:11 AM
To: Paul Sandoz <paul.sandoz at oracle.com>
Cc: Xiaohong Gong <Xiaohong.Gong at arm.com>; panama-dev at openjdk.java.net; nd <nd at arm.com>
Subject: Re: [vectorIntrinsics] Issue to VectorAPI "selectFrom" for byte type with Arm SVE 2048-bits

On Oct 26, 2020, at 4:47 PM, Paul Sandoz <paul.sandoz at oracle.com> wrote:
> Given that this.selectFrom(Vector that) is equivalent to that.rearrange(this.toShuffle()) we might be able to remove selectFrom(Vector ), but we likely need to teach C2 to recognize the pattern of the latter and elide the shuffle conversion if there is a more direct instruction.

There are two forms for shuffling selection for expressiveness.
In some algorithms, the shuffle is fixed, and in others the data being shuffled is fixed; in any case the code is often easier to read when the data-driven part is written in fluent notation, on the left.
That’s why there are two ways to say the same thing, because the programmer might prefer to emphasize either part of the operation (shuffle or data) as the principal part.
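
To make the equivalence concrete, here is a minimal scalar sketch in which plain arrays stand in for vectors. The class and method names are only illustrative; this is not the jdk.incubator.vector implementation, and it models in-range indexes only.

```java
import java.util.Arrays;

// Scalar model of the two spellings; arrays stand in for vectors.
// Hypothetical helper class, covering in-range indexes only.
final class ShuffleSpellings {
    // Models data.rearrange(shuffle): result lane i is data[shuffle[i]].
    static byte[] rearrange(byte[] data, int[] shuffle) {
        byte[] out = new byte[data.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = data[shuffle[i]];
        }
        return out;
    }

    // Models indexes.selectFrom(data): the index vector is the receiver,
    // and it has the same lane type (byte) as the data.
    static byte[] selectFrom(byte[] indexes, byte[] data) {
        int[] shuffle = new int[indexes.length];
        for (int i = 0; i < shuffle.length; i++) {
            shuffle[i] = indexes[i]; // stands in for toShuffle()
        }
        return rearrange(data, shuffle);
    }

    public static void main(String[] args) {
        byte[] data = {10, 20, 30, 40};
        byte[] reverse = {3, 2, 1, 0};
        // Both spellings pick the same lanes; prints [40, 30, 20, 10]
        System.out.println(Arrays.toString(selectFrom(reverse, data)));
    }
}
```

Note that in the selectFrom spelling the indexes are stored in byte lanes, which is where the index-range problem for byte data comes from.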

That said, this problem of narrow lanes not being able to index (exponentially) long vectors is independent of which side of the dot the shuffle vector shows up on.  This more fundamental problem cannot be overcome by a syntax shift.  You sometimes have to expand the lane width of the shuffle (permutation vector) beyond the data lane width.  For now, the problem shows up with byte data lanes indexed by byte shuffle lanes; making the shuffle lanes unsigned delays the inevitable by a factor of 2.

The more permanent fix is to widen the shuffle lanes to the next size up that is big enough to index the data lanes, which for the foreseeable future is 16 bits (from 8), and that can be signed until we get vectors of size 2^16.  So, signed indexes are just fine, and they are sometimes useful (as explained in the docs) to express exceptional conditions and/or to select from a second data input.
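
The index-range arithmetic can be checked with a small sketch. The helper names below are hypothetical, and the 4096-bit row is a made-up size included only to show where unsigned byte indexes would run out.

```java
// Which index lane types can address every byte lane of a vector?
// Hypothetical helper class; plain arithmetic, no Vector API calls.
final class ShuffleIndexRange {
    // Number of byte lanes in a vector of the given bit size.
    static int byteLanes(int vectorBits) {
        return vectorBits / Byte.SIZE;
    }

    // True if every lane index 0..lanes-1 fits in a signed byte (max 127).
    static boolean fitsSignedByte(int lanes) {
        return lanes - 1 <= Byte.MAX_VALUE;
    }

    // True if every lane index fits in an unsigned byte (max 255).
    static boolean fitsUnsignedByte(int lanes) {
        return lanes - 1 <= 0xFF;
    }

    // True if every lane index fits in a signed short (max 32767).
    static boolean fitsSignedShort(int lanes) {
        return lanes - 1 <= Short.MAX_VALUE;
    }

    public static void main(String[] args) {
        for (int bits : new int[] {512, 1024, 2048, 4096}) {
            int lanes = byteLanes(bits);
            System.out.printf("%4d bits: %3d byte lanes, signed byte=%b, unsigned byte=%b, signed short=%b%n",
                    bits, lanes, fitsSignedByte(lanes), fitsUnsignedByte(lanes), fitsSignedShort(lanes));
        }
        // An index of 200 does not survive a trip through a signed byte lane.
        byte wrapped = (byte) 200;
        System.out.println(wrapped);                     // prints -56
        System.out.println(Byte.toUnsignedInt(wrapped)); // prints 200
    }
}
```

With 2048-bit SVE there are 256 byte lanes, so signed byte indexes stop at lane 127 while unsigned ones just reach lane 255; a signed 16-bit index covers all of these sizes with room to spare.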

The most robust way, IMO, to widen shuffle lanes double size when necessary is to allow *synthetic* double-size vector shapes.  For example, allow a 1024-bit shape on AVX-512, or a 512-bit shape on AVX-2, either one consisting of a pair of native registers of the large size.  Even in AVX-512, the 512-bit shape which consists of two 256-bit vectors should *not* be conflated with the native 512-bit shape, but rather be its own option.
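
A rough sizing sketch shows why a pair of native registers is the natural unit here. It assumes one 16-bit shuffle index per byte data lane; the helper names are hypothetical and nothing below is part of the Vector API.

```java
// How many native registers does a widened shuffle occupy?
// Hypothetical helper class; assumes one index lane per data lane.
final class SyntheticShapeSizing {
    // Total bits needed for one shuffle index per data lane.
    static int shuffleBits(int dataLanes, int indexLaneBits) {
        return dataLanes * indexLaneBits;
    }

    // Native registers needed to hold that many bits, rounded up.
    static int registersNeeded(int bits, int nativeBits) {
        return (bits + nativeBits - 1) / nativeBits;
    }

    public static void main(String[] args) {
        int nativeBits = 2048;                          // widest SVE register
        int dataLanes = nativeBits / Byte.SIZE;         // 256 byte lanes
        int bits = shuffleBits(dataLanes, Short.SIZE);  // 4096 bits of indexes
        System.out.println(registersNeeded(bits, nativeBits)); // prints 2
    }
}
```

Doubling the index lane width doubles the size of the shuffle, so the widened shuffle for one native byte vector fills exactly two native registers, which is the double-size shape described above.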

These multi-vector synthetic shapes amount to user selectable unrolling of loops, in addition to helping us out of tight spots when a lane width needs to be doubled.

The C bindings for SVE have synthetic vector types for 2x, 3x, and 4x (IIRC).  I think this would be a fine thing to do in our Vector API as well.  I suggest names of the form S_{N}_BIT_X{M}, where {M} ranges in 2..4 or 2..5.  (Yes, odd sizes are probably helpful here.  It’s another use case for synthetics, when you have 3-tuples interleaved in memory but want to vectorize.)

— John