Vector API : Lack of support for vectorized lookup tables

Tue Sep 17 21:05:10 UTC 2024

In the C# code I see some familiar things in strange C# clothing.
There are a number of length-256 tables accessed by (unsigned?)
bytes or by chars that have been guarded to the range 0..255.
Access to them is not range-checked (pointer arithmetic + load).

Here is what I think is a representative bit:

> Vector128<byte> compactmask = Vector128.Load(tablePtr + pop1 * 8);
> Vector128<byte> answer = 
> AdvSimd.Arm64.VectorTableLookup(pruned.AsByte(), compactmask);

Using a scalar value “pop1” as index, you need to load a permutation
vector (or some such thing) from a static table.  Then you apply that
vector to a data vector “pruned” to transform to the desired answer.

I think what we are building now is a primitive VectorTableLookup
primitive in our Vector API.  There is a question about how to
handle out-of-range indexes in the permutation (lookup/table)
vector; what is the policy for VectorTableLookup?  Undefined
behavior?  Masking to power of two?

The compactmask thingy (it’s not a mask is it?) appears to be a
permutation vector, what we’d call a VectorShuffle, extracted
from a static table.  We don’t have a fast way to do this yet,
because our memory access story is accompanied by low-level
safety checks almost everywhere.

I think, as a general principle for APIs to be used for
these algorithms, we should be masking both permutation indexes,
and table indexes, to powers of two, as a way of avoiding more
complicated range checks.

Thus, if we load an unconditioned vector from memory and use it
as a shuffle (LUT) then we may need to apply a VPAND to it to force
the indexes into the right place.  And when we load such a thing,
either a shuffle or some other random datum, we should consider
reducing the index using a scalar AND operation, instead of doing
the usual Java thing of range-check-and-throw.

A VectorShuffle object is preconditioned, range checked at construction.
But if you convert it from a vector, you have to do the range check
dynamically. I think maybe we need a conversion that boils down to
a VPAND, and sometimes (if the JIT sees a suitably conditioned input)
a NOP.

When such things (vector or shuffle) are loaded from memory, either
they are pointer-indirected (so shuffles have their preconditioning),
or if they are flat, they are loaded from segments or arrays, and
there is no preconditioning.

I think maybe we need a new kind of object which acts as a static
fixed-size (power of two size) table of VectorShuffle objects,
all of which are preconditioned for instant use with whatever VPERM
operation the hardware supplies.  A value loaded from such a table
would not need a VPAND to ensure correct index format.

Moreover I’m envisioning statically definable tables of BOTH vectors
per se and their shuffles (validated permutation/rearrange/lookup 
tables).
And implementations of them for each vector species of course.

This is a fair amount of design work.  As a shortcut, I think we should
support appropriate tools for third parties (like your team Daniel)
to build such things.  I think an unchecked memory segment might be
a good place to store such tables, and for prototyping one might
consider digging into internal (post-guard) access subroutines
like fromMemorySegment0Template.

I think we might also need an unsafe bridge between vectors and 
shuffles,
one which converts bitwise between them, and does not bother to 
re-impose
range checks on the shuffle lanes (permutation indexes).  That way,
a fast, flat container for vectors could be wrapped as a fast, flat
container for shuffles.

There are a number of bits here which are hard to access (buried
in the implementation) or even missing. Eventually I’d like to
build on top of them (perhaps adding missing bits) to expose
static lookup tables that can serve vectors and shuffles.

And if that is useful, one might also consider a third kind of
table that is the target of a gathering store, where each table
entry is indexed by both lane position and lane contents.
Like the other two, it would internally use unsafe access
and guard against OOB errors by masking down the query index.
In this case, the masking would be applied across all the
lanes, since it is a multi-index operation.

Finally, such static tables are a place where Java’s lambdas
could shine, as compact specifications of the table contents.
When reading the above C# code it is clear to me that the
tables were not manually written, but emitted by some ad
hoc program.  A good Java way to build such tables is to
supply a constructor argument that is a lambda which is
iterated over the table’s index space to obtain each table
value.

Let’s think some more about this stuff.  I think static
tables are an under-appreciated tool for these kinds of
algorithms.

— John

On 16 Sep 2024, at 11:08, Daniel Lemire wrote:

> For reference, we have published a C# library for Base64 decoding:
>
> https://github.com/simdutf/SimdBase64
>
> This is another example of a library that cannot be written (with good 
> performance) in Java right now.
>
> I thought it might be useful to provide another example.
>
> (Note that I am *not* biased in favour of C# nor am I promoting C#. I 
> am only using C# as a reference due to the obvious similarity between 
> C# and Java.)
>
> - Daniel
>
> On Fri, Jun 28, 2024, at 12:51, Paul Sandoz wrote:
>>
>>
>>> On Jun 28, 2024, at 9:36 AM, Chris Hegarty <chegar999 at gmail.com> 
>>> wrote:
>>>
>>> Hi,
>>>
>>>
>>> On 21/06/2024 21:11, Paul Sandoz wrote:
>>>> Hi Daniel,
>>>> Thanks for the email discussion. I think we can do better and get 
>>>> to, or closer to, your expectation on the generation of a single 
>>>> instruction.
>>>> At the moment rearrange performs bounds checks on the shuffle 
>>>> argument (checking for exceptional indexes), and selectFrom 
>>>> converts a vector to a shuffle and calls rearrange.
>>>
>>> ...
>>>
>>> We've run into this separately too. If memory serves me correctly, 
>>> it was slice(int) using rearrange internally to only then do 
>>> expensive bounds checks on the implementation specific shuffle. ( I 
>>> can try to dig out the details if I've mixed up something here. )
>>>
>>
>> Please share if you can (I recall you may have mentioned this to me 
>> in the past, and I probably said “hmm…. this is tricky"). 
>> Benchmarks/code can really help focus our efforts.
>>
>>
>>> Bounds checks on input at the API level is expected, but not when 
>>> the rearrange is an implementation detail.
>>>
>>> More generally, the system property has been around for a while now, 
>>> and from my understanding it's a workaround to the compiler not yet 
>>> being able to eliminate the checks. Firstly, is this understanding 
>>> correct? And if so, is the intention to eventually remove it when it 
>>> is no longer needed?
>>>
>>
>> The system property is for debugging, it’s not intended for 
>> production use. Ideally it would go away, but it’s a useful "break 
>> glass in emergency". My usage was intended to show what might be 
>> possible if we can enhance the API and/or implementation 
>> appropriately.
>>
>> Paul.
>>
>>> ...
>>>
>>>> To be fair, it is fine if folks say "No, we don't want to do this 
>>>> with Java Vector, we don't want to optimize indexing in Lucene, we 
>>>> want to do X".
>>>
>>> Lucene is already using the Panama Vector API to improve the 
>>> performance of vector similarity comparisons - for vector search. We 
>>> would love to use it for more, but are largely hampered by the above 
>>> issue. A less ideal situation is that in Elasticsearch [1] we're now 
>>> using Panama FFI to workaround the limitations of the Vector API by 
>>> linked to our own optimized implementations. This not ideal, and 
>>> effectively robs Lucene of these potential performance benefits 
>>> (since it's non-trivial to ship Lucene with a native lib)
>>>
>>> -Chris.
>>>
>>> [1] 
>>> https://urldefense.com/v3/__https://github.com/elastic/elasticsearch/tree/main/libs/simdvec__;!!ACWV5N9M2RV99hQ!OZZEZlRHgbdC6YncLGIyD41hdpE2pHOo9-alCbMm7qzTPbmxPPLN1-IzNwnemXh3PnfOwsjdiwb5Qzrw4EM$
>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/panama-dev/attachments/20240917/e3f2cd79/attachment-0001.htm>