UTF-8 validation using Vector API

Mon Feb 21 02:27:10 UTC 2022

This is a very helpful case study.

In a recent presentation I put forward UTF8 parsing
as an important challenge problem, exercising segmented
expansion.  This current experiment could be an exploratory
step in that direction.

http://cr.openjdk.java.net/~jrose/pres/202202-VectorTopics.pdf

(You message is timely; I have placed a link to it on my page 23,
and adjusted a few points there to reflect our exchange here.)

It is sad that you didn’t/couldn’t use the `slice` method directly,
but used the SLICE_4 constant permutation.  That’s worth fixing.
If there is a “boundary condition” that `slice` doesn’t handle
right we should address that too. By boundary condition I mean
a special case in a nearest neighbor computation at an end of the
vector where there’s no neighbor, but just the “cliff” at the end.
In general, I think the Vector API might possibly be friendlier
to boundary conditions, but I think we have to discover this by
experience.  One area that will push this experience further is
adopting multi-vectors, which have more interior, but require
clear and explicit processing at their ends, in many use cases.
(See p. 9ff.)

Ideally, the optimizer should be able to “see through” a computed
permutation (including the fromOp index expression) and use
either a constant-folded permutation vector, or an ad hoc cross
lane motion instruction (there are very many in both AVX and SVE).
That would cover slice.  Of course, we could also try to make
slice have it’s own little intrinsic “silo” but that has less of
a payoff.

You observe that the range checks in the LUT lookups add 15%
and propose just removing them or incorporating wrap logic.
There is another more “Java-like” way out of the problem, which
is to have the optimizer reason about the range-values of the
index lanes and omit the range check if the index lanes are
already provably in range.  That is the case in your code,
since before every `selectFrom` operation you first do a
lanewise `AND` with a nibble mask (0x0F or 0x00 all lanes).
The optimizer should be tracking something like `TypeInt`
range information on these lanes, I think, or perhaps
compute a bitwise support mask for every vector.  Using
a rearrange or select where the indexes are already in
range should not be penalized with a dynamic check.
Given logic like the above, a new `rearrangeWithWrap`
function is not so much needed, since a user can code
equivalent functionality with an explicit wrap operation,
which is then treated like the `AND` above.

In your code you have explicit copes from constant
vectors to “live” vector registers in the loop:

> var hi2Lut = HI2_LUT.lanewise(VectorOperators.XOR, 0);

It is a bug in the optimizer if it requires these in order
to get good performance.

Your comment about rematerializing simple ones-and-zeroes on the
fly in the loop is a good one.  The optimizer can do that stuff
with scalar constants but we are not (I think) there yet with
vectors.  The same point goes for constant lookup-table (LUT)
bit patterns.  It shouldn’t be necessary to “nudge” the optimizer
to pick them up properly.

I have a request…

It would be good if you were to write a version of your code
which is as simple and clear as possible, as if the optimizer
had no bugs but instead was smart about vectors (and their lanes)
as it is about scalars.  The reason this would be good is we would
have something to debug the optimizer with, because we would want
to identify the reasons you had to modify your code by hand to
nudge the optimizer to do something it should not need nudging
in order to do.

Thank you again for this excellent study.

— John

On 20 Feb 2022, at 0:43, Quân Anh Mai wrote:

> Hi,
>
> I have spent this weekend trying to make a UTF-8 validator using the Vector
> API and the look-up algorithm proposed by Keiser and Lemire [1]. The
> implementation is straightforward and the result looks good with the
> throughput validating twitter.json reaching 9GB/s. For reference, the
> throughput of my machine running simdjson [2] is around 20.5GB/s. Using the
> Vector API and tunning the performance, there are some noticeable points I
> can come up with
> …