UTF-8 validation using Vector API

Tue Feb 22 13:17:47 UTC 2022

Hi,

Thank you very much for your comprehensive response and insightful
document. The idea of tracking the value interval (as being done with
scalar) of each lane in vectors is really elegant and brilliant as it can
easily take down all the concerns raised above. Constant folding will
surely eliminate range checks of rearrange operations as the upper bound is
surely known statically. Constant shuffle vectors will also allow
recognition of slicing patterns with constant shifts. Although it would be
much harder to detect variable-shift slice operations, that is less of a
problem since with more optimal rearrange we can always do a general
shuffle with just some extra instructions and no range checks. Constant
vectors also allow backend matcher to emit more efficient codes since it
can recognize that look-up tables are just a broadcast of 128-bit lanes and
reduce 5 expensive instructions including a memory load with a single
instruction.

For constant vectors, the situation is very different from a scalar value.
While a scalar can be materialized in at most 2 instructions with no memory
load for floating points and often less than 1 with integers, materializing
a vector may be really expensive both regarding execution time and code
size, and keeping the vector around increases register pressure. So it is a
trade-off and best left to the developers to decide, most of the time they
will hoist the calculation out of the loop themselves and if they decide to
put the calculation inside a loop there may be a real reason to do so.
Given Vector API is a very low-level API, I think it is more suitable to
make the API optimizable upon dedication rather than optimized-ish in
general cases. And a more predictable result would be more welcome than
some ad-hoc magic.

Your wonderful document has a lot of ideas, I just want to add that you
don't need to involve memory to replicate a compress operation as you can
compute the cumulative sum of the not of the mask then subtract that from a
iota index vector to achieve a shuffle index vector.

Thank you very much,
Quan Anh

On Mon, 21 Feb 2022 at 10:27, John Rose <john.r.rose at oracle.com> wrote:

> This is a very helpful case study.
>
> In a recent presentation I put forward UTF8 parsing
> as an important challenge problem, exercising segmented
> expansion. This current experiment could be an exploratory
> step in that direction.
>
> http://cr.openjdk.java.net/~jrose/pres/202202-VectorTopics.pdf
>
> (You message is timely; I have placed a link to it on my page 23,
> and adjusted a few points there to reflect our exchange here.)
>
> It is sad that you didn’t/couldn’t use the slice method directly,
> but used the SLICE_4 constant permutation. That’s worth fixing.
> If there is a “boundary condition” that slice doesn’t handle
> right we should address that too. By boundary condition I mean
> a special case in a nearest neighbor computation at an end of the
> vector where there’s no neighbor, but just the “cliff” at the end.
> In general, I think the Vector API might possibly be friendlier
> to boundary conditions, but I think we have to discover this by
> experience. One area that will push this experience further is
> adopting multi-vectors, which have more interior, but require
> clear and explicit processing at their ends, in many use cases.
> (See p. 9ff.)
>
> Ideally, the optimizer should be able to “see through” a computed
> permutation (including the fromOp index expression) and use
> either a constant-folded permutation vector, or an ad hoc cross
> lane motion instruction (there are very many in both AVX and SVE).
> That would cover slice. Of course, we could also try to make
> slice have it’s own little intrinsic “silo” but that has less of
> a payoff.
>
> You observe that the range checks in the LUT lookups add 15%
> and propose just removing them or incorporating wrap logic.
> There is another more “Java-like” way out of the problem, which
> is to have the optimizer reason about the range-values of the
> index lanes and omit the range check if the index lanes are
> already provably in range. That is the case in your code,
> since before every selectFrom operation you first do a
> lanewise AND with a nibble mask (0x0F or 0x00 all lanes).
> The optimizer should be tracking something like TypeInt
> range information on these lanes, I think, or perhaps
> compute a bitwise support mask for every vector. Using
> a rearrange or select where the indexes are already in
> range should not be penalized with a dynamic check.
> Given logic like the above, a new rearrangeWithWrap
> function is not so much needed, since a user can code
> equivalent functionality with an explicit wrap operation,
> which is then treated like the AND above.
>
> In your code you have explicit copes from constant
> vectors to “live” vector registers in the loop:
>
> var hi2Lut = HI2_LUT.lanewise(VectorOperators.XOR, 0);
>
> It is a bug in the optimizer if it requires these in order
> to get good performance.
>
> Your comment about rematerializing simple ones-and-zeroes on the
> fly in the loop is a good one. The optimizer can do that stuff
> with scalar constants but we are not (I think) there yet with
> vectors. The same point goes for constant lookup-table (LUT)
> bit patterns. It shouldn’t be necessary to “nudge” the optimizer
> to pick them up properly.
>
> I have a request…
>
> It would be good if you were to write a version of your code
> which is as simple and clear as possible, as if the optimizer
> had no bugs but instead was smart about vectors (and their lanes)
> as it is about scalars. The reason this would be good is we would
> have something to debug the optimizer with, because we would want
> to identify the reasons you had to modify your code by hand to
> nudge the optimizer to do something it should not need nudging
> in order to do.
>
> Thank you again for this excellent study.
>
> — John
>
> On 20 Feb 2022, at 0:43, Quân Anh Mai wrote:
>
> Hi,
>
> I have spent this weekend trying to make a UTF-8 validator using the
> Vector
> API and the look-up algorithm proposed by Keiser and Lemire [1]. The
> implementation is straightforward and the result looks good with the
> throughput validating twitter.json reaching 9GB/s. For reference, the
> throughput of my machine running simdjson [2] is around 20.5GB/s. Using
> the
> Vector API and tunning the performance, there are some noticeable points I
> can come up with
> …
>
>