UTF-8 validation using Vector API

Tue Feb 22 16:44:33 UTC 2022

> Your wonderful document has a lot of ideas, I just want to add that you
don't need to involve memory to replicate a compress operation as you can
compute the cumulative sum of the not of the mask then subtract that from a
iota index vector to achieve a shuffle index vector.

Silly me messed up this part, please ignore it. Another way (should be
correct this time) would be to do a lookup on the value of the mask to find
the correct shuffle

Thanks a lot,
Quan Anh

On Mon, 21 Feb 2022 at 10:27, John Rose <john.r.rose at oracle.com> wrote:

> This is a very helpful case study.
>
> In a recent presentation I put forward UTF8 parsing
> as an important challenge problem, exercising segmented
> expansion. This current experiment could be an exploratory
> step in that direction.
>
> http://cr.openjdk.java.net/~jrose/pres/202202-VectorTopics.pdf
>
> (You message is timely; I have placed a link to it on my page 23,
> and adjusted a few points there to reflect our exchange here.)
>
> It is sad that you didn’t/couldn’t use the slice method directly,
> but used the SLICE_4 constant permutation. That’s worth fixing.
> If there is a “boundary condition” that slice doesn’t handle
> right we should address that too. By boundary condition I mean
> a special case in a nearest neighbor computation at an end of the
> vector where there’s no neighbor, but just the “cliff” at the end.
> In general, I think the Vector API might possibly be friendlier
> to boundary conditions, but I think we have to discover this by
> experience. One area that will push this experience further is
> adopting multi-vectors, which have more interior, but require
> clear and explicit processing at their ends, in many use cases.
> (See p. 9ff.)
>
> Ideally, the optimizer should be able to “see through” a computed
> permutation (including the fromOp index expression) and use
> either a constant-folded permutation vector, or an ad hoc cross
> lane motion instruction (there are very many in both AVX and SVE).
> That would cover slice. Of course, we could also try to make
> slice have it’s own little intrinsic “silo” but that has less of
> a payoff.
>
> You observe that the range checks in the LUT lookups add 15%
> and propose just removing them or incorporating wrap logic.
> There is another more “Java-like” way out of the problem, which
> is to have the optimizer reason about the range-values of the
> index lanes and omit the range check if the index lanes are
> already provably in range. That is the case in your code,
> since before every selectFrom operation you first do a
> lanewise AND with a nibble mask (0x0F or 0x00 all lanes).
> The optimizer should be tracking something like TypeInt
> range information on these lanes, I think, or perhaps
> compute a bitwise support mask for every vector. Using
> a rearrange or select where the indexes are already in
> range should not be penalized with a dynamic check.
> Given logic like the above, a new rearrangeWithWrap
> function is not so much needed, since a user can code
> equivalent functionality with an explicit wrap operation,
> which is then treated like the AND above.
>
> In your code you have explicit copes from constant
> vectors to “live” vector registers in the loop:
>
> var hi2Lut = HI2_LUT.lanewise(VectorOperators.XOR, 0);
>
> It is a bug in the optimizer if it requires these in order
> to get good performance.
>
> Your comment about rematerializing simple ones-and-zeroes on the
> fly in the loop is a good one. The optimizer can do that stuff
> with scalar constants but we are not (I think) there yet with
> vectors. The same point goes for constant lookup-table (LUT)
> bit patterns. It shouldn’t be necessary to “nudge” the optimizer
> to pick them up properly.
>
> I have a request…
>
> It would be good if you were to write a version of your code
> which is as simple and clear as possible, as if the optimizer
> had no bugs but instead was smart about vectors (and their lanes)
> as it is about scalars. The reason this would be good is we would
> have something to debug the optimizer with, because we would want
> to identify the reasons you had to modify your code by hand to
> nudge the optimizer to do something it should not need nudging
> in order to do.
>
> Thank you again for this excellent study.
>
> — John
>
> On 20 Feb 2022, at 0:43, Quân Anh Mai wrote:
>
> Hi,
>
> I have spent this weekend trying to make a UTF-8 validator using the
> Vector
> API and the look-up algorithm proposed by Keiser and Lemire [1]. The
> implementation is straightforward and the result looks good with the
> throughput validating twitter.json reaching 9GB/s. For reference, the
> throughput of my machine running simdjson [2] is around 20.5GB/s. Using
> the
> Vector API and tunning the performance, there are some noticeable points I
> can come up with
> …
>
>