Vector API : Lack of support for vectorized lookup tables
Daniel Lemire
daniel at lemire.me
Thu Jun 20 15:04:14 UTC 2024
Let me elaborate with an example so that what I mean is clear. August Nagro wrote a fast UTF-8 validator using Java Vector (see https://github.com/AugustNagro/utf8.java). A key step in the function is as follows:
ByteVector prev1 = prevInputBlock.slice(species.length() - 1, input);
ByteVector byte1High = prev1.lanewise(LSHR, 4).selectFrom(lut.byte1HighLookup());
ByteVector byte1Low = prev1.and((byte) 0x0f).selectFrom(lut.byte1LowLookup());
ByteVector byte2High = input.lanewise(LSHR, 4).selectFrom(lut.byte2HighLookup());
ByteVector specialCases = byte1High.and(byte1Low).and(byte2High);
The expectation is that selectFrom should be a single cheap instruction (on all systems as long as ByteVector.SPECIES_PREFERRED is used). But that is not the case.
Unfortunately, that is the case. The net result is that instead of validating UTF-8 streams at very high speed, we barely reach 3 GB/s... Using SIMD instructions we can fully parse and validate a JSON document (see simdjson) for that kind of speed. It is likely not worth the effort.
I implemented the equivalent to August's code in C# and we are fully 10 x faster.
https://github.com/simdutf/SimdUnicode
- Daniel
On Wed, Jun 19, 2024, at 12:17, Daniel Lemire wrote:
> When parsing strings with SIMD instructions, vectorized table lookup like vtbl (ARM NEON) are important. They are cheap (often run in 1 cycle) and powerful.
>
>
> Though the details depend on the exact instruction set, the general idea is that you provide a 16-byte table, and a vector with indexes. If the indexes are in the range [0,16), then the byte is retrieved from the 16-byte table. When programming in C#, you can call them directly: e.g., Ssse3.Shuffle or AdvSimd.Arm64.VectorTableLookup. Google relies on Highway (a C++ framework) which offers TableLookupBytes for this purpose.
>
> You can use these instructions to validate and transcode Unicode, to parse DNS records, faster regular expression parsers, base64 codecs, cryptographic hash functions and so forth. The applications are almost endless (see reference at the end). These instructions are effectively ubiquitous in 2024. On x64 systems, SSSE3 and its pshufb instruction (equivalent to ARM's vtbl) are increasingly assumed as a requirement (Windows 11, RedHat, etc.). For example, it is used in Chromium (arguably the most important Web engine) to parse HTML quickly:
> https://chromium-review.googlesource.com/c/chromium/src/+/5538407
>
> The .NET runtime uses these instructions, calling them from C#: E.g., see their base64 encoder... (System/Buffers/Text/Base64Decoder.cs) In C#, we see fast tokenizers and parsers making use of these instructions.
>
> Unfortunately, the Vector API in Java has no equivalent.
>
> It may seem lie rearrange and selectFrom are related, but these are not vectorized lookups. And once compiled, they generate a long flow of instructions. It provides the same functionality, but without the performance. And, of course, the performance is the whole point of using something like the Vector API.
>
> Overall, this lack of access to an important functionality simply cuts off important algorithmic optimizations from Java.
>
>
>
> - Daniel
>
>
> ---
> References:
>
>
> Transcoding Billions of Unicode Characters per Second with SIMD Instructions
> Software: Practice and Experience 52 (2), 2022
> https://arxiv.org/abs/2109.10433
>
> Validating UTF-8 In Less Than One Instruction Per Byte
> Software: Practice and Experience 51 (5), 2021
> https://arxiv.org/abs/2010.03090
>
> Faster Base64 Encoding and Decoding using AVX2 Instructions
> ACM Transactions on the Web 12 (3), 2018
> https://arxiv.org/abs/1704.00605
>
> Parsing Gigabytes of JSON per Second
> VLDB Journal 28 (6), 2019
> https://arxiv.org/abs/1902.08318
>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/panama-dev/attachments/20240620/e5c15bb7/attachment-0001.htm>
More information about the panama-dev
mailing list