<!DOCTYPE html><html><head><title></title><style type="text/css">p.MsoNormal,p.MsoNoSpacing{margin:0}</style></head><body><div>Thanks Jatin.<br></div><div><br></div><div>On Intel hardware, I am asking for pshufb, which C# makes available as Ssse3.Shuffle. It is cheap (1 cycle latency, good throughput) and widely supported (can effectively be assumed today). The equivalent on ARM hardware is vtbl (or vtbx). It is also cheap (when used with 1 register/table) and ubiquitous.<br></div><div><br></div><div>AVX-512 is fantastic. But it seems unwise to me to assume AVX-512 on Intel hardware. Furthermore, these instructions (vpermt2b) are significantly more expensive. They are not dirt-cheap, 1-cycle instructions.<br></div><div><br></div><div>It is also not clear to me that the JDK-24 proposal would even result in a single instruction.</div><div><br></div><div>Still, if the goal here is to make the Java Vector API work well on the assumption that you have AVX-512, then I am fine with such a design decision... but that's not what we have right now. 
Benchmarking 'rearrange' on Intel AVX-512 hardware indicates that it is a performance burden.<br></div><div><br></div><div><br></div><div><br></div><blockquote type="cite" id="qt" style=""><div><br></div><div>Vectorized table lookup is part of new operation support[1] for JDK-24.<br></div><div><br></div><div>The following two table lookup instructions, available on x86 targets supporting AVX512 feature[s], can be used to optimize lookups.<br></div><div><br></div><div>VPERMT2B[2] : Full Permute of Bytes From Two Tables Overwriting a Table<br></div><div>VPERMT2W/D/Q/PS/PD[3] : Full Permute From Two Tables Overwriting One Table<br></div><div><br></div><div>Best Regards,<br></div><div>Jatin<br></div><div><br></div><div>[1] <a href="https://mail.openjdk.org/pipermail/panama-dev/2024-May/020408.html">https://mail.openjdk.org/pipermail/panama-dev/2024-May/020408.html</a><br></div><div>[2] <a href="https://www.felixcloutier.com/x86/vpermt2b">https://www.felixcloutier.com/x86/vpermt2b</a><br></div><div>[3] <a href="https://www.felixcloutier.com/x86/vpermt2w:vpermt2d:vpermt2q:vpermt2ps:vpermt2pd">https://www.felixcloutier.com/x86/vpermt2w:vpermt2d:vpermt2q:vpermt2ps:vpermt2pd</a><br></div><div><br></div><div><br></div><div><br></div><div>From: panama-dev <<a href="mailto:panama-dev-retn@openjdk.org">panama-dev-retn@openjdk.org</a>> On Behalf Of Daniel Lemire<br></div><div>Sent: Wednesday, June 19, 2024 9:48 PM<br></div><div>To: <a href="mailto:panama-dev@openjdk.org">panama-dev@openjdk.org</a><br></div><div>Subject: Vector API : Lack of support for vectorized lookup tables<br></div><div><br></div><div>When parsing strings with SIMD instructions, vectorized table lookups like vtbl (ARM NEON) are important. They are cheap (often run in 1 cycle) and powerful.<br></div><div><br></div><div>Though the details depend on the exact instruction set, the general idea is that you provide a 16-byte table, and a vector of indexes. 
If the indexes are in the range [0,16), then the byte is retrieved from the 16-byte table. When programming in C#, you can call them directly: e.g., Ssse3.Shuffle or AdvSimd.Arm64.VectorTableLookup. Google relies on Highway (a C++ framework) which offers TableLookupBytes for this purpose.<br></div><div><br></div><div>You can use these instructions to validate and transcode Unicode, to parse DNS records, and to build faster regular expression parsers, base64 codecs, cryptographic hash functions and so forth. The applications are almost endless (see references at the end). These instructions are effectively ubiquitous in 2024. On x64 systems, SSSE3 and its pshufb instruction (equivalent to ARM's vtbl) are increasingly assumed as a requirement (Windows 11, RedHat, etc.). For example, it is used in Chromium (arguably the most important Web engine) to parse HTML quickly:<br></div><div><a href="https://chromium-review.googlesource.com/c/chromium/src/+/5538407">https://chromium-review.googlesource.com/c/chromium/src/+/5538407</a><br></div><div><br></div><div>The .NET runtime uses these instructions, calling them from C#: e.g., see their base64 encoder... (System/Buffers/Text/Base64Decoder.cs) In C#, we see fast tokenizers and parsers making use of these instructions.<br></div><div><br></div><div>Unfortunately, the Vector API in Java has no equivalent.<br></div><div><br></div><div>It may seem like rearrange and selectFrom are related, but these are not vectorized lookups. And once compiled, they generate a long sequence of instructions. They provide the same functionality, but without the performance. 
And, of course, the performance is the whole point of using something like the Vector API.<br></div><div><br></div><div>Overall, this lack of access to an important functionality simply cuts off important algorithmic optimizations from Java.<br></div><div><br></div><div><br></div><div><br></div><div>- Daniel<br></div><div><br></div><div><br></div><div>---<br></div><div>References:<br></div><div><br></div><div><br></div><div>Transcoding Billions of Unicode Characters per Second with SIMD Instructions<br></div><div>Software: Practice and Experience 52 (2), 2022<br></div><div><a href="https://arxiv.org/abs/2109.10433">https://arxiv.org/abs/2109.10433</a><br></div><div><br></div><div>Validating UTF-8 In Less Than One Instruction Per Byte<br></div><div>Software: Practice and Experience 51 (5), 2021<br></div><div><a href="https://arxiv.org/abs/2010.03090">https://arxiv.org/abs/2010.03090</a><br></div><div><br></div><div>Faster Base64 Encoding and Decoding using AVX2 Instructions<br></div><div>ACM Transactions on the Web 12 (3), 2018<br></div><div><a href="https://arxiv.org/abs/1704.00605">https://arxiv.org/abs/1704.00605</a><br></div><div><br></div><div>Parsing Gigabytes of JSON per Second<br></div><div>VLDB Journal 28 (6), 2019<br></div><div><a href="https://arxiv.org/abs/1902.08318">https://arxiv.org/abs/1902.08318</a><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div></blockquote><div><br></div></body></html>