Vector API : Lack of support for vectorized lookup tables

Tue Jun 25 17:21:05 UTC 2024

Hi Daniel,

Appreciate that you are pointing out the current limitations, this and work by August and Lucene really helps us identify problems and improve the API and its performance incrementally. As August also concludes:

"I think someone could implement simd-json in Java if they wanted to. Would it be fast? At least not for now.”

Hopefully in a subsequent release :-)

Paul.

> On Jun 25, 2024, at 7:16 AM, Daniel Lemire <daniel at lemire.me> wrote:
> 
> Bound checking is not relevant in the context vectorized classification.
> 
> As an example, look at this project https://github.com/simdutf/SimdUnicode where we coded UTF-8 validation in C#. It is not as fast as C code, but we reach > 10 GB/s on a 2 GHz Ice Lake server.
> 
> Exactly the same approach was coded in Java https://github.com/AugustNagro/utf8.java by August Nagro (who is on this list), he concluded : « Would have been nice if performance was a success story, but failure is educational. »
> 
> See the difference?
> 
> To be fair, Oracle Graal has the same algorithm in 'Java'... but it is not, as far as I can tell, using the Java Vector API:
> 
> https://github.com/oracle/graal/blob/01f3be02bb8e99e910959c1f266a69f651b5dc08/compiler/src/jdk.graal.compiler/src/jdk/graal/compiler/lir/aarch64/AArch64CalcStringAttributesOp.java#L532
> 
> See how they do it here: 
> 
> https://github.com/oracle/graal/blob/01f3be02bb8e99e910959c1f266a69f651b5dc08/compiler/src/jdk.graal.compiler/src/jdk/graal/compiler/lir/aarch64/AArch64CalcStringAttributesOp.java#L769
> 
> With Geoff Langdale (from Intel, see his site at https://branchfree.org) and others, we wrote a really fast JSON parser (see https://simdjson.org) full of these techniques. You can parse JSON files at nearly 10 GB/s on some systems (see https://openbenchmarking.org/test/pts/simdjson-2.0.0). You can write more or less the same code in Java, and it has been done... https://github.com/simdjson/simdjson-java by Piotr Rżysko who I also think is on this list... but the performance is not the same.
> 
> These are well documented techniques, they have been deployed for years in production systems. E.g., see Intel's Hyperscan for example (https://www.intel.com/content/www/us/en/developer/articles/technical/introduction-to-hyperscan.html).
> 
> 
> I *think* you should design the Java Vector API so that you can write something like Intel's Hyperscan in Java. You certainly can do it in .NET/C# given enough effort. I *think* that you should be able to have a JSON parser like simdjson-java get close to C++ performance...  
> 
> To be fair, it is fine if folks say "No, we don't want to do this with Java Vector, we don't want to optimize indexing in Lucene, we want to do X".
> 
> The reason I came here is to point out that there is a wide range of applications for which Java Vector is currently poorly suited.
> 
> To be clear, **it is entirely fine**: I am not here to tell people what to do. I am just here to point out significant limitations.
> 
> - Daniel
> 
>> Thanks Daniel,
>> 
>> Yes, C# does offer Hardware intrinsics[1] which appear to give greater flexibility to user to code at near assembly level but onus of out of bounds index checking lie on user. 
>> VectorAPI in default mode performs index out of bounds checking, with bound checking disabled the generated JIT code does emit VSHUFB.
>> 
>> CPROMPT>java -Xbatch -XX:-TieredCompilation -XX:CompileCommand=Print,rearrange::shuffle -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0 -XX:UseAVX=2 --add-modules=jdk.incubator.vector -XX:MaxVectorSize=16 -cp . rearrange
>>   
>> public static VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_PREFERRED;
>> 
>>   public static void shuffle(byte [] res, byte [] arr, int [] idx) {
>>      for(int i = 0 ; i < SPECIES.loopBound(res.length); i+=SPECIES.length()) {
>>        ByteVector FV = ByteVector.fromArray(SPECIES, arr, i);
>>        FV.rearrange(VectorShuffle.fromArray(SPECIES, idx, 0)).intoArray(res,i);
>>      }
>>   }
>> 
>>   0x00007f7194721c87:   vmovdqu 0x10(%r12,%r11,8),%xmm0
>>   0x00007f7194721c8e:   vmovdqu 0x20(%rsp),%xmm1
>>   0x00007f7194721c94:   vpshufb %xmm0,%xmm1,%xmm1
>>   0x00007f7194721c99:   mov    (%rsp),%r11
>>   0x00007f7194721c9d:   vmovdqu %xmm1,0x10(%r11,%rbp,1)
>> 
>> JIT compiler ensures to emit optimized instruction sequence based on target features, what I meant with ""One can always customize his kernel to pick most efficient intrinsic implementation for a given target".
>> was that user may pick a different vectorization algorithm for different vector concrete vector size and prefer guarding it with explicit feature checks / vector sizes, for example[2]  
>> 
>> [1] https://devblogs.microsoft.com/dotnet/dotnet-8-hardware-intrinsics/
>> [2] PARQUET-2159: Vectorized BytePacker decoder using Java VectorAPI
>> https://github.com/apache/parquet-java/pull/1011/files#diff-d2b0d794dd3e2c9476f435fffbc9352eacb7c58070227d2193c58abb39e513d3R123
>> 
>> Best Regards,
>> Jatin
>> 
>> From: Daniel Lemire <daniel at lemire.me> 
>> Sent: Friday, June 21, 2024 8:10 PM
>> To: Bhateja, Jatin <jatin.bhateja at intel.com>
>> Cc: panama-dev at openjdk.org
>> Subject: Re: Vector API : Lack of support for vectorized lookup tables
>> 
>> "One can always customize his kernel to pick most efficient intrinsic implementation for a given target, I see some feature level check in C# implementation."    
>> 
>> The C# .NET SIMD API has three levels of abstraction depending on how much control you need. At the lowest level, you can call specific ARM NEON instructions, for example.
>> 
>> 
>> 
>> Hi Daniel,
>> 
>> Thanks for your explanation.
>> 
>> Yes, 512 bit ByteVector rearrange is optimized for x86 targets with AVX512_VBMI feature, since it offers are direct 'VPERMB' instruction for cross lane shuffle. 
>> 
>> I ran August's benchmark on Sapphire Rapids and see ~15x improvements over scalar implementation.
>> 
>> SPR>java -Xlog:cpu* --version
>> [0.001s][info][os,cpu] CPU: total 112 (initial active 112) (56 cores per cpu, 2 threads per core) family 6 model 143 stepping 6 microcode 0x2b0004d0, cx8, cmov, fxsr, ht, mmx, 3dnowpref, sse, sse2, sse3, ssse3, sse4.1, sse4.2, popcnt, lzcnt, tsc, tscinvbit, avx, avx2, aes, erms, clmul, bmi1, bmi2, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, fma, vzeroupper, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflush, clflushopt, clwb, avx512_vbmi2, *avx512_vbmi*, serialize, rdtscp, rdpid, fsrm, gfni, avx512_bitalg, f16c, pku, ospke, cet_ibt, cet_ss, avx512_ifma
>> [0.024s][info][os,cpu] CPU Model and flags from /proc/cpuinfo:
>> [0.024s][info][os,cpu] model name       : Intel(R) Xeon(R) Platinum 8480+
>> [0.024s][info][os,cpu] flags            : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfn
>>  
>> 
>> Benchmark                    (testFile)       Mode  Cnt  Score     Error  Units
>> BenchJDK.jdk                 /twitter.json  thrpt    2   1502.489          ops/s
>> BenchJDK.scalar            /twitter.json  thrpt    2   2261.224          ops/s
>> BenchJDK.vector_128  /twitter.json  thrpt    2  16708.845          ops/s
>> BenchJDK.vector_256  /twitter.json  thrpt    2  21155.730          ops/s
>> BenchJDK.vector_512  /twitter.json  thrpt    2  31976.168          ops/s
>> 
>> On older Xeons (prior to Icelake Server) 512-bit ByteVector shuffle was not-intensified originally, JDK-8290322 enabled its intrinsification on targets support AVX512BW features. 
>> 
>> We see around ~10x gain with 256-bit species which matches to the C# implementation. 
>> 
>> If user's SIMD kernel is specifically crafted with 512-bit species, then compiler emits code favouring AVX512 ISA.  One can always customize his kernel to pick most efficient intrinsic implementation for a given target, I see some feature level check in C# implementation[1].    
>> 
>> Best Regards,
>> Jatin 
>> 
>> [1] https://github.com/simdutf/SimdUnicode/blob/main/src/UTF8.cs#L30
>> 
>> From: Daniel Lemire <mailto:daniel at lemire.me> 
>> Sent: Thursday, June 20, 2024 9:11 PM
>> To: Bhateja, Jatin <mailto:jatin.bhateja at intel.com>; mailto:panama-dev at openjdk.org
>> Subject: Re: Vector API : Lack of support for vectorized lookup tables
>> 
>> Thanks Jatin.
>> 
>> On Intel hardware, I am asking for pshufb, which C# makes available as Ssse3.Shuffle. It is cheap (1 cycle latency, good throughput) and widely supported (can effectively be assumed today). The equivalent on ARM hardware is vtbl (or vtbx). Also cheap (when used with 1 register/table) and ubiquitous.
>> 
>> AVX-512 is fantastic. But it seems unwise to me to assume AVX-512 on Intel hardware. Furthermore, these instructions (vpermt2b) are significantly more expensive. They are not dirt cheap, 1 cycle, instructions.
>> 
>> It is also not clear to me that the JDK-24 proposal would even result in a single instruction.
>> 
>> Still, if the goal here is to make the Java Vector API work well on the assumption that you have AVX-512, then I am fine with such a design decision... but that's not what we have right now. Benchmarking 'rearrange' on Intel AVX-512 hardware indicates that it is a performance burden.
>> 
>> 
>> 
>> 
>> Vectorized table lookup is part of new operation support[1] for JDK-24.
>> 
>> Following Two table lookup instructions available on x86 targets supporting AVX512 feature[s] can be used to optimize lookups.
>> 
>> VPERMT2B[2] : Full Permute of Bytes From Two Tables Overwriting a Table
>> VPERMT2W/D/Q/PS/PD[3] : Full Permute From Two Tables Overwriting One Table
>> 
>> Best Regards,
>> Jatin
>> 
>> [1] https://mail.openjdk.org/pipermail/panama-dev/2024-May/020408.html
>> [2] https://www.felixcloutier.com/x86/vpermt2b
>> [3] https://www.felixcloutier.com/x86/vpermt2w:vpermt2d:vpermt2q:vpermt2ps:vpermt2pd
>> 
>> 
>> 
>> From: panama-dev <mailto:mailto:panama-dev-retn at openjdk.org> On Behalf Of Daniel Lemire
>> Sent: Wednesday, June 19, 2024 9:48 PM
>> To: mailto:mailto:panama-dev at openjdk.org
>> Subject: Vector API : Lack of support for vectorized lookup tables
>> 
>> When parsing strings with SIMD instructions, vectorized table lookup like vtbl (ARM NEON) are important. They are cheap (often run in 1 cycle) and powerful.
>> 
>> Though the details depend on the exact instruction set, the general idea is that you provide a 16-byte table, and a vector with indexes. If the indexes are in the range [0,16), then the byte is retrieved from the 16-byte table. When programming in C#, you can call them directly: e.g., Ssse3.Shuffle or AdvSimd.Arm64.VectorTableLookup. Google relies on Highway (a C++ framework) which offers TableLookupBytes for this purpose.
>> 
>> You can use these instructions to validate and transcode Unicode, to parse DNS records, faster regular expression parsers, base64 codecs, cryptographic hash functions and so forth. The applications are almost endless (see reference at the end). These instructions are effectively ubiquitous in 2024. On x64 systems, SSSE3 and its pshufb instruction (equivalent to ARM's vtbl) are increasingly assumed as a requirement (Windows 11, RedHat, etc.). For example, it is used in Chromium (arguably the most important Web engine) to parse HTML quickly:
>> https://chromium-review.googlesource.com/c/chromium/src/+/5538407
>> 
>> The .NET runtime uses these instructions, calling them from C#: E.g., see their base64 encoder... (System/Buffers/Text/Base64Decoder.cs) In C#, we see fast tokenizers and parsers making use of these instructions.
>> 
>> Unfortunately, the Vector API in Java has no equivalent.
>> 
>> It may seem lie rearrange and selectFrom are related, but these are not vectorized lookups. And once compiled, they generate a long flow of instructions. It provides the same functionality, but without the performance. And, of course, the performance is the whole point of using something like the Vector API.
>> 
>> Overall, this lack of access to an important functionality simply cuts off important algorithmic optimizations from Java.
>> 
>> 
>> 
>> - Daniel
>> 
>> 
>> ---
>> References:
>> 
>> 
>> Transcoding Billions of Unicode Characters per Second with SIMD Instructions
>> Software: Practice and Experience 52 (2), 2022
>> https://arxiv.org/abs/2109.10433
>> 
>> Validating UTF-8 In Less Than One Instruction Per Byte
>> Software: Practice and Experience 51 (5), 2021
>> https://arxiv.org/abs/2010.03090
>> 
>> Faster Base64 Encoding and Decoding using AVX2 Instructions
>> ACM Transactions on the Web 12 (3), 2018
>> https://arxiv.org/abs/1704.00605
>> 
>> Parsing Gigabytes of JSON per Second
>> VLDB Journal 28 (6), 2019
>> https://arxiv.org/abs/1902.08318