RFR: 8306136: [vectorapi] Intrinsics of VectorMask.laneIsSet()

Paul Sandoz psandoz at openjdk.org
Wed Jun 21 19:07:04 UTC 2023


On Mon, 29 May 2023 11:53:03 GMT, Eric Liu <eliu at openjdk.org> wrote:

> VectorMask.laneIsSet() [1] is implemented based on VectorMask.toLong() [2], and its performance depends heavily on the intrinsification of toLong(). However, if `toLong()` fails to intrinsify, on some architectures or for unsupported species, it is much more expensive than a plain getBits(). Besides, some CPUs (e.g. with Arm Neon) may not have efficient instructions to implement toLong(), so we propose to intrinsify VectorMask.laneIsSet separately.
> 
> This patch optimizes laneIsSet() by calling the existing intrinsic method VectorSupport.extract(), so no new intrinsic method is introduced. The C2 compiler intrinsification logic for _VectorExtract has also been extended to better support laneIsSet(). It tries to extract the mask's lane value with an ExtractUB node if the hardware backend supports it. On hardware without ExtractUB backend support, C2 still tries to generate the toLong()-related nodes, which behaves the same as before the patch.
> 
> Key changes in this patch:
> 
> 1. Reuse the intrinsic `VectorSupport.extract()` on the Java side. No new intrinsic method is introduced.
> 2. In the compiler, `ExtractUBNode` is generated if the backend supports it. If not, the original "toLong" pattern is generated if it is implemented. Otherwise, the default Java `getBits[i]` is used rather than the expensive and complicated toLong()-based implementation.
> 3. Enable `ExtractUBNode` on AArch64 to extract the lane value for a vector mask in the compiler, together with changing its bottom type to TypeInt::BOOL. This helps optimize the conditional selection generated by:
> 
>    ```
> 
>        public boolean laneIsSet(int i) {
>            return VectorSupport.extract(..., defaultImpl) == 1L;
>        }
> 
>    ```
> 
> [Test]
> hotspot:compiler/vectorapi and jdk/incubator/vector passed.
> 
> [Performance]
> 
> The table below shows the performance gain on a Neon machine with a 128-bit vector size. For the 64 and 128 species, the improvement comes from this intrinsic. For other species, which cannot be intrinsified, the gain comes from the change to the default Java implementation, i.e. getBits[i] vs. toLong().
> 
> 
> Benchmark                               Gain (after/before)
> microMaskLaneIsSetByte128_con           2.47
> microMaskLaneIsSetByte128_var           1.82
> microMaskLaneIsSetByte256_con           3.01
> microMaskLaneIsSetByte256_var           3.04
> microMaskLaneIsSetByte512_con           4.83
> microMaskLaneIsSetByte512_var           4.86
> microMaskLaneIsSetByte64_con            1.57
> microMaskLaneIsSetByte64_var            1.18...
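
For readers following along, the two non-intrinsified paths mentioned in the quoted description (a direct getBits[i] lookup vs. a toLong()-based lookup) can be sketched in plain Java. This is only an illustrative model under stated assumptions, not the JDK code: the class `MaskLaneModel` and its method names are hypothetical, and a boolean array stands in for a real VectorMask.

```java
import java.util.Objects;

// Hypothetical stand-in for a vector mask: one boolean per lane.
// Not the JDK implementation; names are invented for illustration.
final class MaskLaneModel {
    private final boolean[] bits;

    MaskLaneModel(boolean[] bits) {
        // A long can only hold 64 lanes, which bounds the toLong()-based path.
        if (bits.length > Long.SIZE) {
            throw new IllegalArgumentException("at most 64 lanes supported");
        }
        this.bits = bits.clone();
    }

    // Packs every lane into a long, lane i at bit i.
    // Without intrinsification this is a loop over all lanes.
    long toLong() {
        long packed = 0L;
        for (int i = 0; i < bits.length; i++) {
            if (bits[i]) {
                packed |= 1L << i;
            }
        }
        return packed;
    }

    // Lookup that first packs the whole mask and then tests one bit.
    boolean laneIsSetViaToLong(int i) {
        Objects.checkIndex(i, bits.length);
        return ((toLong() >>> i) & 1L) != 0;
    }

    // Lookup that reads the single lane directly.
    boolean laneIsSetViaBits(int i) {
        Objects.checkIndex(i, bits.length);
        return bits[i];
    }
}
```

Both lookups return the same answer for every valid lane index. The direct read touches one element, while the packed variant walks all lanes first, which is the cost the description refers to when toLong() is not intrinsified.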

I am getting crashes on linux-x64-debug when using these VM options:

-XX:UseAVX=3 -XX:+UnlockDiagnosticVMOptions -XX:+UseKNLSetting

This occurs for the test tag `tier3-vector-avx512` when running the tests in `open/test/jdk/:jdk_vector`.

Relevant bits from the HS error log file:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/opt/mach5/mesos/work_dir/slaves/cd627e65-f015-4fb1-a1d2-b6c9b8127f98-S9618/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/24abb99d-ff0d-447b-9153-bb7048d6d487/runs/d1fc7a0c-634f-4df4-9d5f-6a464bb177b9/workspace/open/src/hotspot/share/opto/vectorIntrinsics.cpp:2586), pid=1090716, tid=1090731
#  assert(!Matcher::has_predicated_vectors()) failed: should be
#
...
Current CompileTask:
C2:   9402  989    b        jdk.incubator.vector.Int256Vector$Int256Mask::laneIsSet (38 bytes)

Stack: [0x00007fa325bfc000,0x00007fa325cfd000],  sp=0x00007fa325cf78f0,  free space=1006k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x1815626]  LibraryCallKit::inline_vector_extract()+0xc26  (vectorIntrinsics.cpp:2586)
V  [libjvm.so+0x121cc94]  LibraryIntrinsic::generate(JVMState*)+0x1c4  (library_call.cpp:117)
V  [libjvm.so+0x853141]  CallGenerator::do_late_inline_helper()+0x9b1  (callGenerator.cpp:695)
V  [libjvm.so+0x9ea704]  Compile::inline_incrementally_one()+0xd4  (compile.cpp:2015)
V  [libjvm.so+0x9eb873]  Compile::inline_incrementally(PhaseIterGVN&)+0x273  (compile.cpp:2098)
V  [libjvm.so+0x9eebc7]  Compile::Optimize()+0x427  (compile.cpp:2233)
V  [libjvm.so+0x9f1f75]  Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x1aa5  (compile.cpp:839)
V  [libjvm.so+0x84bc04]  C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x3c4  (c2compiler.cpp:118)
V  [libjvm.so+0x9fdf10]  CompileBroker::invoke_compiler_on_method(CompileTask*)+0xa00  (compileBroker.cpp:2265)
V  [libjvm.so+0x9fed98]  CompileBroker::compiler_thread_loop()+0x618  (compileBroker.cpp:1944)
V  [libjvm.so+0xeb6dec]  JavaThread::thread_main_inner()+0xcc  (javaThread.cpp:719)
V  [libjvm.so+0x17970aa]  Thread::call_run()+0xba  (thread.cpp:217)
V  [libjvm.so+0x149715c]  thread_native_entry(Thread*)+0x11c  (os_linux.cpp:775)
...
model name	: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves nt_good wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid md_clear arch_capabilities
...

-------------

PR Comment: https://git.openjdk.org/jdk/pull/14200#issuecomment-1601442228

