AVX512 intrinsics not taken on CPU supporting AVX512?

Sat May 30 14:15:49 UTC 2020

Hi Sandhya,

using -XX:UseAVX=3 worked, thanks! The time for the copy in question
dropped from 2.1 ns/op with 256-bit vectors to a nice 1.6 ns/op with
512-bit vectors.

That makes me wonder why intrinsification of operations on 512-bit vector
species is tied to an explicit opt-in via the -XX:UseAVX=3 option.
It makes sense for the JIT auto-vectorizer to generate AVX512 code only
with that option enabled, since otherwise we might get performance
degradation, and the JVM may well "know better" about such hidden things,
which the user can't control anyway.
But when I explicitly use the 512-bit species, I am practically stating
that "I know better" and am asking for AVX512 on x86. Not intrinsifying
the operations at all and falling back to the Java code is really not an
option there.
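
For anyone else running into this, here is a minimal sketch of how one might
check which UseAVX level the JVM actually settled on, and whether it came from
the command line or from the JVM's own defaults. It assumes a HotSpot JVM and
uses the standard HotSpotDiagnosticMXBean; the class name UseAvxCheck and the
method describeUseAvx are made up for illustration, and the flag may not exist
on non-x86 platforms, which the sketch treats as "unavailable".

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import com.sun.management.VMOption;
import java.lang.management.ManagementFactory;

// Hypothetical helper class, not part of the benchmark.
public class UseAvxCheck {

    // Returns a one-line description of the effective UseAVX setting.
    static String describeUseAvx() {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        if (bean == null) {
            // Assumption: a null bean means we are not on a HotSpot JVM.
            return "UseAVX: unavailable (no HotSpot diagnostic bean)";
        }
        try {
            VMOption opt = bean.getVMOption("UseAVX");
            // The origin tells whether the value was set by the user or by the JVM.
            return "UseAVX=" + opt.getValue() + " (origin: " + opt.getOrigin() + ")";
        } catch (IllegalArgumentException e) {
            // Thrown when the flag is not defined, e.g. on non-x86 builds.
            return "UseAVX: unavailable (flag not defined on this platform)";
        }
    }

    public static void main(String[] args) {
        System.out.println(describeUseAvx());
    }
}
```

Running it once without any flags and once with -XX:UseAVX=3 should show
whether the JVM picked a lower level on its own for the CPU at hand.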

Thanks!

On Sat, May 30, 2020 at 01:53, Viswanathan, Sandhya <
sandhya.viswanathan at intel.com> wrote:

> Hi Kai,
>
> Please try with explicitly specifying -XX:UseAVX=3 on JVM command line.
>
> Best Regards,
> Sandhya
>
>
> -----Original Message-----
> From: panama-dev <panama-dev-bounces at openjdk.java.net> On Behalf Of Kai
> Burjack
> Sent: Friday, May 29, 2020 11:21 AM
> To: panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
> Subject: AVX512 intrinsics not taken on CPU supporting AVX512?
>
> I was just measuring performance of this code:
> ```
> fromArray(SPECIES_512, es, 0).intoByteBuffer(bb, 0, nativeOrder());
> ```
> comparing it with:
> ```
> fromArray(SPECIES_256, es, 0).intoByteBuffer(bb, 0, nativeOrder());
> fromArray(SPECIES_256, es, 8).intoByteBuffer(bb, 32, nativeOrder());
> ```
> and found that the former was more than 10x slower than the latter on a
> Xeon Platinum 8124M, which according to cpuinfo does support AVX512:
> ```
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
> pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm
> constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf
> tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe
> popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm
> 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep
> bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb
> avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku
> ospke
> ```
> Are AVX512 mov intrinsics not implemented right now, or why are they not
> taken?
> Thanks!
>
> Current benchmark results:
>
> https://github.com/JOML-CI/panama-vector-bench#with--djdkincubatorvectorvector_access_oob_check0-and-abstractshufflecheckindexes_use_vector_access_oob_checkpatch-1
>
> Kai.
>