RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v6]
Jatin Bhateja
jbhateja at openjdk.org
Tue Apr 11 17:50:48 UTC 2023
On Tue, 11 Apr 2023 09:36:06 GMT, Quan Anh Mai <qamai at openjdk.org> wrote:
>> Hi @merykitty, I agree that SPECIES_PREFERRED is preferable for vector algorithms that mix integral and floating-point vectors.
>>
>> FTR, we now see a perf regression with a Float256-based micro on AVX=1 targets:
>>
>>
>> public static short micro() {
>>     VectorShuffle<Float> iota = FloatVector.SPECIES_256.iotaShuffle(0, 1, true);
>>     return iota.cast(ShortVector.SPECIES_128).toVector().reinterpretAsShorts().lane(1);
>> }
>>
>> CPROMPT>javad --add-modules=jdk.incubator.vector -XX:UseAVX=1 -XX:+PrintIntrinsics -XX:CompileCommand=compileonly,shufflef::micro -cp . shufflef
>> CompileCommand: compileonly shufflef.micro bool compileonly = true
>> ** not supported: arity=1 op=reinterpret/1 vlen1=8 etype1=int ismask=0
>> ** not supported: arity=1 op=cast/1 vlen1=8 etype1=int ismask=0
>> @ 17 java.lang.Object::getClass (0 bytes) (intrinsic)
>> @ 24 java.lang.Object::getClass (0 bytes) (intrinsic)
>> @ 45 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) failed to inline (intrinsic)
>> @ 34 java.lang.Object::getClass (0 bytes) (intrinsic)
>> @ 54 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) failed to inline (intrinsic)
>> @ 17 java.lang.Object::getClass (0 bytes) (intrinsic)
>> @ 24 java.lang.Object::getClass (0 bytes) (intrinsic)
>> @ 45 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) (intrinsic)
>> @ 292 java.lang.Object::getClass (0 bytes) (intrinsic)
>> @ 298 java.lang.Object::getClass (0 bytes) (intrinsic)
>> @ 322 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) (intrinsic)
>> @ 292 java.lang.Object::getClass (0 bytes) (intrinsic)
>> @ 298 java.lang.Object::getClass (0 bytes) (intrinsic)
>> @ 322 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) (intrinsic)
>> @ 16 jdk.internal.vm.vector.VectorSupport::extract (35 bytes) (intrinsic)
>> [time] 386ms [res]3392
>> CPROMPT>export JAVA_HOME=/home/jatinbha/softwares/jdk-20/
>> CPROMPT>export PATH=$JAVA_HOME/bin:$PATH
>> CPROMPT>javad --add-modules=jdk.incubator.vector -XX:UseAVX=1 -XX:+PrintIntrinsics -XX:CompileCommand=compileonly,shufflef::micro -cp . shufflef
>> CompileCommand: compileonly shufflef.micro bool compileonly = true
>> WARNING: Using incubator modules: jdk.incubator.vector
>> @ 3 jdk.internal.misc.Unsafe::loadFence (5 bytes) (intrinsic)
>> @ 3 jdk.internal.misc.Unsafe::loadFence (5 bytes) (intrinsic)
>> @ 17 jdk.internal.vm.vector.VectorSupport::shuffleToVector (33 bytes) (intrinsic)
>> @ 292 java.lang.Object::getClass (0 bytes) (intrinsic)
>> @ 298 java.lang.Object::getClass (0 bytes) (intrinsic)
>> @ 322 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) (intrinsic)
>> @ 16 jdk.internal.vm.vector.VectorSupport::extract (35 bytes) (intrinsic)
>> [time] 7ms [res]3392
>
> @jatin-bhateja Since `Float256Shuffle` is represented as a 256-bit int vector, which is not supported by AVX1, the compiled code falls back to the Java implementation, which explains the regression. However, creating a `VectorShuffle` that is not then used for `Vector::rearrange` is not really useful, and the code snippet is similar to `ShortVector.SPECIES_128.iotaShuffle(0, 1, true).toVector().reinterpretAsShorts().lane(1)`. As a result, I think some regressions in AVX1 edge cases are acceptable given the improvement in all other operations on all platforms.
Agreed. This also fixes the case of shuffle vectors narrower than 32 bits, i.e. shuffles involving Long128, Int64 and Float64 will benefit on x86.
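
For readers following the thread, here is a minimal scalar sketch of what an iota shuffle holds and how `Vector::rearrange` consumes it. It is only a model on plain arrays, not the Vector API or the code in this PR: the class `ShuffleModel` and its methods are hypothetical names, and the wrap is modelled with `Math.floorMod`, which I assume matches the API's wrapping for power-of-two lane counts.

```java
import java.util.Arrays;

// Hypothetical scalar model; does not use jdk.incubator.vector.
final class ShuffleModel {
    // Models iotaShuffle(start, step, true) for a species with vlen lanes:
    // lane i holds (start + i * step) wrapped into [0, vlen).
    static int[] iotaShuffle(int vlen, int start, int step) {
        int[] idx = new int[vlen];
        for (int i = 0; i < vlen; i++) {
            idx[i] = Math.floorMod(start + i * step, vlen);
        }
        return idx;
    }

    // Models v.rearrange(shuffle): result lane i is taken from v[shuffle[i]].
    static short[] rearrange(short[] v, int[] shuffle) {
        short[] r = new short[v.length];
        for (int i = 0; i < v.length; i++) {
            r[i] = v[shuffle[i]];
        }
        return r;
    }

    public static void main(String[] args) {
        short[] v = {10, 11, 12, 13, 14, 15, 16, 17};
        // Identity shuffle, as in the micro above (start = 0, step = 1).
        System.out.println(Arrays.toString(iotaShuffle(8, 0, 1)));
        // Rotate left by one lane (start = 1, step = 1).
        System.out.println(Arrays.toString(rearrange(v, iotaShuffle(8, 1, 1))));
    }
}
```

In this model the micro's `lane(1)` simply reads index 1 of the identity shuffle, which is why the shuffle-to-vector path, rather than `rearrange`, dominates that benchmark.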
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1163147535