RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v6]
Quan Anh Mai
qamai at openjdk.org
Tue Apr 11 09:38:50 UTC 2023
On Mon, 10 Apr 2023 15:16:59 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
>> Yes, I think it is a drawback of this approach. However, we currently do not support shuffling for 256-bit vectors on AVX1 machines either, and AVX1 seems to be a special case in this regard. These float and double species may also be less common in Vector API usage, since they are larger than SPECIES_PREFERRED.
>
> Hi @merykitty, I agree with you that SPECIES_PREFERRED is preferred for vector algorithms intercepting both integral and floating-point vectors.
>
> FTR, we now see a perf regression with a Float256-based micro on AVX=1 targets:
>
>
> public static short micro() {
> VectorShuffle<Float> iota = FloatVector.SPECIES_256.iotaShuffle(0, 1, true);
> return iota.cast(ShortVector.SPECIES_128).toVector().reinterpretAsShorts().lane(1);
> }
>
> CPROMPT>javad --add-modules=jdk.incubator.vector -XX:UseAVX=1 -XX:+PrintIntrinsics -XX:CompileCommand=compileonly,shufflef::micro -cp . shufflef
> CompileCommand: compileonly shufflef.micro bool compileonly = true
> ** not supported: arity=1 op=reinterpret/1 vlen1=8 etype1=int ismask=0
> ** not supported: arity=1 op=cast/1 vlen1=8 etype1=int ismask=0
> @ 17 java.lang.Object::getClass (0 bytes) (intrinsic)
> @ 24 java.lang.Object::getClass (0 bytes) (intrinsic)
> @ 45 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) failed to inline (intrinsic)
> @ 34 java.lang.Object::getClass (0 bytes) (intrinsic)
> @ 54 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) failed to inline (intrinsic)
> @ 17 java.lang.Object::getClass (0 bytes) (intrinsic)
> @ 24 java.lang.Object::getClass (0 bytes) (intrinsic)
> @ 45 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) (intrinsic)
> @ 292 java.lang.Object::getClass (0 bytes) (intrinsic)
> @ 298 java.lang.Object::getClass (0 bytes) (intrinsic)
> @ 322 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) (intrinsic)
> @ 292 java.lang.Object::getClass (0 bytes) (intrinsic)
> @ 298 java.lang.Object::getClass (0 bytes) (intrinsic)
> @ 322 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) (intrinsic)
> @ 16 jdk.internal.vm.vector.VectorSupport::extract (35 bytes) (intrinsic)
> [time] 386ms [res]3392
> CPROMPT>export JAVA_HOME=/home/jatinbha/softwares/jdk-20/
> CPROMPT>export PATH=$JAVA_HOME/bin:$PATH
> CPROMPT>javad --add-modules=jdk.incubator.vector -XX:UseAVX=1 -XX:+PrintIntrinsics -XX:CompileCommand=compileonly,shufflef::micro -cp . shufflef
> CompileCommand: compileonly shufflef.micro bool compileonly = true
> WARNING: Using incubator modules: jdk.incubator.vector
> @ 3 jdk.internal.misc.Unsafe::loadFence (5 bytes) (intrinsic)
> @ 3 jdk.internal.misc.Unsafe::loadFence (5 bytes) (intrinsic)
> @ 17 jdk.internal.vm.vector.VectorSupport::shuffleToVector (33 bytes) (intrinsic)
> @ 292 java.lang.Object::getClass (0 bytes) (intrinsic)
> @ 298 java.lang.Object::getClass (0 bytes) (intrinsic)
> @ 322 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) (intrinsic)
> @ 16 jdk.internal.vm.vector.VectorSupport::extract (35 bytes) (intrinsic)
> [time] 7ms [res]3392
@jatin-bhateja Since `Float256Shuffle` is represented as a 256-bit int vector, which AVX1 does not support, the compiled code falls back to the Java implementation, which explains the regression. However, creating a `VectorShuffle` without using it for `Vector::rearrange` is not really useful, and the code snippet is similar to `ShortVector.SPECIES_128.iotaShuffle(0, 1, true).toVector().reinterpretAsShorts().lane(1)`. As a result, I think some regressions in AVX1 edge cases are acceptable given the improvement in all other operations on all platforms.
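
To illustrate why the two expressions are expected to produce the same lane value, here is a minimal sketch in plain Java that does not depend on `jdk.incubator.vector`. It assumes that `iotaShuffle(start, step, true)` fills lane `i` with `start + i * step` wrapped into `[0, VLENGTH)`, and that a cast between species with the same lane count (8 for both `FloatVector.SPECIES_256` and `ShortVector.SPECIES_128`) keeps the source indices unchanged. The class and method names are hypothetical and only model the index arithmetic, not the actual implementation.

```java
import java.util.Arrays;

class IotaShuffleSketch {
    // Hypothetical model of the source indices of iotaShuffle(start, step, true):
    // lane i holds (start + i * step) wrapped into [0, vlength).
    static int[] iotaIndices(int vlength, int start, int step) {
        int[] indices = new int[vlength];
        for (int i = 0; i < vlength; i++) {
            indices[i] = Math.floorMod(start + i * step, vlength);
        }
        return indices;
    }

    public static void main(String[] args) {
        // Float256 and Short128 both have 8 lanes, so under the assumption above
        // the cast shuffle carries the same indices as one built directly.
        int[] viaFloat256 = iotaIndices(8, 0, 1); // models the Float256 iota, then cast
        int[] viaShort128 = iotaIndices(8, 0, 1); // models the Short128 iota directly
        System.out.println(Arrays.equals(viaFloat256, viaShort128));
        System.out.println(viaShort128[1]); // the value read by lane(1) in this model
    }
}
```

Under these assumptions the shuffle contents depend only on the lane count, so the Float256 detour adds nothing except the unsupported 256-bit int vector on AVX1.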
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1162555106