Vector performance issue.

Wed Sep 20 14:46:15 UTC 2023

Hi Jake,

Thanks for reporting this. Are you sure that the "missing constant" bailouts are from the
compilation of bfloatDot (and not from a "partial" compilation of some of the methods it calls)?

When running with the following command line args, all inlining and intrinsification seem to work:

-XX:CompileCommand=quiet -XX:CompileCommand=compileonly,*::bfloatDot -XX:+PrintCompilation
-XX:+PrintInlining -XX:+PrintIntrinsics -XX:-TieredCompilation

I see the same bailouts for other methods, for example:

  53843  796             jdk.incubator.vector.Float512Vector::bOp (17 bytes)
  ** missing constant: opr=RShiftI vclass=ConP etype=ConP vlen=ConI
  ** missing constant: opr=RShiftI vclass=ConP etype=ConP vlen=ConI

[...]

     @ 96   jdk.internal.vm.vector.VectorSupport::binaryOp (38 bytes)   failed to inline (intrinsic)

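But that's expected because the opr argument is non-constant there: when bOp is compiled
on its own, the operator is just a method parameter and not a compile-time constant.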

Best regards,
Tobias

On 16.09.23 21:43, Jake Luciani wrote:
> Hi,
> 
> I've been struggling with a problem recently using the Vector API.
> It appears that reduceLanes is not using the intrinsic.
> 
>           ns  percent  samples  top
>   ----------  -------  -------  ---
>  13240151836   88.21%     1324  jdk.incubator.vector.FloatVector.reduceLanesTemplate
>   1349991099    8.99%      135  jdk.incubator.vector.FloatVector.lanewiseTemplate
> 
> I've tested OpenJDK 20 and 21, and my machine has AVX-512.
> 
> When I run with -XX:+PrintIntrinsics I see the following (among others):
> 
>   ** missing constant: opr=RShiftI vclass=ConP etype=ConP vlen=ConI
>                                         @ 1   java.lang.invoke.MethodHandleImpl::isCompileConstant (2 bytes)   (intrinsic)
>                                                 @ 16   jdk.internal.util.Preconditions::checkIndex (22 bytes)   (intrinsic)
>                                                 @ 5   jdk.internal.misc.Unsafe::getIntUnaligned (83 bytes)   (intrinsic)
>                                           @ 55   java.lang.Float::intBitsToFloat (0 bytes)   (intrinsic)
>                                       @ 12   jdk.internal.misc.Unsafe::allocateInstance (0 bytes)   (intrinsic)
>                                       @ 1   java.lang.Object::getClass (0 bytes)   (intrinsic)
>                                       @ 5   java.lang.Object::getClass (0 bytes)   (intrinsic)
>                                               @ 19   jdk.internal.vm.vector.VectorSupport::fromBitsCoerced (35 bytes)   (intrinsic)
>                                               @ 1   java.lang.Object::getClass (0 bytes)   (intrinsic)
>                                               @ 5   java.lang.Object::getClass (0 bytes)   (intrinsic)
>                                           @ 123   java.lang.Object::getClass (0 bytes)   (intrinsic)
>                                           @ 154   jdk.internal.vm.vector.VectorSupport::binaryOp (38 bytes)   (intrinsic)
>                                   @ 123   java.lang.Object::getClass (0 bytes)   (intrinsic)
>                                   @ 154   jdk.internal.vm.vector.VectorSupport::binaryOp (38 bytes)   failed to inline (intrinsic)
> 
> I've included a JMH benchmark that reproduces the issue.
> 
> -Jake
> 
> import jdk.incubator.vector.FloatVector;
> import jdk.incubator.vector.IntVector;
> import jdk.incubator.vector.ShortVector;
> import jdk.incubator.vector.VectorOperators;
> import org.openjdk.jmh.annotations.*;
> import org.openjdk.jmh.infra.Blackhole;
> 
> import java.util.concurrent.ThreadLocalRandom;
> import java.util.concurrent.TimeUnit;
> 
> 
> @Warmup(iterations = 1, time = 5)
> @Measurement(iterations = 3, time = 5)
> @Fork(warmups = 1, value = 1, jvmArgsPrepend = {
>         "--add-modules=jdk.incubator.vector",
>         "--enable-preview"})
> public class VectorPerfBench
> {
>     private static final int SIZE = 8192;
>     private static final IntVector BF16_BYTE_SHIFT = IntVector.broadcast(IntVector.SPECIES_512, 16);
> 
>     public static short float32ToBFloat16(float f) {
>         return (short) (Float.floatToIntBits(f) >> 16);
>     }
> 
>     @State(Scope.Benchmark)
>     public static class Parameters {
>         final short[] s1 = new short[SIZE];
>         final short[] s2 = new short[SIZE];
> 
>         public Parameters() {
>             for (int i = 0; i < SIZE; i++) {
>                 s1[i] = float32ToBFloat16(ThreadLocalRandom.current().nextFloat());
>                 s2[i] = float32ToBFloat16(ThreadLocalRandom.current().nextFloat());
>             }
>         }
>     }
> 
>     @Benchmark
>     @OutputTimeUnit(TimeUnit.MILLISECONDS)
>     @BenchmarkMode(Mode.Throughput)
>     public void bfloatDot(Parameters p, Blackhole bh) {
>         FloatVector acc = FloatVector.zero(FloatVector.SPECIES_512);
>         for (int i = 0; i < SIZE; i += FloatVector.SPECIES_512.length()) {
> 
>             var f1 = ShortVector.fromArray(ShortVector.SPECIES_256, p.s1, i)
>                     .convertShape(VectorOperators.ZERO_EXTEND_S2I, IntVector.SPECIES_512, 0)
>                     .lanewise(VectorOperators.LSHL, BF16_BYTE_SHIFT)
>                     .reinterpretAsFloats();
> 
>             var f2 = ShortVector.fromArray(ShortVector.SPECIES_256, p.s2, i)
>                     .convertShape(VectorOperators.ZERO_EXTEND_S2I, IntVector.SPECIES_512, 0)
>                     .lanewise(VectorOperators.LSHL, BF16_BYTE_SHIFT)
>                     .reinterpretAsFloats();
> 
>             acc = acc.add(f1.mul(f2));
>         }
> 
>         bh.consume(acc.reduceLanes(VectorOperators.ADD));
>     }
> 
>     public static void main(String[] args) throws Exception {
>         org.openjdk.jmh.Main.main(args);
>     }
> }
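
For cross-checking the benchmark's result independently of the Vector API, the computation in bfloatDot can be written in scalar form. The following is only a sketch: the class name BFloatDotReference and the helper bfloat16ToFloat32 are hypothetical and not part of the original benchmark. It mirrors the same steps (zero-extend each short to int, shift left by 16, reinterpret as float, multiply and accumulate). The scalar loop sums in a different order than the lane-wise accumulator, so results on random data may differ slightly in the last bits.

```java
public class BFloatDotReference {

    // Truncating float32 -> bfloat16 conversion, same as in the benchmark.
    public static short float32ToBFloat16(float f) {
        return (short) (Float.floatToIntBits(f) >> 16);
    }

    // Hypothetical helper: widen bfloat16 bits back to float32.
    // Zero-extend to int, shift into the upper 16 bits, reinterpret as float.
    public static float bfloat16ToFloat32(short s) {
        return Float.intBitsToFloat((s & 0xFFFF) << 16);
    }

    // Scalar dot product over two arrays of bfloat16 bit patterns.
    public static float dot(short[] a, short[] b) {
        float acc = 0f;
        for (int i = 0; i < a.length; i++) {
            acc += bfloat16ToFloat32(a[i]) * bfloat16ToFloat32(b[i]);
        }
        return acc;
    }

    public static void main(String[] args) {
        // All values below are exactly representable in bfloat16.
        short[] a = { float32ToBFloat16(1.0f), float32ToBFloat16(2.0f), float32ToBFloat16(0.5f) };
        short[] b = { float32ToBFloat16(4.0f), float32ToBFloat16(0.25f), float32ToBFloat16(8.0f) };
        System.out.println(dot(a, b)); // prints 8.5
    }
}
```

Comparing dot(p.s1, p.s2) against acc.reduceLanes(VectorOperators.ADD) with a small tolerance is one way to confirm that the vectorized path still computes the intended value while experimenting with compiler flags.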