Vector performance issue.

Jake Luciani jake at apache.org
Tue Sep 19 14:48:15 UTC 2023


Thanks Paul.

I've attached both the compilation and perfasm outputs here:
https://gist.github.com/tjake/4b4284e8d697e4a151b0e0877ab04fef

Jake

On Mon, Sep 18, 2023 at 6:36 PM Paul Sandoz <paul.sandoz at oracle.com> wrote:
>
> I don’t think that branch is the issue: as long as the method inlines, the comparison with the constant (FIRST_NONZERO) will get folded.
>
> Can you share the full output when running with the following HotSpot options?
>
> -XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining -XX:+PrintCompilation
>
> (If possible place in GitHub gist.)
>
>
>
> I don’t have an AVX-512 machine, so I tried with the JDK tip and AVX2 (I don’t think anything has changed substantially in the implementation between the tip and the soon-to-be-released JDK 21).
>
> See the snippet below. The generated code is good (I turned off loop unrolling to make the code easier to read: -XX:LoopUnrollLimit=0).
>
>              0x000000011a5dd7c3:   xor    %ebx,%ebx
>           ↗  0x000000011a5dd7c5:   mov    $0x2000,%r10d
>   0.01%   │  0x000000011a5dd7cb:   sub    %ebx,%r10d
>   0.01%   │  0x000000011a5dd7ce:   cmp    $0x1f40,%r10d
>   0.01%   │  0x000000011a5dd7d5:   mov    $0x1f40,%edx
>           │  0x000000011a5dd7da:   cmova  %edx,%r10d
>           │  0x000000011a5dd7de:   add    %ebx,%r10d
>   0.18%  ↗│  0x000000011a5dd7e1:   vmovdqu 0x10(%r8,%rbx,2),%xmm2
>   0.16%  ││  0x000000011a5dd7e8:   vmovdqu 0x10(%rcx,%rbx,2),%xmm3
>  15.52%  ││  0x000000011a5dd7ee:   vpmovzxwd %xmm2,%ymm2
>   9.08%  ││  0x000000011a5dd7f3:   vpsllvd %ymm0,%ymm2,%ymm2
>   0.22%  ││  0x000000011a5dd7f8:   vpmovzxwd %xmm3,%ymm3
>   0.09%  ││  0x000000011a5dd7fd:   vpsllvd %ymm0,%ymm3,%ymm3
>  15.96%  ││  0x000000011a5dd802:   vmulps %ymm3,%ymm2,%ymm2
>   8.70%  ││  0x000000011a5dd806:   vaddps %ymm2,%ymm1,%ymm1
>  30.64%  ││  0x000000011a5dd80a:   add    $0x8,%ebx
>  18.07%  ││  0x000000011a5dd80d:   cmp    %r10d,%ebx
>          ╰│  0x000000011a5dd810:   jl     0x000000011a5dd7e1
>           │  0x000000011a5dd812:   mov    0x460(%r15),%rdx
>           │  0x000000011a5dd819:   test   %eax,(%rdx)
>   0.02%   │  0x000000011a5dd81b:   nopl   0x0(%rax,%rax,1)
>   0.01%   │  0x000000011a5dd820:   cmp    $0x2000,%ebx
>           ╰  0x000000011a5dd826:   jl     0x000000011a5dd7c5
>              0x000000011a5dd828:   mov    %rdi,0x78(%rsp)
>              0x000000011a5dd82d:   mov    %r9,0x60(%rsp)
>              0x000000011a5dd832:   vxorps %xmm0,%xmm0,%xmm0
>   0.02%      0x000000011a5dd836:   vaddss %xmm1,%xmm0,%xmm0
>   0.05%      0x000000011a5dd83a:   vpshufd $0x1,%xmm1,%xmm3
>   0.01%      0x000000011a5dd83f:   vaddss %xmm3,%xmm0,%xmm0
>   0.08%      0x000000011a5dd843:   vpshufd $0x2,%xmm1,%xmm3
>   0.07%      0x000000011a5dd848:   vaddss %xmm3,%xmm0,%xmm0
>   0.04%      0x000000011a5dd84c:   vpshufd $0x3,%xmm1,%xmm3
>   0.05%      0x000000011a5dd851:   vaddss %xmm3,%xmm0,%xmm0
>   0.06%      0x000000011a5dd855:   vextractf128 $0x1,%ymm1,%xmm3
>   0.02%      0x000000011a5dd85b:   vaddss %xmm3,%xmm0,%xmm0
>   0.04%      0x000000011a5dd85f:   vpshufd $0x1,%xmm3,%xmm2
>   0.02%      0x000000011a5dd864:   vaddss %xmm2,%xmm0,%xmm0
>   0.03%      0x000000011a5dd868:   vpshufd $0x2,%xmm3,%xmm2
>   0.04%      0x000000011a5dd86d:   vaddss %xmm2,%xmm0,%xmm0
>   0.06%      0x000000011a5dd871:   vpshufd $0x3,%xmm3,%xmm2
>   0.01%      0x000000011a5dd876:   vaddss %xmm2,%xmm0,%xmm0
>
>
>
> Paul.
>
>
>     @Benchmark
>     @OutputTimeUnit(TimeUnit.MILLISECONDS)
>     @BenchmarkMode(Mode.Throughput)
>     public float bfloatDot(Parameters p) {
>         FloatVector acc = FloatVector.zero(FloatVector.SPECIES_256);
>         for (int i = 0; i < SIZE; i += FloatVector.SPECIES_256.length()) {
>             var f1 = ShortVector.fromArray(ShortVector.SPECIES_128, p.s1, i)
>                     .convertShape(VectorOperators.ZERO_EXTEND_S2I,
>                             IntVector.SPECIES_256, 0)
>                     .lanewise(VectorOperators.LSHL, BF16_BYTE_SHIFT)
>                     .reinterpretAsFloats();
>
>             var f2 = ShortVector.fromArray(ShortVector.SPECIES_128, p.s2, i)
>                     .convertShape(VectorOperators.ZERO_EXTEND_S2I,
>                             IntVector.SPECIES_256, 0)
>                     .lanewise(VectorOperators.LSHL, BF16_BYTE_SHIFT)
>                     .reinterpretAsFloats();
>
>             acc = acc.add(f1.mul(f2));
>         }
>
>         return acc.reduceLanes(VectorOperators.ADD);
>     }
>
>
> > On Sep 18, 2023, at 8:45 AM, Jake Luciani <jake at apache.org> wrote:
> >
> > Looking at the code I wonder if it's this extra branch?
> >
> > @ForceInline
> > final
> > float reduceLanesTemplate(VectorOperators.Associative op,
> >                           Class<? extends VectorMask<Float>> maskClass,
> >                           VectorMask<Float> m) {
> >    m.check(maskClass, this);
> >    if (op == FIRST_NONZERO) {
> >        // FIXME:  The JIT should handle this.
> >        FloatVector v = broadcast((float) 0).blend(this, m);
> >        return v.reduceLanesTemplate(op);
> >    }
> >    int opc = opCode(op);
> >    return fromBits(VectorSupport.reductionCoerced(
> >        opc, getClass(), maskClass, float.class, length(),
> >        this, m,
> >        REDUCE_IMPL.find(op, opc, FloatVector::reductionOperations)));
> > }
> >
> > On Mon, Sep 18, 2023 at 11:11 AM Andrii Lomakin
> > <lomakin.andrey at gmail.com> wrote:
> >>
> >> Hi,
> >> I have the same problem when calculating Euclidean distance in my project.
> >> I'm writing just to confirm that this is not an isolated case; I got the same result when profiling.
> >>
> >> On Sat, Sep 16, 2023 at 9:50 PM Jake Luciani <jake at apache.org> wrote:
> >>>
> >>> Hi,
> >>>
> >>> I've been struggling with a problem recently using the Vector API.
> >>> It appears that reduceLanes is not using the intrinsic.
> >>>
> >>>          ns  percent  samples  top
> >>>  ----------  -------  -------  ---
> >>> 13240151836   88.21%     1324  jdk.incubator.vector.FloatVector.reduceLanesTemplate
> >>>  1349991099    8.99%      135  jdk.incubator.vector.FloatVector.lanewiseTemplate
> >>>
> >>> I've tested OpenJDK 20 and 21, and my machine has AVX-512.
> >>>
> >>> When I PrintIntrinsics I see the following (among others):
> >>>
> >>>  ** missing constant: opr=RShiftI vclass=ConP etype=ConP vlen=ConI
> >>>
> >>> I've included a JMH benchmark that reproduces the issue.
> >>>
> >>> -Jake
> >>>
> >>> import jdk.incubator.vector.FloatVector;
> >>> import jdk.incubator.vector.IntVector;
> >>> import jdk.incubator.vector.ShortVector;
> >>> import jdk.incubator.vector.VectorOperators;
> >>> import org.openjdk.jmh.annotations.*;
> >>> import org.openjdk.jmh.infra.Blackhole;
> >>>
> >>> import java.util.concurrent.ThreadLocalRandom;
> >>> import java.util.concurrent.TimeUnit;
> >>>
> >>>
> >>> @Warmup(iterations = 1, time = 5)
> >>> @Measurement(iterations = 3, time = 5)
> >>> @Fork(warmups = 1, value = 1, jvmArgsPrepend = {
> >>>        "--add-modules=jdk.incubator.vector",
> >>>        "--enable-preview"})
> >>> public class VectorPerfBench
> >>> {
> >>>    private static final int SIZE = 8192;
> >>>    private static final IntVector BF16_BYTE_SHIFT =
> >>>            IntVector.broadcast(IntVector.SPECIES_512, 16);
> >>>
> >>>    public static short float32ToBFloat16(float f) {
> >>>        return (short) (Float.floatToIntBits(f) >> 16);
> >>>    }
> >>>    @State(Scope.Benchmark)
> >>>    public static class Parameters {
> >>>        final short[] s1 = new short[SIZE];
> >>>        final short[] s2 = new short[SIZE];
> >>>
> >>>        public Parameters() {
> >>>            for (int i = 0; i < SIZE; i++) {
> >>>                s1[i] = float32ToBFloat16(ThreadLocalRandom.current().nextFloat());
> >>>                s2[i] = float32ToBFloat16(ThreadLocalRandom.current().nextFloat());
> >>>            }
> >>>        }
> >>>    }
> >>>
> >>>    @Benchmark
> >>>    @OutputTimeUnit(TimeUnit.MILLISECONDS)
> >>>    @BenchmarkMode(Mode.Throughput)
> >>>    public void bfloatDot(Parameters p, Blackhole bh) {
> >>>        FloatVector acc = FloatVector.zero(FloatVector.SPECIES_512);
> >>>        for (int i = 0; i < SIZE; i += FloatVector.SPECIES_512.length()) {
> >>>
> >>>            var f1 = ShortVector.fromArray(ShortVector.SPECIES_256, p.s1, i)
> >>>                    .convertShape(VectorOperators.ZERO_EXTEND_S2I,
> >>>                            IntVector.SPECIES_512, 0)
> >>>                    .lanewise(VectorOperators.LSHL, BF16_BYTE_SHIFT)
> >>>                    .reinterpretAsFloats();
> >>>
> >>>            var f2 = ShortVector.fromArray(ShortVector.SPECIES_256, p.s2, i)
> >>>                    .convertShape(VectorOperators.ZERO_EXTEND_S2I,
> >>>                            IntVector.SPECIES_512, 0)
> >>>                    .lanewise(VectorOperators.LSHL, BF16_BYTE_SHIFT)
> >>>                    .reinterpretAsFloats();
> >>>
> >>>            acc = acc.add(f1.mul(f2));
> >>>        }
> >>>
> >>>        bh.consume(acc.reduceLanes(VectorOperators.ADD));
> >>>    }
> >>>
> >>>    public static void main(String[] args) throws Exception {
> >>>        org.openjdk.jmh.Main.main(args);
> >>>    }
> >>> }
> >>
> >>
> >>
> >> --
> >> Best regards,
> >> Andrii Lomakin.
> >>
>
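
For anyone who wants to sanity-check the results of the vectorized bfloatDot benchmarks in this thread, here is a minimal scalar sketch of the same computation. It is not part of the original messages. The class name BFloat16DotReference and the helper bfloat16ToFloat32 are hypothetical names chosen for illustration; float32ToBFloat16 mirrors the truncating conversion from the benchmark. The sketch assumes that zero-extending each short to an int, shifting left by 16 and reinterpreting the bits as a float is all that the vector pipeline does per lane.

```java
class BFloat16DotReference {

    // Truncating float32 -> bfloat16 conversion, same as in the benchmark:
    // keep the upper 16 bits of the IEEE 754 bit pattern.
    public static short float32ToBFloat16(float f) {
        return (short) (Float.floatToIntBits(f) >> 16);
    }

    // Hypothetical helper: scalar counterpart of ZERO_EXTEND_S2I,
    // LSHL by 16 and reinterpretAsFloats on a single lane.
    public static float bfloat16ToFloat32(short s) {
        return Float.intBitsToFloat((s & 0xFFFF) << 16);
    }

    // Scalar dot product over two bfloat16 arrays of equal length.
    public static float dot(short[] a, short[] b) {
        float acc = 0f;
        for (int i = 0; i < a.length; i++) {
            acc += bfloat16ToFloat32(a[i]) * bfloat16ToFloat32(b[i]);
        }
        return acc;
    }

    public static void main(String[] args) {
        // All inputs are exactly representable in bfloat16.
        short[] a = {
            float32ToBFloat16(1.0f), float32ToBFloat16(2.0f), float32ToBFloat16(-0.5f)
        };
        short[] b = {
            float32ToBFloat16(4.0f), float32ToBFloat16(0.25f), float32ToBFloat16(8.0f)
        };
        System.out.println(dot(a, b)); // prints 0.5
    }
}
```

The vectorized versions accumulate per lane and reduce at the end, so they add the products in a different order than this loop. On random inputs the two results should therefore be compared with a small tolerance rather than for exact equality.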

