Vector performance issue.
Jake Luciani
jake at apache.org
Tue Sep 19 14:48:15 UTC 2023
Thanks Paul.
I've put both the compilation and perfasm outputs in this gist:
https://gist.github.com/tjake/4b4284e8d697e4a151b0e0877ab04fef
Jake
On Mon, Sep 18, 2023 at 6:36 PM Paul Sandoz <paul.sandoz at oracle.com> wrote:
>
> I don’t think that branch is the issue: as long as the method is inlined, the comparison with the constant (FIRST_NONZERO) will get folded.
>
> Can you share the full output when running with the following HotSpot options?
>
> -XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining -XX:+PrintCompilation
>
> (If possible place in GitHub gist.)
>
>
>
> I don’t have an AVX-512 machine, so I tried with the JDK tip and AVX2 (I don’t think anything substantial has changed in the implementation between the tip and the soon-to-be-released JDK 21).
>
> See the snippet below. The generated code is good (I turned off loop unrolling to make the code easier to read: -XX:LoopUnrollLimit=0).
>
> 0x000000011a5dd7c3: xor %ebx,%ebx
> ↗ 0x000000011a5dd7c5: mov $0x2000,%r10d
> 0.01% │ 0x000000011a5dd7cb: sub %ebx,%r10d
> 0.01% │ 0x000000011a5dd7ce: cmp $0x1f40,%r10d
> 0.01% │ 0x000000011a5dd7d5: mov $0x1f40,%edx
> │ 0x000000011a5dd7da: cmova %edx,%r10d
> │ 0x000000011a5dd7de: add %ebx,%r10d
> 0.18% ↗│ 0x000000011a5dd7e1: vmovdqu 0x10(%r8,%rbx,2),%xmm2
> 0.16% ││ 0x000000011a5dd7e8: vmovdqu 0x10(%rcx,%rbx,2),%xmm3
> 15.52% ││ 0x000000011a5dd7ee: vpmovzxwd %xmm2,%ymm2
> 9.08% ││ 0x000000011a5dd7f3: vpsllvd %ymm0,%ymm2,%ymm2
> 0.22% ││ 0x000000011a5dd7f8: vpmovzxwd %xmm3,%ymm3
> 0.09% ││ 0x000000011a5dd7fd: vpsllvd %ymm0,%ymm3,%ymm3
> 15.96% ││ 0x000000011a5dd802: vmulps %ymm3,%ymm2,%ymm2
> 8.70% ││ 0x000000011a5dd806: vaddps %ymm2,%ymm1,%ymm1
> 30.64% ││ 0x000000011a5dd80a: add $0x8,%ebx
> 18.07% ││ 0x000000011a5dd80d: cmp %r10d,%ebx
> ╰│ 0x000000011a5dd810: jl 0x000000011a5dd7e1
> │ 0x000000011a5dd812: mov 0x460(%r15),%rdx
> │ 0x000000011a5dd819: test %eax,(%rdx)
> 0.02% │ 0x000000011a5dd81b: nopl 0x0(%rax,%rax,1)
> 0.01% │ 0x000000011a5dd820: cmp $0x2000,%ebx
> ╰ 0x000000011a5dd826: jl 0x000000011a5dd7c5
> 0x000000011a5dd828: mov %rdi,0x78(%rsp)
> 0x000000011a5dd82d: mov %r9,0x60(%rsp)
> 0x000000011a5dd832: vxorps %xmm0,%xmm0,%xmm0
> 0.02% 0x000000011a5dd836: vaddss %xmm1,%xmm0,%xmm0
> 0.05% 0x000000011a5dd83a: vpshufd $0x1,%xmm1,%xmm3
> 0.01% 0x000000011a5dd83f: vaddss %xmm3,%xmm0,%xmm0
> 0.08% 0x000000011a5dd843: vpshufd $0x2,%xmm1,%xmm3
> 0.07% 0x000000011a5dd848: vaddss %xmm3,%xmm0,%xmm0
> 0.04% 0x000000011a5dd84c: vpshufd $0x3,%xmm1,%xmm3
> 0.05% 0x000000011a5dd851: vaddss %xmm3,%xmm0,%xmm0
> 0.06% 0x000000011a5dd855: vextractf128 $0x1,%ymm1,%xmm3
> 0.02% 0x000000011a5dd85b: vaddss %xmm3,%xmm0,%xmm0
> 0.04% 0x000000011a5dd85f: vpshufd $0x1,%xmm3,%xmm2
> 0.02% 0x000000011a5dd864: vaddss %xmm2,%xmm0,%xmm0
> 0.03% 0x000000011a5dd868: vpshufd $0x2,%xmm3,%xmm2
> 0.04% 0x000000011a5dd86d: vaddss %xmm2,%xmm0,%xmm0
> 0.06% 0x000000011a5dd871: vpshufd $0x3,%xmm3,%xmm2
> 0.01% 0x000000011a5dd876: vaddss %xmm2,%xmm0,%xmm0
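One way to read the loop structure in the listing above: the outer loop caps each inner run at 0x1f40 (8000) elements out of 0x2000 (8192) and polls for a safepoint between runs, which looks like C2's loop strip mining. That reading is an assumption based on the constants in the listing, not something stated in the thread. A minimal Java sketch of the same shape, with hypothetical names, just to show how the iteration counts work out:

```java
// Hypothetical sketch of the loop shape in the listing; not HotSpot code.
final class StripMinedLoopSketch {
    static final int SIZE = 0x2000;   // 8192 elements (mov $0x2000,%r10d)
    static final int LANES = 8;       // elements per inner iteration (add $0x8,%ebx)
    static final int STRIP = 0x1f40;  // at most 8000 elements per strip (cmp $0x1f40,%r10d)

    /** Returns {outer iterations, inner iterations} for the loop nest. */
    static int[] countIterations() {
        int outer = 0;
        int inner = 0;
        int i = 0;
        do {
            // limit = i + min(remaining, STRIP), as computed by sub/cmp/cmova/add
            int limit = i + Math.min(SIZE - i, STRIP);
            do {
                inner++;              // vector body would run here
                i += LANES;
            } while (i < limit);      // jl back to the inner loop head
            outer++;                  // safepoint poll sits here (test %eax,(%rdx))
        } while (i < SIZE);           // jl back to the outer loop head
        return new int[] {outer, inner};
    }

    public static void main(String[] args) {
        int[] counts = countIterations();
        System.out.println("outer=" + counts[0] + " inner=" + counts[1]);
    }
}
```

With these constants the inner body runs 1024 times in total, split across two strips, so the safepoint poll is paid only twice per benchmark call.
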
>
>
>
> Paul.
>
>
> @Benchmark
> @OutputTimeUnit(TimeUnit.MILLISECONDS)
> @BenchmarkMode(Mode.Throughput)
> public float bfloatDot(Parameters p) {
> FloatVector acc = FloatVector.zero(FloatVector.SPECIES_256);
> for (int i = 0; i < SIZE; i += FloatVector.SPECIES_256.length()) {
> var f1 = ShortVector.fromArray(ShortVector.SPECIES_128, p.s1, i)
> .convertShape(VectorOperators.ZERO_EXTEND_S2I,
> IntVector.SPECIES_256, 0)
> .lanewise(VectorOperators.LSHL, BF16_BYTE_SHIFT)
> .reinterpretAsFloats();
>
> var f2 = ShortVector.fromArray(ShortVector.SPECIES_128, p.s2, i)
> .convertShape(VectorOperators.ZERO_EXTEND_S2I,
> IntVector.SPECIES_256, 0)
> .lanewise(VectorOperators.LSHL, BF16_BYTE_SHIFT)
> .reinterpretAsFloats();
>
> acc = acc.add(f1.mul(f2));
> }
>
> return acc.reduceLanes(VectorOperators.ADD);
> }
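When comparing variants of the benchmark above, a plain-Java reference can help confirm that they compute the same value. The sketch below is not from the thread. The class and method names are hypothetical, the array length is assumed to be a multiple of the lane count, and the lane-by-lane final sum mirrors the vaddss sequence in the listing rather than any ordering the Vector API guarantees for floating-point reductions.

```java
// Hypothetical scalar reference for the bfloat16 dot product; no Vector API needed.
final class BFloatDotReference {
    /** Widens a bfloat16 bit pattern to float: zero-extend, then shift left by 16. */
    static float bf16ToFloat(short s) {
        return Float.intBitsToFloat((s & 0xFFFF) << 16);
    }

    /** Straight left-to-right scalar dot product. */
    static float dotScalar(short[] a, short[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += bf16ToFloat(a[i]) * bf16ToFloat(b[i]);
        }
        return sum;
    }

    /**
     * Emulates the vector loop: one partial sum per lane, then the lanes
     * added in order. Assumes a.length is a multiple of lanes.
     */
    static float dotLaneOrdered(short[] a, short[] b, int lanes) {
        float[] acc = new float[lanes];
        for (int i = 0; i < a.length; i += lanes) {
            for (int l = 0; l < lanes; l++) {
                // separate multiply and add, like vmulps followed by vaddps
                acc[l] += bf16ToFloat(a[i + l]) * bf16ToFloat(b[i + l]);
            }
        }
        float sum = 0f;
        for (float v : acc) {
            sum += v;
        }
        return sum;
    }
}
```

Because float addition is not associative, dotScalar and dotLaneOrdered can differ in the last bits on random data, so a comparison against the vector result should probably use a small tolerance rather than exact equality.
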
>
>
> > On Sep 18, 2023, at 8:45 AM, Jake Luciani <jake at apache.org> wrote:
> >
> > Looking at the code I wonder if it's this extra branch?
> >
> > @ForceInline
> > final
> > float reduceLanesTemplate(VectorOperators.Associative op,
> > Class<? extends VectorMask<Float>> maskClass,
> > VectorMask<Float> m) {
> > m.check(maskClass, this);
> > if (op == FIRST_NONZERO) {
> > // FIXME: The JIT should handle this.
> > FloatVector v = broadcast((float) 0).blend(this, m);
> > return v.reduceLanesTemplate(op);
> > }
> > int opc = opCode(op);
> > return fromBits(VectorSupport.reductionCoerced(
> > opc, getClass(), maskClass, float.class, length(),
> > this, m,
> > REDUCE_IMPL.find(op, opc, FloatVector::reductionOperations)));
> > }
> >
> > On Mon, Sep 18, 2023 at 11:11 AM Andrii Lomakin
> > <lomakin.andrey at gmail.com> wrote:
> >>
> >> Hi,
> >> I have the same problem when calculating Euclidean distance in my project.
> >> I'm writing just to confirm that this is not an isolated case: I got the same result when profiling.
> >>
> >> On Sat, Sep 16, 2023 at 9:50 PM Jake Luciani <jake at apache.org> wrote:
> >>>
> >>> Hi,
> >>>
> >>> I've been struggling with a problem recently using the Vector API.
> >>> It appears that reduceLanes is not using the intrinsic.
> >>>
> >>>          ns  percent  samples  top
> >>> -----------  -------  -------  ---
> >>> 13240151836   88.21%     1324  jdk.incubator.vector.FloatVector.reduceLanesTemplate
> >>>  1349991099    8.99%      135  jdk.incubator.vector.FloatVector.lanewiseTemplate
> >>>
> >>> I've tested OpenJDK 20 and 21, and my machine has AVX-512.
> >>>
> >>> When I PrintIntrinsics I see the following (among others):
> >>>
> >>> ** missing constant: opr=RShiftI vclass=ConP etype=ConP vlen=ConI
> >>>
> >>> I've included a JMH benchmark that reproduces the issue.
> >>>
> >>> -Jake
> >>>
> >>> import jdk.incubator.vector.FloatVector;
> >>> import jdk.incubator.vector.IntVector;
> >>> import jdk.incubator.vector.ShortVector;
> >>> import jdk.incubator.vector.VectorOperators;
> >>> import org.openjdk.jmh.annotations.*;
> >>> import org.openjdk.jmh.infra.Blackhole;
> >>>
> >>> import java.util.concurrent.ThreadLocalRandom;
> >>> import java.util.concurrent.TimeUnit;
> >>>
> >>>
> >>> @Warmup(iterations = 1, time = 5)
> >>> @Measurement(iterations = 3, time = 5)
> >>> @Fork(warmups = 1, value = 1, jvmArgsPrepend = {
> >>> "--add-modules=jdk.incubator.vector",
> >>> "--enable-preview"})
> >>> public class VectorPerfBench
> >>> {
> >>> private static final int SIZE = 8192;
> >>> private static final IntVector BF16_BYTE_SHIFT =
> >>> IntVector.broadcast(IntVector.SPECIES_512, 16);
> >>>
> >>> public static short float32ToBFloat16(float f) {
> >>> return (short) (Float.floatToIntBits(f) >> 16);
> >>> }
> >>> @State(Scope.Benchmark)
> >>> public static class Parameters {
> >>> final short[] s1 = new short[SIZE];
> >>> final short[] s2 = new short[SIZE];
> >>>
> >>> public Parameters() {
> >>> for (int i = 0; i < SIZE; i++) {
> >>> s1[i] = float32ToBFloat16(ThreadLocalRandom.current().nextFloat());
> >>> s2[i] = float32ToBFloat16(ThreadLocalRandom.current().nextFloat());
> >>> }
> >>> }
> >>> }
> >>>
> >>> @Benchmark
> >>> @OutputTimeUnit(TimeUnit.MILLISECONDS)
> >>> @BenchmarkMode(Mode.Throughput)
> >>> public void bfloatDot(Parameters p, Blackhole bh) {
> >>> FloatVector acc = FloatVector.zero(FloatVector.SPECIES_512);
> >>> for (int i = 0; i < SIZE; i += FloatVector.SPECIES_512.length()) {
> >>>
> >>> var f1 = ShortVector.fromArray(ShortVector.SPECIES_256, p.s1, i)
> >>> .convertShape(VectorOperators.ZERO_EXTEND_S2I,
> >>> IntVector.SPECIES_512, 0)
> >>> .lanewise(VectorOperators.LSHL, BF16_BYTE_SHIFT)
> >>> .reinterpretAsFloats();
> >>>
> >>> var f2 = ShortVector.fromArray(ShortVector.SPECIES_256, p.s2, i)
> >>> .convertShape(VectorOperators.ZERO_EXTEND_S2I,
> >>> IntVector.SPECIES_512, 0)
> >>> .lanewise(VectorOperators.LSHL, BF16_BYTE_SHIFT)
> >>> .reinterpretAsFloats();
> >>>
> >>> acc = acc.add(f1.mul(f2));
> >>> }
> >>>
> >>> bh.consume(acc.reduceLanes(VectorOperators.ADD));
> >>> }
> >>>
> >>> public static void main(String[] args) throws Exception {
> >>> org.openjdk.jmh.Main.main(args);
> >>> }
> >>> }
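The float32ToBFloat16 helper in the benchmark above keeps only the top 16 bits of the float, so it truncates rather than rounds. A small round-trip sketch makes the precision loss visible. The encoder is copied from the benchmark; the decoder and the class name are hypothetical additions, chosen to match the zero-extend and shift-left-by-16 that the vector code performs.

```java
// Hypothetical round-trip sketch for the benchmark's bfloat16 encoding.
final class BFloat16RoundTrip {
    /** Encoder as written in the benchmark: keep the upper 16 bits (truncation). */
    static short float32ToBFloat16(float f) {
        return (short) (Float.floatToIntBits(f) >> 16);
    }

    /** Assumed inverse: zero-extend the 16 bits and shift them back into place. */
    static float bfloat16ToFloat32(short s) {
        return Float.intBitsToFloat((s & 0xFFFF) << 16);
    }

    public static void main(String[] args) {
        float[] samples = {1.0f, 0.1f, (float) Math.PI, -2.5f};
        for (float x : samples) {
            float back = bfloat16ToFloat32(float32ToBFloat16(x));
            System.out.println(x + " -> " + back);
        }
    }
}
```

Values whose low 16 mantissa bits are zero, such as 1.0f and -2.5f, survive the round trip exactly. Others lose precision: the float nearest to pi comes back as 3.140625. For inputs drawn from nextFloat() the relative error per element stays below 2^-7, since bfloat16 keeps 7 explicit mantissa bits.
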
> >>
> >>
> >>
> >> --
> >> Best regards,
> >> Andrii Lomakin.
> >>
>