Vector performance issue.
Paul Sandoz
paul.sandoz at oracle.com
Mon Sep 18 22:36:38 UTC 2023
I don’t think that branch is the issue: as long as the method inlines, the comparison with the constant (FIRST_NONZERO) will get folded.
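Roughly, a sketch of what I mean (not the actual JDK source, and reusing the acc vector from your benchmark): at the call site the operator is a compile-time constant, so after inlining the check in the template compares two constants and the branch disappears.

// Sketch only: op is the constant ADD at the benchmark call site.
float r = acc.reduceLanes(VectorOperators.ADD);
// Inside the inlined reduceLanesTemplate the check becomes
//     if (ADD == FIRST_NONZERO) { ... }
// which folds to `if (false)`, so the blend/broadcast path is dead code.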
Can you share the full output when running with the following HotSpot options?
-XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining -XX:+PrintCompilation
(If possible place in GitHub gist.)
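If it helps, one way to pass those flags through JMH (just a sketch; the class name and include pattern below are examples, adjust to your setup) is to append them to the forked JVM's arguments:

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class DiagRunner {
    public static void main(String[] args) throws Exception {
        // Forward the diagnostic flags to the forked benchmark JVM.
        Options opts = new OptionsBuilder()
                .include("VectorPerfBench.bfloatDot")   // example include pattern
                .jvmArgsAppend("-XX:-TieredCompilation",
                               "-XX:+UnlockDiagnosticVMOptions",
                               "-XX:+PrintInlining",
                               "-XX:+PrintCompilation")
                .build();
        new Runner(opts).run();
    }
}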
I don’t have an AVX-512 machine, so I tried with the JDK tip and AVX2 (I don’t think anything substantial has changed in the implementation between the tip and the soon-to-be-released JDK 21).
See the snippet below. The generated code is good (I turned off loop unrolling to make the code easier to read: -XX:LoopUnrollLimit=0).
0x000000011a5dd7c3: xor %ebx,%ebx
↗ 0x000000011a5dd7c5: mov $0x2000,%r10d
0.01% │ 0x000000011a5dd7cb: sub %ebx,%r10d
0.01% │ 0x000000011a5dd7ce: cmp $0x1f40,%r10d
0.01% │ 0x000000011a5dd7d5: mov $0x1f40,%edx
│ 0x000000011a5dd7da: cmova %edx,%r10d
│ 0x000000011a5dd7de: add %ebx,%r10d
0.18% ↗│ 0x000000011a5dd7e1: vmovdqu 0x10(%r8,%rbx,2),%xmm2
0.16% ││ 0x000000011a5dd7e8: vmovdqu 0x10(%rcx,%rbx,2),%xmm3
15.52% ││ 0x000000011a5dd7ee: vpmovzxwd %xmm2,%ymm2
9.08% ││ 0x000000011a5dd7f3: vpsllvd %ymm0,%ymm2,%ymm2
0.22% ││ 0x000000011a5dd7f8: vpmovzxwd %xmm3,%ymm3
0.09% ││ 0x000000011a5dd7fd: vpsllvd %ymm0,%ymm3,%ymm3
15.96% ││ 0x000000011a5dd802: vmulps %ymm3,%ymm2,%ymm2
8.70% ││ 0x000000011a5dd806: vaddps %ymm2,%ymm1,%ymm1
30.64% ││ 0x000000011a5dd80a: add $0x8,%ebx
18.07% ││ 0x000000011a5dd80d: cmp %r10d,%ebx
╰│ 0x000000011a5dd810: jl 0x000000011a5dd7e1
│ 0x000000011a5dd812: mov 0x460(%r15),%rdx
│ 0x000000011a5dd819: test %eax,(%rdx)
0.02% │ 0x000000011a5dd81b: nopl 0x0(%rax,%rax,1)
0.01% │ 0x000000011a5dd820: cmp $0x2000,%ebx
╰ 0x000000011a5dd826: jl 0x000000011a5dd7c5
0x000000011a5dd828: mov %rdi,0x78(%rsp)
0x000000011a5dd82d: mov %r9,0x60(%rsp)
0x000000011a5dd832: vxorps %xmm0,%xmm0,%xmm0
0.02% 0x000000011a5dd836: vaddss %xmm1,%xmm0,%xmm0
0.05% 0x000000011a5dd83a: vpshufd $0x1,%xmm1,%xmm3
0.01% 0x000000011a5dd83f: vaddss %xmm3,%xmm0,%xmm0
0.08% 0x000000011a5dd843: vpshufd $0x2,%xmm1,%xmm3
0.07% 0x000000011a5dd848: vaddss %xmm3,%xmm0,%xmm0
0.04% 0x000000011a5dd84c: vpshufd $0x3,%xmm1,%xmm3
0.05% 0x000000011a5dd851: vaddss %xmm3,%xmm0,%xmm0
0.06% 0x000000011a5dd855: vextractf128 $0x1,%ymm1,%xmm3
0.02% 0x000000011a5dd85b: vaddss %xmm3,%xmm0,%xmm0
0.04% 0x000000011a5dd85f: vpshufd $0x1,%xmm3,%xmm2
0.02% 0x000000011a5dd864: vaddss %xmm2,%xmm0,%xmm0
0.03% 0x000000011a5dd868: vpshufd $0x2,%xmm3,%xmm2
0.04% 0x000000011a5dd86d: vaddss %xmm2,%xmm0,%xmm0
0.06% 0x000000011a5dd871: vpshufd $0x3,%xmm3,%xmm2
0.01% 0x000000011a5dd876: vaddss %xmm2,%xmm0,%xmm0
Paul.
@Benchmark
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@BenchmarkMode(Mode.Throughput)
public float bfloatDot(Parameters p) {
    FloatVector acc = FloatVector.zero(FloatVector.SPECIES_256);
    for (int i = 0; i < SIZE; i += FloatVector.SPECIES_256.length()) {
        var f1 = ShortVector.fromArray(ShortVector.SPECIES_128, p.s1, i)
                .convertShape(VectorOperators.ZERO_EXTEND_S2I,
                        IntVector.SPECIES_256, 0)
                .lanewise(VectorOperators.LSHL, BF16_BYTE_SHIFT)
                .reinterpretAsFloats();
        var f2 = ShortVector.fromArray(ShortVector.SPECIES_128, p.s2, i)
                .convertShape(VectorOperators.ZERO_EXTEND_S2I,
                        IntVector.SPECIES_256, 0)
                .lanewise(VectorOperators.LSHL, BF16_BYTE_SHIFT)
                .reinterpretAsFloats();
        acc = acc.add(f1.mul(f2));
    }
    return acc.reduceLanes(VectorOperators.ADD);
}
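(The snippet above assumes SIZE and a BF16_BYTE_SHIFT broadcast over the matching 256-bit integer species; presumably something like the following, reusing the SIZE = 8192 and 16-bit shift from the original benchmark.)

private static final int SIZE = 8192;
// Assumed declaration: same shift amount as the original, re-broadcast over SPECIES_256.
private static final IntVector BF16_BYTE_SHIFT =
        IntVector.broadcast(IntVector.SPECIES_256, 16);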
> On Sep 18, 2023, at 8:45 AM, Jake Luciani <jake at apache.org> wrote:
>
> Looking at the code, I wonder if it's this extra branch?
>
>     @ForceInline
>     final
>     float reduceLanesTemplate(VectorOperators.Associative op,
>                               Class<? extends VectorMask<Float>> maskClass,
>                               VectorMask<Float> m) {
>         m.check(maskClass, this);
>         if (op == FIRST_NONZERO) {
>             // FIXME: The JIT should handle this.
>             FloatVector v = broadcast((float) 0).blend(this, m);
>             return v.reduceLanesTemplate(op);
>         }
>         int opc = opCode(op);
>         return fromBits(VectorSupport.reductionCoerced(
>             opc, getClass(), maskClass, float.class, length(),
>             this, m,
>             REDUCE_IMPL.find(op, opc, FloatVector::reductionOperations)));
>     }
>
> On Mon, Sep 18, 2023 at 11:11 AM Andrii Lomakin
> <lomakin.andrey at gmail.com> wrote:
>>
>> Hi,
>> I have the same problem when calculating Euclidean distance in my project.
>> Writing just to confirm that this is not an isolated case; I got the same result during profiling.
>>
>> On Sat, Sep 16, 2023 at 9:50 PM Jake Luciani <jake at apache.org> wrote:
>>>
>>> Hi,
>>>
>>> I've been struggling with a problem recently using the Vector API.
>>> It appears that reduceLanes is not using the intrinsic.
>>>
>>>           ns  percent  samples  top
>>>   ----------  -------  -------  ---
>>>  13240151836   88.21%     1324  jdk.incubator.vector.FloatVector.reduceLanesTemplate
>>>   1349991099    8.99%      135  jdk.incubator.vector.FloatVector.lanewiseTemplate
>>>
>>> I've tested OpenJDK 20 and 21, and my machine has AVX-512.
>>>
>>> When I run with PrintIntrinsics enabled, I see the following (among others):
>>>
>>> ** missing constant: opr=RShiftI vclass=ConP etype=ConP vlen=ConI
>>>
>>> I've included a JMH benchmark that reproduces the issue.
>>>
>>> -Jake
>>>
>>> import jdk.incubator.vector.FloatVector;
>>> import jdk.incubator.vector.IntVector;
>>> import jdk.incubator.vector.ShortVector;
>>> import jdk.incubator.vector.VectorOperators;
>>> import org.openjdk.jmh.annotations.*;
>>> import org.openjdk.jmh.infra.Blackhole;
>>>
>>> import java.util.concurrent.ThreadLocalRandom;
>>> import java.util.concurrent.TimeUnit;
>>>
>>>
>>> @Warmup(iterations = 1, time = 5)
>>> @Measurement(iterations = 3, time = 5)
>>> @Fork(warmups = 1, value = 1, jvmArgsPrepend = {
>>>         "--add-modules=jdk.incubator.vector",
>>>         "--enable-preview"})
>>> public class VectorPerfBench
>>> {
>>>     private static final int SIZE = 8192;
>>>     private static final IntVector BF16_BYTE_SHIFT =
>>>             IntVector.broadcast(IntVector.SPECIES_512, 16);
>>>
>>>     public static short float32ToBFloat16(float f) {
>>>         return (short) (Float.floatToIntBits(f) >> 16);
>>>     }
>>>     @State(Scope.Benchmark)
>>>     public static class Parameters {
>>>         final short[] s1 = new short[SIZE];
>>>         final short[] s2 = new short[SIZE];
>>>
>>>         public Parameters() {
>>>             for (int i = 0; i < SIZE; i++) {
>>>                 s1[i] = float32ToBFloat16(ThreadLocalRandom.current().nextFloat());
>>>                 s2[i] = float32ToBFloat16(ThreadLocalRandom.current().nextFloat());
>>>             }
>>>         }
>>>     }
>>>
>>>     @Benchmark
>>>     @OutputTimeUnit(TimeUnit.MILLISECONDS)
>>>     @BenchmarkMode(Mode.Throughput)
>>>     public void bfloatDot(Parameters p, Blackhole bh) {
>>>         FloatVector acc = FloatVector.zero(FloatVector.SPECIES_512);
>>>         for (int i = 0; i < SIZE; i += FloatVector.SPECIES_512.length()) {
>>>
>>>             var f1 = ShortVector.fromArray(ShortVector.SPECIES_256, p.s1, i)
>>>                     .convertShape(VectorOperators.ZERO_EXTEND_S2I,
>>>                             IntVector.SPECIES_512, 0)
>>>                     .lanewise(VectorOperators.LSHL, BF16_BYTE_SHIFT)
>>>                     .reinterpretAsFloats();
>>>
>>>             var f2 = ShortVector.fromArray(ShortVector.SPECIES_256, p.s2, i)
>>>                     .convertShape(VectorOperators.ZERO_EXTEND_S2I,
>>>                             IntVector.SPECIES_512, 0)
>>>                     .lanewise(VectorOperators.LSHL, BF16_BYTE_SHIFT)
>>>                     .reinterpretAsFloats();
>>>
>>>             acc = acc.add(f1.mul(f2));
>>>         }
>>>
>>>         bh.consume(acc.reduceLanes(VectorOperators.ADD));
>>>     }
>>>
>>>     public static void main(String[] args) throws Exception {
>>>         org.openjdk.jmh.Main.main(args);
>>>     }
>>> }
>>
>>
>>
>> --
>> Best regards,
>> Andrii Lomakin.
>>