Vector performance issue.

Tue Sep 19 16:35:18 UTC 2023

It all looks OK to me. The generated code is good. Version 4 of the compiled method might be the one with the loop on-stack replaced, whereas version 5 is not (I cannot correlate the compilation ids between the two outputs). For fun, you could turn off on-stack replacement and see what happens.
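
For anyone following along, here is a minimal sketch (not the benchmark from this thread) of how an on-stack-replaced compilation can be spotted. OsrDemo and hotLoop are names made up for illustration. The method is called only once, so the only way HotSpot can get compiled code running for its loop is to swap out the interpreter frame while the loop is executing. With -XX:+PrintCompilation such compilations are flagged with a '%' in the output; adding -XX:-UseOnStackReplacement should make those entries go away.

```java
public class OsrDemo {
    // Sums 0..n-1. Called only once from main, so the loop itself has to
    // become hot before the JIT takes an interest in it.
    public static long hotLoop(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Run as:  java -XX:+PrintCompilation OsrDemo
        // then again with -XX:-UseOnStackReplacement added, and compare
        // the lines that mention OsrDemo::hotLoop.
        System.out.println(hotLoop(1_000_000_000));
    }
}
```

The same comparison can be made on the benchmark itself by passing the flag to the forked JVM.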

If you run with tiered compilation then it might take longer for HotSpot to settle and generate the optimal code. Perhaps that is what was happening before?
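
To rule that out, it can help to confirm which settings the forked benchmark JVM actually ran with. A small sketch, assuming a HotSpot JVM (JitFlagCheck and flagValue are names made up for illustration); it reads the effective flag values through the HotSpotDiagnosticMXBean:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class JitFlagCheck {
    // Returns the effective value of a HotSpot flag as a string, e.g. "true".
    // Throws IllegalArgumentException if the JVM does not know the flag.
    public static String flagValue(String name) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        return bean.getVMOption(name).getValue();
    }

    public static void main(String[] args) {
        String[] flags = {"TieredCompilation", "UseOnStackReplacement"};
        for (String flag : flags) {
            System.out.println(flag + " = " + flagValue(flag));
        }
    }
}
```

Calling flagValue from a setup method of the benchmark would show the values for the forked JVM rather than for the launcher.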

Paul.

> On Sep 19, 2023, at 7:48 AM, Jake Luciani <jake at apache.org> wrote:
> 
> Thanks Paul.
> 
> I've attached both the compilation and perfasm outputs here:
> https://gist.github.com/tjake/4b4284e8d697e4a151b0e0877ab04fef
> 
> Jake
> 
> On Mon, Sep 18, 2023 at 6:36 PM Paul Sandoz <paul.sandoz at oracle.com> wrote:
>> 
>> I don’t think that branch is the issue: as long as the method is inlined, the comparison with the constant (FIRST_NONZERO) will be folded away.
>> 
>> Can you share the full output when running with the following HotSpot options?
>> 
>> -XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining -XX:+PrintCompilation
>> 
>> (If possible place in GitHub gist.)
>> 
>> 
>> 
>> I don’t have an AVX-512 machine, so I tried with the JDK tip and AVX2 (I don’t think anything substantial has changed in the implementation between the tip and the soon-to-be-released JDK 21).
>> 
>> See the snippet below. The generated code is good (I turned off loop unrolling to make the code easier to read: -XX:LoopUnrollLimit=0).
>> 
>>             0x000000011a5dd7c3:   xor    %ebx,%ebx
>>          ↗  0x000000011a5dd7c5:   mov    $0x2000,%r10d
>>  0.01%   │  0x000000011a5dd7cb:   sub    %ebx,%r10d
>>  0.01%   │  0x000000011a5dd7ce:   cmp    $0x1f40,%r10d
>>  0.01%   │  0x000000011a5dd7d5:   mov    $0x1f40,%edx
>>          │  0x000000011a5dd7da:   cmova  %edx,%r10d
>>          │  0x000000011a5dd7de:   add    %ebx,%r10d
>>  0.18%  ↗│  0x000000011a5dd7e1:   vmovdqu 0x10(%r8,%rbx,2),%xmm2
>>  0.16%  ││  0x000000011a5dd7e8:   vmovdqu 0x10(%rcx,%rbx,2),%xmm3
>> 15.52%  ││  0x000000011a5dd7ee:   vpmovzxwd %xmm2,%ymm2
>>  9.08%  ││  0x000000011a5dd7f3:   vpsllvd %ymm0,%ymm2,%ymm2
>>  0.22%  ││  0x000000011a5dd7f8:   vpmovzxwd %xmm3,%ymm3
>>  0.09%  ││  0x000000011a5dd7fd:   vpsllvd %ymm0,%ymm3,%ymm3
>> 15.96%  ││  0x000000011a5dd802:   vmulps %ymm3,%ymm2,%ymm2
>>  8.70%  ││  0x000000011a5dd806:   vaddps %ymm2,%ymm1,%ymm1
>> 30.64%  ││  0x000000011a5dd80a:   add    $0x8,%ebx
>> 18.07%  ││  0x000000011a5dd80d:   cmp    %r10d,%ebx
>>         ╰│  0x000000011a5dd810:   jl     0x000000011a5dd7e1
>>          │  0x000000011a5dd812:   mov    0x460(%r15),%rdx
>>          │  0x000000011a5dd819:   test   %eax,(%rdx)
>>  0.02%   │  0x000000011a5dd81b:   nopl   0x0(%rax,%rax,1)
>>  0.01%   │  0x000000011a5dd820:   cmp    $0x2000,%ebx
>>          ╰  0x000000011a5dd826:   jl     0x000000011a5dd7c5
>>             0x000000011a5dd828:   mov    %rdi,0x78(%rsp)
>>             0x000000011a5dd82d:   mov    %r9,0x60(%rsp)
>>             0x000000011a5dd832:   vxorps %xmm0,%xmm0,%xmm0
>>  0.02%      0x000000011a5dd836:   vaddss %xmm1,%xmm0,%xmm0
>>  0.05%      0x000000011a5dd83a:   vpshufd $0x1,%xmm1,%xmm3
>>  0.01%      0x000000011a5dd83f:   vaddss %xmm3,%xmm0,%xmm0
>>  0.08%      0x000000011a5dd843:   vpshufd $0x2,%xmm1,%xmm3
>>  0.07%      0x000000011a5dd848:   vaddss %xmm3,%xmm0,%xmm0
>>  0.04%      0x000000011a5dd84c:   vpshufd $0x3,%xmm1,%xmm3
>>  0.05%      0x000000011a5dd851:   vaddss %xmm3,%xmm0,%xmm0
>>  0.06%      0x000000011a5dd855:   vextractf128 $0x1,%ymm1,%xmm3
>>  0.02%      0x000000011a5dd85b:   vaddss %xmm3,%xmm0,%xmm0
>>  0.04%      0x000000011a5dd85f:   vpshufd $0x1,%xmm3,%xmm2
>>  0.02%      0x000000011a5dd864:   vaddss %xmm2,%xmm0,%xmm0
>>  0.03%      0x000000011a5dd868:   vpshufd $0x2,%xmm3,%xmm2
>>  0.04%      0x000000011a5dd86d:   vaddss %xmm2,%xmm0,%xmm0
>>  0.06%      0x000000011a5dd871:   vpshufd $0x3,%xmm3,%xmm2
>>  0.01%      0x000000011a5dd876:   vaddss %xmm2,%xmm0,%xmm0
>> 
>> 
>> 
>> Paul.
>> 
>> 
>>    @Benchmark
>>    @OutputTimeUnit(TimeUnit.MILLISECONDS)
>>    @BenchmarkMode(Mode.Throughput)
>>    public float bfloatDot(Parameters p) {
>>        FloatVector acc = FloatVector.zero(FloatVector.SPECIES_256);
>>        for (int i = 0; i < SIZE; i += FloatVector.SPECIES_256.length()) {
>>            var f1 = ShortVector.fromArray(ShortVector.SPECIES_128, p.s1, i)
>>                    .convertShape(VectorOperators.ZERO_EXTEND_S2I,
>>                            IntVector.SPECIES_256, 0)
>>                    .lanewise(VectorOperators.LSHL, BF16_BYTE_SHIFT)
>>                    .reinterpretAsFloats();
>> 
>>            var f2 = ShortVector.fromArray(ShortVector.SPECIES_128, p.s2, i)
>>                    .convertShape(VectorOperators.ZERO_EXTEND_S2I,
>>                            IntVector.SPECIES_256, 0)
>>                    .lanewise(VectorOperators.LSHL, BF16_BYTE_SHIFT)
>>                    .reinterpretAsFloats();
>> 
>>            acc = acc.add(f1.mul(f2));
>>        }
>> 
>>        return acc.reduceLanes(VectorOperators.ADD);
>>    }
>> 
>> 
>>> On Sep 18, 2023, at 8:45 AM, Jake Luciani <jake at apache.org> wrote:
>>> 
>>> Looking at the code I wonder if it's this extra branch?
>>> 
>>> @ForceInline
>>> final
>>> float reduceLanesTemplate(VectorOperators.Associative op,
>>>                          Class<? extends VectorMask<Float>> maskClass,
>>>                          VectorMask<Float> m) {
>>>   m.check(maskClass, this);
>>>   if (op == FIRST_NONZERO) {
>>>       // FIXME:  The JIT should handle this.
>>>       FloatVector v = broadcast((float) 0).blend(this, m);
>>>       return v.reduceLanesTemplate(op);
>>>   }
>>>   int opc = opCode(op);
>>>   return fromBits(VectorSupport.reductionCoerced(
>>>       opc, getClass(), maskClass, float.class, length(),
>>>       this, m,
>>>       REDUCE_IMPL.find(op, opc, FloatVector::reductionOperations)));
>>> }
>>> 
>>> On Mon, Sep 18, 2023 at 11:11 AM Andrii Lomakin
>>> <lomakin.andrey at gmail.com> wrote:
>>>> 
>>>> Hi,
>>>> I have the same problem when calculating Euclidean distance in my project.
>>>> I am writing just to confirm that this is not an isolated case; I got the same result when profiling.
>>>> 
>>>> On Sat, Sep 16, 2023 at 9:50 PM Jake Luciani <jake at apache.org> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> I've been struggling with a problem recently using the Vector API.
>>>>> It appears that reduceLanes is not using the intrinsic.
>>>>> 
>>>>>          ns  percent  samples  top
>>>>> -----------  -------  -------  ---
>>>>> 13240151836   88.21%     1324  jdk.incubator.vector.FloatVector.reduceLanesTemplate
>>>>>  1349991099    8.99%      135  jdk.incubator.vector.FloatVector.lanewiseTemplate
>>>>> 
>>>>> I've tested OpenJDK 20 and 21, and my machine has AVX-512.
>>>>> 
>>>>> When I enable PrintIntrinsics I see the following (among others):
>>>>> 
>>>>> ** missing constant: opr=RShiftI vclass=ConP etype=ConP vlen=ConI
>>>>> 
>>>>> I've included a JMH benchmark that reproduces the issue.
>>>>> 
>>>>> -Jake
>>>>> 
>>>>> import jdk.incubator.vector.FloatVector;
>>>>> import jdk.incubator.vector.IntVector;
>>>>> import jdk.incubator.vector.ShortVector;
>>>>> import jdk.incubator.vector.VectorOperators;
>>>>> import org.openjdk.jmh.annotations.*;
>>>>> import org.openjdk.jmh.infra.Blackhole;
>>>>> 
>>>>> import java.util.concurrent.ThreadLocalRandom;
>>>>> import java.util.concurrent.TimeUnit;
>>>>> 
>>>>> 
>>>>> @Warmup(iterations = 1, time = 5)
>>>>> @Measurement(iterations = 3, time = 5)
>>>>> @Fork(warmups = 1, value = 1, jvmArgsPrepend = {
>>>>>       "--add-modules=jdk.incubator.vector",
>>>>>       "--enable-preview"})
>>>>> public class VectorPerfBench
>>>>> {
>>>>>   private static final int SIZE = 8192;
>>>>>   private static final IntVector BF16_BYTE_SHIFT =
>>>>>           IntVector.broadcast(IntVector.SPECIES_512, 16);
>>>>> 
>>>>>   public static short float32ToBFloat16(float f) {
>>>>>       return (short) (Float.floatToIntBits(f) >> 16);
>>>>>   }
>>>>>   @State(Scope.Benchmark)
>>>>>   public static class Parameters {
>>>>>       final short[] s1 = new short[SIZE];
>>>>>       final short[] s2 = new short[SIZE];
>>>>> 
>>>>>       public Parameters() {
>>>>>           for (int i = 0; i < SIZE; i++) {
>>>>>               s1[i] = float32ToBFloat16(ThreadLocalRandom.current().nextFloat());
>>>>>               s2[i] = float32ToBFloat16(ThreadLocalRandom.current().nextFloat());
>>>>>           }
>>>>>       }
>>>>>   }
>>>>> 
>>>>>   @Benchmark
>>>>>   @OutputTimeUnit(TimeUnit.MILLISECONDS)
>>>>>   @BenchmarkMode(Mode.Throughput)
>>>>>   public void bfloatDot(Parameters p, Blackhole bh) {
>>>>>       FloatVector acc = FloatVector.zero(FloatVector.SPECIES_512);
>>>>>       for (int i = 0; i < SIZE; i += FloatVector.SPECIES_512.length()) {
>>>>> 
>>>>>           var f1 = ShortVector.fromArray(ShortVector.SPECIES_256, p.s1, i)
>>>>>                   .convertShape(VectorOperators.ZERO_EXTEND_S2I,
>>>>>                           IntVector.SPECIES_512, 0)
>>>>>                   .lanewise(VectorOperators.LSHL, BF16_BYTE_SHIFT)
>>>>>                   .reinterpretAsFloats();
>>>>> 
>>>>>           var f2 = ShortVector.fromArray(ShortVector.SPECIES_256, p.s2, i)
>>>>>                   .convertShape(VectorOperators.ZERO_EXTEND_S2I,
>>>>>                           IntVector.SPECIES_512, 0)
>>>>>                   .lanewise(VectorOperators.LSHL, BF16_BYTE_SHIFT)
>>>>>                   .reinterpretAsFloats();
>>>>> 
>>>>>           acc = acc.add(f1.mul(f2));
>>>>>       }
>>>>> 
>>>>>       bh.consume(acc.reduceLanes(VectorOperators.ADD));
>>>>>   }
>>>>> 
>>>>>   public static void main(String[] args) throws Exception {
>>>>>       org.openjdk.jmh.Main.main(args);
>>>>>   }
>>>>> }
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Best regards,
>>>> Andrii Lomakin.
>>>> 
>>