Vector Intrinsics & Boxing Problems?

Tue Dec 22 23:19:07 UTC 2020

Thank you Vladimir,

I will try again with the latest JDK 16.
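
For anyone reading along, here is a minimal scalar sketch of what the
sliceTemplate fallback quoted in [3] below computes, using plain byte
arrays instead of the Vector API. The class name SliceModel and its
slice() method are hypothetical names of my own, not part of
jdk.incubator.vector; the body only mirrors the two arraycopy calls
from the quoted source. It also makes the cost visible: every call
allocates a fresh result array.

```java
import java.util.Arrays;

// Illustrative scalar model only. SliceModel and slice() are hypothetical
// names, not part of jdk.incubator.vector.
final class SliceModel {

    // Mirrors the two System.arraycopy calls in ByteVector::sliceTemplate:
    // the result holds lanes [origin, vlen) of a0 followed by lanes [0, origin) of a1.
    static byte[] slice(byte[] a0, int origin, byte[] a1) {
        if (a0.length != a1.length) {
            throw new IllegalArgumentException("vector lengths differ");
        }
        if (origin < 0 || origin > a0.length) {
            throw new IndexOutOfBoundsException("origin out of range: " + origin);
        }
        int vlen = a0.length;
        byte[] res = new byte[vlen];          // fresh allocation on every call
        int firstPart = vlen - origin;
        System.arraycopy(a0, origin, res, 0, firstPart);
        System.arraycopy(a1, 0, res, firstPart, origin);
        return res;
    }

    public static void main(String[] args) {
        byte[] a0 = {0, 1, 2, 3};
        byte[] a1 = {10, 11, 12, 13};
        // Last two lanes of a0, then the first two lanes of a1.
        System.out.println(Arrays.toString(slice(a0, 2, a1))); // prints [2, 3, 10, 11]
    }
}
```

With a 16-lane vector and origin = species.length() - 2, as in my mvp3
loop, the result is the last 2 lanes of x followed by the first 14
lanes of input.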

On Tue, Dec 22, 2020, 3:03 PM Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
wrote:

> Hi August,
>
> Thanks a lot for the benchmarks.
>
> I gave it a try with the latest JDK 16 and observed the following numbers:
>
> mvp1                12145580.543 ±  462197.858   ops/s
> mvp1:·gc.alloc.rate       ≈ 10⁻⁴                MB/sec
>
> mvp2                11813978.510 ± 1063694.922   ops/s
> mvp2:·gc.alloc.rate       ≈ 10⁻⁴                MB/sec
>
> mvp3                  171990.456 ±   41828.282   ops/s
> mvp3:·gc.alloc.rate      566.613 ±     138.728  MB/sec
>
> The numbers for mvp1 and mvp2 are comparable, and neither benchmark
> suffers from boxing. Additionally, I took a look at the inlining log and
> all the operations are inlined/intrinsified nicely.
>
> Regarding the mvp1 vs mvp2 difference you see, I believe it is already
> fixed in the mainline by JDK-8257165 [1] and JDK-8257057 [2], but the fix
> hasn't been merged into the panama/vectorIntrinsics branch yet.
>
> Regarding mvp3, unfortunately, the Vector::slice(int origin, Vector<E> v1)
> overload is not intrinsified yet, and the call eventually ends up in
> ByteVector::sliceTemplate() [3], which performs a naive copy between the
> arrays backing the vectors. So it additionally suffers from box allocation
> overhead. Hopefully, it'll be fixed soon.
>
> Best regards,
> Vladimir Ivanov
>
> [1] https://bugs.openjdk.java.net/browse/JDK-8257165
> [2] https://bugs.openjdk.java.net/browse/JDK-8257057
>
> [3]
> src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java:
>
>      /*package-private*/
>      final
>      @ForceInline
>      ByteVector sliceTemplate(int origin, Vector<Byte> v1) {
>          ByteVector that = (ByteVector) v1;
>          that.check(this);
>          byte[] a0 = this.vec();
>          byte[] a1 = that.vec();
>          byte[] res = new byte[a0.length];
>          int vlen = res.length;
>          int firstPart = vlen - origin;
>          System.arraycopy(a0, origin, res, 0, firstPart);
>          System.arraycopy(a1, 0, res, firstPart, origin);
>          return vectorFactory(res);
>      }
>
> On 22.12.2020 13:55, August Nagro wrote:
> > Ok, I started benchmarking in JMH piece by piece, and can share two
> > findings so far:
> >
> > 1. A single branch in the loop decimates performance (far more than in C++):
> >
> > 5745.155 Ops/sec:
> > @Benchmark
> > public boolean mvp() {
> >    VectorSpecies<Byte> species = ByteVector.SPECIES_128;
> >
> >    var res = ByteVector.zero(species);
> >    boolean hasNegs = false;
> >
> >    int i = 0;
> >    for (; i < species.loopBound(buf.length); i += species.length()) {
> >      var input = ByteVector.fromArray(species, buf, i);
> >      hasNegs = !input.test(IS_NEGATIVE).anyTrue();
> >      if (hasNegs) {
> >        res = res.lanewise(OR, input);
> >      }
> >    }
> >
> >    return hasNegs && res.test(IS_DEFAULT).allTrue();
> > }
> >
> > Without a branch, 75733.575 Ops/sec:
> >
> > @Benchmark
> > public boolean mvp() {
> >    VectorSpecies<Byte> species = ByteVector.SPECIES_128;
> >
> >    var res = ByteVector.zero(species);
> >    boolean hasNegs = false;
> >
> >    int i = 0;
> >    for (; i < species.loopBound(buf.length); i += species.length()) {
> >      var input = ByteVector.fromArray(species, buf, i);
> >      hasNegs = !input.test(IS_NEGATIVE).anyTrue();
> >      res = res.lanewise(OR, input);
> >    }
> >
> >    return hasNegs && res.test(IS_DEFAULT).allTrue();
> > }
> >
> >
> > 2. slice() is very slow, 2669.239 Ops/sec:
> > @Benchmark
> > public ByteVector mvp() {
> >      VectorSpecies<Byte> species = ByteVector.SPECIES_128;
> >      ByteVector x = ByteVector.zero(species);
> >
> >      int i = 0;
> >      for (; i < species.loopBound(buf.length); i += species.length()) {
> >        var input = ByteVector.fromArray(species, buf, i);
> >        x = x.slice(species.length() - 2, input);
> >      }
> >
> >      return x;
> > }
> >
> >
> > There were intrinsic failures for #1, but I must have been reading the
> > output wrong, because the same `@ <number>` shows up later and is
> > intrinsified.
> >
> > - August
> >
> > On Tue, Dec 22, 2020 at 1:04 AM August Nagro <augustnagro at gmail.com> wrote:
> >>
> >> Hello,
> >>
> >> I've been playing around with the vector api, and am trying to debug
> >> some performance problems.
> >>
> >> I'm on Linux x86, and it seems like some of the ops on ByteVector128
> >> are not being intrinsified. When I printed the assembly with
> >> -XX:+PrintAssembly, I noticed that the ymm registers were not being
> >> used much / were close together. With -XX:+PrintIntrinsics I can see
> >> that there are some failures among the successes:
> >>
> >> @ 245   jdk.internal.vm.vector.VectorSupport::binaryOp (36 bytes)   failed to inline (intrinsic)
> >>
> >> @ 52   jdk.internal.vm.vector.VectorSupport::compare (40 bytes)   failed to inline (intrinsic)
> >>
> >> ** missing constant: vclass=ConP etype=ConP vlen=ConI idx=Parm
> >>                                @ 16   jdk.internal.vm.vector.VectorSupport::extract (35 bytes)   failed to inline (intrinsic)
> >>
> >> ** missing constant: opr=RShiftI vclass=ConP etype=ConP vlen=ConI
> >>                                    @ 106   java.lang.Object::getClass (0 bytes)   (intrinsic)
> >>                                    @ 134   jdk.internal.vm.vector.VectorSupport::broadcastInt (36 bytes)   failed to inline (intrinsic)
> >>
> >>
> >> Here is one piece of hot code that may be a problem:
> >> byteVector128.test(IS_NEGATIVE). Inspecting the source, I see that an
> >> IS_NEGATIVE test is passed to ByteVector::testTemplate, which calls
> >> `bits.compare(LT, 0)`. Eventually it reaches the intrinsified
> >> VectorSupport::compare.
> >>
> >> I have never looked at HotSpot intrinsic code before, so bear with me.
> >> In vmIntrinsics.hpp, _VectorCompare is the name of the template. I
> >> don't understand where to go from here. However, I did notice that
> >> share/prims/vectorSupport.hpp is missing BT_lt and the other
> >> comparison constants defined in VectorSupport.java.
> >>
> >> Am I on the right path here? And finally, is there a way to tell if
> >> Vector boxing is occurring?
> >>
> >> Regards,
> >>
> >> August
>