Vector Intrinsics & Boxing Problems?
August Nagro
augustnagro at gmail.com
Tue Dec 22 23:19:07 UTC 2020
Thank you Vladimir,
I will try again with latest jdk 16.
On Tue, Dec 22, 2020, 3:03 PM Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
wrote:
> Hi August,
>
> Thanks a lot for the benchmarks.
>
> I gave it a try with latest jdk16 and observed the following numbers:
>
> mvp1 12145580.543 ± 462197.858 ops/s
> mvp1:·gc.alloc.rate ≈ 10⁻⁴ MB/sec
>
> mvp2 11813978.510 ± 1063694.922 ops/s
> mvp2:·gc.alloc.rate ≈ 10⁻⁴ MB/sec
>
> mvp3 171990.456 ± 41828.282 ops/s
> mvp3:·gc.alloc.rate 566.613 ± 138.728 MB/sec
>
> The numbers for mvp1 and mvp2 are comparable and both benchmarks don't
> suffer from boxing. Additionally, I took a look at the inlining log and
> all the operations are inlined/intrinsified nicely.
>
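The gc.alloc.rate rows above come from JMH's GC profiler. As a rough cross-check outside JMH, per-thread allocation can also be sampled directly. The sketch below is only an illustration: it assumes a HotSpot JVM where com.sun.management.ThreadMXBean is available and allocation accounting is enabled, and the names AllocProbe, allocatedBytes and sink are hypothetical, not part of the benchmarks in this thread.

```java
import java.lang.management.ManagementFactory;

final class AllocProbe {
    // Hypothetical sink: publishing each object keeps the JIT from
    // eliminating the allocation through escape analysis.
    static volatile Object sink;

    // Returns the approximate number of bytes the current thread
    // allocated while running `body`.
    static long allocatedBytes(Runnable body) {
        // Assumption: the platform bean is the HotSpot-specific subtype.
        com.sun.management.ThreadMXBean bean =
            (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        long id = Thread.currentThread().getId();
        long before = bean.getThreadAllocatedBytes(id);
        body.run();
        return bean.getThreadAllocatedBytes(id) - before;
    }

    public static void main(String[] args) {
        long bytes = allocatedBytes(() -> {
            for (int i = 0; i < 1_000; i++) {
                sink = new byte[16]; // stand-in for a boxed 128-bit vector payload
            }
        });
        System.out.println("allocated roughly " + bytes + " bytes");
    }
}
```

A kernel whose vectors are fully scalarized should report a figure near zero once warmed up, while a boxing kernel should grow with the iteration count. The measurement includes whatever else the thread allocates, so it is a coarse signal rather than a replacement for the profiler.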
> Regarding the mvp1 vs mvp2 difference you see, I believe it is already fixed
> in the mainline by JDK-8257165 [1] and JDK-8257057 [2], but hasn't been
> merged into panama/vectorIntrinsics branch yet.
>
> Regarding mvp3, unfortunately, the Vector::slice(int origin, Vector<E> v1)
> overload is not intrinsified yet and the call eventually ends in
> ByteVector::sliceTemplate() [3] which performs naive copy between the
> arrays backing vectors. So, it additionally suffers from box allocation
> overhead. Hopefully, it'll be fixed soon.
>
> Best regards,
> Vladimir Ivanov
>
> [1] https://bugs.openjdk.java.net/browse/JDK-8257165
> [2] https://bugs.openjdk.java.net/browse/JDK-8257057
>
> [3] src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java:
>
>     /*package-private*/
>     final
>     @ForceInline
>     ByteVector sliceTemplate(int origin, Vector<Byte> v1) {
>         ByteVector that = (ByteVector) v1;
>         that.check(this);
>         byte[] a0 = this.vec();
>         byte[] a1 = that.vec();
>         byte[] res = new byte[a0.length];
>         int vlen = res.length;
>         int firstPart = vlen - origin;
>         System.arraycopy(a0, origin, res, 0, firstPart);
>         System.arraycopy(a1, 0, res, firstPart, origin);
>         return vectorFactory(res);
>     }
>
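To make the lane arithmetic in the fallback concrete, here is a scalar model of the same two-copy scheme on plain byte arrays. It is a sketch only: SliceSketch and its slice method are hypothetical names, the bounds checks are my own addition, and nothing here touches the Vector API itself.

```java
import java.util.Arrays;

final class SliceSketch {
    // Scalar model of slice(origin, v1): lanes [origin, vlen) of a0
    // followed by the first `origin` lanes of a1.
    static byte[] slice(byte[] a0, int origin, byte[] a1) {
        if (a0.length != a1.length) {
            throw new IllegalArgumentException("lane counts differ");
        }
        if (origin < 0 || origin > a0.length) {
            throw new IndexOutOfBoundsException("origin out of range: " + origin);
        }
        int vlen = a0.length;
        byte[] res = new byte[vlen];
        int firstPart = vlen - origin;
        System.arraycopy(a0, origin, res, 0, firstPart); // tail of a0
        System.arraycopy(a1, 0, res, firstPart, origin); // head of a1
        return res;
    }

    public static void main(String[] args) {
        byte[] prev = new byte[16];
        byte[] next = new byte[16];
        for (int i = 0; i < 16; i++) {
            prev[i] = (byte) i;
            next[i] = (byte) (100 + i);
        }
        // Same origin as the benchmark: the last two lanes of prev,
        // then the first fourteen lanes of next.
        System.out.println(Arrays.toString(slice(prev, 14, next)));
    }
}
```

Each call allocates a fresh result array, which is consistent with the allocation rate reported for mvp3.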
> On 22.12.2020 13:55, August Nagro wrote:
> > Ok, I started benchmarking in JMH piece by piece, and can share two
> > findings so far:
> >
> > 1. A single branch in the loop decimates performance (far more than in C++):
> >
> > 5745.155 Ops/sec:
> > @Benchmark
> > public boolean mvp() {
> >     VectorSpecies<Byte> species = ByteVector.SPECIES_128;
> >
> >     var res = ByteVector.zero(species);
> >     boolean hasNegs = false;
> >
> >     int i = 0;
> >     for (; i < species.loopBound(buf.length); i += species.length()) {
> >         var input = ByteVector.fromArray(species, buf, i);
> >         hasNegs = !input.test(IS_NEGATIVE).anyTrue();
> >         if (hasNegs) {
> >             res = res.lanewise(OR, input);
> >         }
> >     }
> >
> >     return hasNegs && res.test(IS_DEFAULT).allTrue();
> > }
> >
> > Without a branch, 75733.575 Ops/sec:
> >
> > @Benchmark
> > public boolean mvp() {
> >     VectorSpecies<Byte> species = ByteVector.SPECIES_128;
> >
> >     var res = ByteVector.zero(species);
> >     boolean hasNegs = false;
> >
> >     int i = 0;
> >     for (; i < species.loopBound(buf.length); i += species.length()) {
> >         var input = ByteVector.fromArray(species, buf, i);
> >         hasNegs = !input.test(IS_NEGATIVE).anyTrue();
> >         res = res.lanewise(OR, input);
> >     }
> >
> >     return hasNegs && res.test(IS_DEFAULT).allTrue();
> > }
> >
> >
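For comparison, the same idea in scalar form: accumulate unconditionally inside the loop and test once after it. This is not either benchmark above, just a small sketch of the branch-free pattern; BranchFreeSketch and its method names are hypothetical, and no performance claim is implied.

```java
final class BranchFreeSketch {
    // Branchy version: a data-dependent test on every element.
    static boolean anyNegativeBranchy(byte[] buf) {
        for (byte b : buf) {
            if (b < 0) {
                return true;
            }
        }
        return false;
    }

    // Branch-free version: OR every byte into an accumulator and
    // inspect the sign bit (bit 7) once, after the loop.
    static boolean anyNegativeBranchFree(byte[] buf) {
        int acc = 0;
        for (byte b : buf) {
            acc |= b; // sign extension keeps bit 7 set for negative bytes
        }
        return (acc & 0x80) != 0;
    }

    public static void main(String[] args) {
        byte[] ascii = {72, 105, 33};
        byte[] mixed = {72, (byte) 0xC3, (byte) 0xA9};
        System.out.println(anyNegativeBranchFree(ascii));
        System.out.println(anyNegativeBranchFree(mixed));
    }
}
```

Both methods return the same answer for any input; the second simply defers the decision until the whole buffer has been folded into one value.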
> > 2. slice() is very slow, 2669.239 Ops/sec:
> > @Benchmark
> > public ByteVector mvp() {
> >     VectorSpecies<Byte> species = ByteVector.SPECIES_128;
> >     ByteVector x = ByteVector.zero(species);
> >
> >     int i = 0;
> >     for (; i < species.loopBound(buf.length); i += species.length()) {
> >         var input = ByteVector.fromArray(species, buf, i);
> >         x = x.slice(species.length() - 2, input);
> >     }
> >
> >     return x;
> > }
> >
> >
> > There were intrinsic failures for #1, but I must have been reading the
> > output wrong, because the same `@ <number>` shows up later and is
> > intrinsified.
> >
> > - August
> >
> > On Tue, Dec 22, 2020 at 1:04 AM August Nagro <augustnagro at gmail.com> wrote:
> >>
> >> Hello,
> >>
> >> I've been playing around with the vector api, and am trying to debug
> >> some performance problems.
> >>
> >> I'm on Linux x86, and it seems like some of the ops on ByteVector128
> >> are not being intrinsified. When I print the assembly with
> >> -XX:+PrintAssembly, I noticed that the ymm registers are not being
> >> used much / were close together. After -XX:+PrintIntrinsics I can see
> >> that there are some failures among the successes:
> >>
> >> @ 245 jdk.internal.vm.vector.VectorSupport::binaryOp (36 bytes) failed to inline (intrinsic)
> >>
> >> @ 52 jdk.internal.vm.vector.VectorSupport::compare (40 bytes) failed to inline (intrinsic)
> >>
> >> ** missing constant: vclass=ConP etype=ConP vlen=ConI idx=Parm
> >> @ 16 jdk.internal.vm.vector.VectorSupport::extract (35 bytes) failed to inline (intrinsic)
> >>
> >> ** missing constant: opr=RShiftI vclass=ConP etype=ConP vlen=ConI
> >> @ 106 java.lang.Object::getClass (0 bytes) (intrinsic)
> >> @ 134 jdk.internal.vm.vector.VectorSupport::broadcastInt (36 bytes) failed to inline (intrinsic)
> >>
> >>
> >> Here is one piece of hot code that may be a problem:
> >> byteVector128.test(IS_NEGATIVE). Inspecting the source, I see an
> >> IS_NEGATIVE test is passed to ByteVector::testTemplate, which calls
> >> `bits.compare(LT, 0)`. Eventually it reaches the intrinsified
> >> VectorSupport::compare.
> >>
> >> I have never looked at hotspot intrinsic code before so bear with me.
> >> In vmIntrinsics.hpp, _VectorCompare is the name of the template. I
> >> don't understand where to go from here. However, I did notice the file
> >> share/prims/vectorSupport.hpp, which is missing BT_lt and the other
> >> comparison constants defined in VectorSupport.java.
> >>
> >> Am I on the right path here? And finally, is there a way to tell if
> >> Vector boxing is occurring?
> >>
> >> Regards,
> >>
> >> August
>