Vector Intrinsics & Boxing Problems?

Wed Dec 23 21:14:43 UTC 2020

Awesome, I can confirm the patch works!

I will continue to test.

On Wed, Dec 23, 2020 at 8:02 AM Vladimir Ivanov <
vladimir.x.ivanov at oracle.com> wrote:
>
> Hi August,
>
> > The branch-free and branchy implementations are about the same in the
> > ZEROS and TRIVIAL case. Somewhat shockingly, vectorBranchFree *gets 2x
>
> Interesting observation. I spotted that isAllPositive is not used inside
> the loop. Maybe C2 can eliminate it from the loop body and that leads to
> the observed speedup.
>
> > faster* in the HARDER and HARDEST cases. Meanwhile vectorBranchy flogs
> > itself.
>
> Good, I'm able to reproduce it.
>
> Please, give the following patch a try:
>    https://github.com/openjdk/jdk/compare/master...iwanowww:vector.phi
>
> With the patch applied I don't see any allocations when running your
> benchmarks:
>    vectorBranchFree    ZEROS  18578.763 ± 1373.876   ops/s
>    vectorBranchFree  TRIVIAL  18710.676 ± 1147.767   ops/s
>    vectorBranchFree   HARDER  38074.971 ± 4124.170   ops/s
>    vectorBranchFree  HARDEST  36519.474 ± 1314.910   ops/s
>
>    vectorBranchy       ZEROS  18760.688 ± 1043.267   ops/s
>    vectorBranchy     TRIVIAL  17901.663 ± 1304.536   ops/s
>    vectorBranchy      HARDER  17073.734 ± 1832.655   ops/s
>    vectorBranchy     HARDEST  12601.917 ±  778.448   ops/s
>
> Best regards,
> Vladimir Ivanov
>
> >
> > Here is a complete JMH code:
> >
> > import jdk.incubator.vector.ByteVector;
> > import jdk.incubator.vector.VectorSpecies;
> > import org.openjdk.jmh.annotations.*;
> >
> > import java.util.Arrays;
> > import java.util.Random;
> >
> > import static jdk.incubator.vector.VectorOperators.*;
> >
> > @BenchmarkMode(Mode.Throughput)
> > @State(Scope.Benchmark)
> > @Measurement(time = 1, iterations = 5)
> > @Fork(
> >      value = 1,
> >      jvmArgsPrepend = {
> >          "--enable-preview",
> >          "--add-modules=jdk.incubator.vector",
> > //        "-XX:+UnlockDiagnosticVMOptions",
> > //        "-XX:+PrintIntrinsics",
> > //        "-XX:-TieredCompilation"
> >      }
> > )
> > public class OneBranchTooManyOnCharliesTree {
> >
> >    private static final long SEED = 7644497507502289649L;
> >    private static final VectorSpecies<Byte> SPECIES =
ByteVector.SPECIES_128;
> >
> >    @Param({"ZEROS", "TRIVIAL", "HARDER", "HARDEST"})
> >    String branchPrediction;
> >
> >    byte[] buf;
> >
> >    @Setup(Level.Trial)
> >    public void setup() {
> >      // 5_000 128 bit vectors
> >      buf = new byte[640_000];
> >
> >      // 0/100 chance the bytes of a vector are negative
> >      if ("TRIVIAL".equals(branchPrediction)) {
> >        var r = new Random(SEED);
> >        for (int i = 0; i < buf.length; i += 128) {
> >          byte b = (byte) r.nextInt(128);
> >          Arrays.fill(buf, i, i + 128, b);
> >        }
> >
> >      // 1/100 chance the bytes of a vector are negative
> >      // harder to predict than all 0's
> >      } else if ("HARDER".equals(branchPrediction)) {
> >        var r = new Random(SEED);
> >        for (int i = 0; i < buf.length; i += 128) {
> >          byte b = (byte) r.nextInt(128);
> >          if (r.nextInt(100) == 0) b = (byte) -b;
> >          Arrays.fill(buf, i, i + 128, b);
> >        }
> >
> >      // 1/2 chance the bytes of a vector are negative
> >      // hardest / impossible to predict.
> >      } else if ("HARDEST".equals(branchPrediction)) {
> >        var r = new Random(SEED);
> >        for (int i = 0; i < buf.length; i += 128) {
> >          byte b = (byte) r.nextInt(128);
> >          if (r.nextBoolean()) b = (byte) -b;
> >          Arrays.fill(buf, i, i + 128, b);
> >        }
> >      }
> >    }
> >
> >    @Benchmark
> >    public boolean scalar() {
> >      boolean isAllPositive = false;
> >
> >      for (int i = 0; i < 5_000; i += 128) {
> >        boolean vectorHasNegatives = false;
> >        for (int j = 0; j < 128; j++) {
> >          if (buf[i + j] < 0) vectorHasNegatives = true;
> >        }
> >        isAllPositive = !vectorHasNegatives;
> >      }
> >
> >      return isAllPositive;
> >    }
> >
> >    @Benchmark
> >    public boolean vectorBranchFree() {
> >      var res = ByteVector.zero(SPECIES);
> >      boolean isAllPositive = false;
> >
> >      int i = 0;
> >      for (; i < SPECIES.loopBound(buf.length); i += SPECIES.length()) {
> >        var input = ByteVector.fromArray(SPECIES, buf, i);
> >        isAllPositive = !input.test(IS_NEGATIVE).anyTrue();
> >        res = res.lanewise(OR, input);
> >      }
> >
> >      return isAllPositive && res.test(IS_DEFAULT).allTrue();
> >    }
> >
> >    @Benchmark
> >    public boolean vectorBranchy() {
> >      var res = ByteVector.zero(SPECIES);
> >      boolean isAllPositive = false;
> >
> >      int i = 0;
> >      for (; i < SPECIES.loopBound(buf.length); i += SPECIES.length()) {
> >        var input = ByteVector.fromArray(SPECIES, buf, i);
> >        isAllPositive = !input.test(IS_NEGATIVE).anyTrue();
> >        if (isAllPositive) {
> >          res = res.lanewise(OR, input);
> >        }
> >      }
> >
> >      return isAllPositive && res.test(IS_DEFAULT).allTrue();
> >    }
> > }
> >
> > I have tried a few different times and with different SEEDs too.
> > Curious to know if you can reproduce.
> >
> > Regards,
> >
> > August
> >
> > On Tue, Dec 22, 2020 at 3:19 PM August Nagro <augustnagro at gmail.com>
wrote:
> >>
> >> Thank you Vladimir,
> >>
> >> I will try again with latest jdk 16.
> >>
> >>
> >> On Tue, Dec 22, 2020, 3:03 PM Vladimir Ivanov <
vladimir.x.ivanov at oracle.com> wrote:
> >>>
> >>> Hi August,
> >>>
> >>> Thanks a lot for the benchmarks.
> >>>
> >>> I gave it a try with latest jdk16 and observed the following numbers:
> >>>
> >>> mvp1                12145580.543 ±  462197.858   ops/s
> >>> mvp1:·gc.alloc.rate       ≈ 10⁻⁴                MB/sec
> >>>
> >>> mvp2                11813978.510 ± 1063694.922   ops/s
> >>> mvp2:·gc.alloc.rate       ≈ 10⁻⁴                MB/sec
> >>>
> >>> mvp3                  171990.456 ±   41828.282   ops/s
> >>> mvp3:·gc.alloc.rate      566.613 ±     138.728  MB/sec
> >>>
> >>> The numbers for mvp1 and mvp2 are comparable and both benchmarks don't
> >>> suffer from boxing. Additionally, I took a look at the inlining log
and
> >>> all the operations are inlined/intrinsified nicely.
> >>>
> >>> Regarding mvp1 vs mvp2 difference you see, I believe it is already
fixed
> >>> in the mainline by JDK-8257165 [1] and JDK-8257057 [2], but hasn't
been
> >>> merged into panama/vectorIntrinsics branch yet.
> >>>
> >>> Regarding mvp3, unfortunately, Vector::slice(int origin, Vector<E> v1)
> >>> overload is not intrinsified yet and the call eventually ends in
> >>> ByteVector::sliceTemplate() [3] which performs naive copy between the
> >>> arrays backing vectors. So, it additionally suffers from box
allocation
> >>> overhead. Hopefully, it'll be fixed soon.
> >>>
> >>> Best regards,
> >>> Vladimir Ivanov
> >>>
> >>> [1] https://bugs.openjdk.java.net/browse/JDK-8257165
> >>> [2] https://bugs.openjdk.java.net/browse/JDK-8257057
> >>>
> >>> [3]
> >>>
src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java:
> >>>
> >>>
> >>>       /*package-private*/
> >>>       final
> >>>       @ForceInline
> >>>       ByteVector sliceTemplate(int origin, Vector<Byte> v1) {
> >>>           ByteVector that = (ByteVector) v1;
> >>>           that.check(this);
> >>>           byte[] a0 = this.vec();
> >>>           byte[] a1 = that.vec();
> >>>           byte[] res = new byte[a0.length];
> >>>           int vlen = res.length;
> >>>           int firstPart = vlen - origin;
> >>>           System.arraycopy(a0, origin, res, 0, firstPart);
> >>>           System.arraycopy(a1, 0, res, firstPart, origin);
> >>>           return vectorFactory(res);
> >>>       }
> >>>
> >>> On 22.12.2020 13:55, August Nagro wrote:
> >>>> Ok, I started benchmarking in JMH piece by piece, and can share two
> >>>> findings so far:
> >>>>
> >>>> 1. A single branch in the loop decimates performance (way more than
c++):
> >>>>
> >>>> 5745.155 Ops/sec:
> >>>> @Benchmark
> >>>> public boolean mvp() {
> >>>>     VectorSpecies<Byte> species = ByteVector.SPECIES_128;
> >>>>
> >>>>     var res = ByteVector.zero(species);
> >>>>     boolean hasNegs = false;
> >>>>
> >>>>     int i = 0;
> >>>>     for (; i < species.loopBound(buf.length); i += species.length())
{
> >>>>       var input = ByteVector.fromArray(species, buf, i);
> >>>>       hasNegs = !input.test(IS_NEGATIVE).anyTrue();
> >>>>       if (hasNegs) {
> >>>>         res = res.lanewise(OR, input);
> >>>>       }
> >>>>     }
> >>>>
> >>>>     return hasNegs && res.test(IS_DEFAULT).allTrue();
> >>>> }
> >>>>
> >>>> Without a branch, 75733.575 Ops/sec:
> >>>>
> >>>> @Benchmark
> >>>> public boolean mvp() {
> >>>>     VectorSpecies<Byte> species = ByteVector.SPECIES_128;
> >>>>
> >>>>     var res = ByteVector.zero(species);
> >>>>     boolean hasNegs = false;
> >>>>
> >>>>     int i = 0;
> >>>>     for (; i < species.loopBound(buf.length); i += species.length())
{
> >>>>       var input = ByteVector.fromArray(species, buf, i);
> >>>>       hasNegs = !input.test(IS_NEGATIVE).anyTrue();
> >>>>       res = res.lanewise(OR, input);
> >>>>     }
> >>>>
> >>>>     return hasNegs && res.test(IS_DEFAULT).allTrue();
> >>>> }
> >>>>
> >>>>
> >>>> 2. slice() is very slow, 2669.239 Ops/sec:
> >>>> @Benchmark
> >>>> public boolean mvp() {
> >>>>       VectorSpecies<Byte> species = ByteVector.SPECIES_128;
> >>>>       ByteVector x = ByteVector.zero(species);
> >>>>
> >>>>       int i = 0;
> >>>>       for (; i < species.loopBound(buf.length); i +=
species.length()) {
> >>>>         var input = ByteVector.fromArray(species, buf, i);
> >>>>         x = x.slice(species.length() - 2, input);
> >>>>       }
> >>>>
> >>>>       return x;
> >>>> }
> >>>>
> >>>>
> >>>> There were intrinsic failures for #1, but I must have been reading
the
> >>>> output wrong, because the same `@ <number>` shows up later and is
> >>>> intensified.
> >>>>
> >>>> - August
> >>>>
> >>>> On Tue, Dec 22, 2020 at 1:04 AM August Nagro <augustnagro at gmail.com>
wrote:
> >>>>>
> >>>>> Hello,
> >>>>>
> >>>>> I've been playing around with the vector api, and am trying to debug
> >>>>> some performance problems.
> >>>>>
> >>>>> I'm on Linux x86, and it seems like some of the ops on ByteVector128
> >>>>> are not being intrinsified. When I print the assembly with
> >>>>> -XX:+PrintAssembly, I noticed that the ymm registers are not being
> >>>>> used much / were close together. After -XX:+PrintIntrinsics I can
see
> >>>>> that there are some failures among the successes:
> >>>>>
> >>>>> @ 245   jdk.internal.vm.vector.VectorSupport::binaryOp (36 bytes)
> >>>>> failed to inline (intrinsic)
> >>>>>
> >>>>> @ 52   jdk.internal.vm.vector.VectorSupport::compare (40 bytes)
failed
> >>>>> to inline (intrinsic)
> >>>>>
> >>>>> ** missing constant: vclass=ConP etype=ConP vlen=ConI idx=Parm
> >>>>>                                 @ 16
> >>>>> jdk.internal.vm.vector.VectorSupport::extract (35 bytes)   failed to
> >>>>> inline (intrinsic)
> >>>>>
> >>>>> ** missing constant: opr=RShiftI vclass=ConP etype=ConP vlen=ConI
> >>>>>                                     @ 106
 java.lang.Object::getClass
> >>>>> (0 bytes)   (intrinsic)
> >>>>>                                     @ 134
> >>>>> jdk.internal.vm.vector.VectorSupport::broadcastInt (36 bytes)
 failed
> >>>>> to inline (intrinsic)
> >>>>>
> >>>>>
> >>>>> Here is one hot code that may be a problem:
> >>>>> byteVector128.test(IS_NEGATIVE). Inspecting the source, I see an
> >>>>> IS_NEGATIVE test is passed to ByteVector::testTemplate, which calls
> >>>>> `bits.compare(LT, 0)`. Eventually it reaches the intensified
> >>>>> VectorSupport::compare.
> >>>>>
> >>>>> I have never looked at hotspot intrinsic code before so bear with
me.
> >>>>> In vmIntrinsics.hpp, _VectorCompare is the name of the template. I
> >>>>> don't understand where to go from here. However, I did notice file
> >>>>> share/prims/vectorSupport.hpp, which is missing the BT_lt and other
> >>>>> comparison constants in VectorSupport.java.
> >>>>>
> >>>>> Am I on the right path here? And finally, is there a way to tell if
> >>>>> Vector boxing is occurring?
> >>>>>
> >>>>> Regards,
> >>>>>
> >>>>> August