Vector Intrinsics & Boxing Problems?

Wed Dec 23 05:48:26 UTC 2020

I built JDK 17 from the latest sources and can still reproduce the
problem. I've updated the benchmark and attached it in full. (Feel
free to ignore until after the holidays.. I am bored & stuck far from
family)

Benchmark
(branchPrediction)   Mode  Cnt        Score      Error  Units
OneBranchTooMany.scalar                                  ZEROS  thrpt
  5        1135665.796 ± 1766.706  ops/s
OneBranchTooMany.scalar                                  TRIVIAL
thrpt    5       1135924.033 ± 1015.026  ops/s
OneBranchTooMany.scalar                                  HARDER  thrpt
   5     1136232.871 ±  586.668  ops/s
OneBranchTooMany.scalar                                  HARDEST
thrpt    5     612141.570 ± 1245.990  ops/s
OneBranchTooMany.vectorBranchFree               ZEROS  thrpt    5
      40475.167 ±   11.238  ops/s
OneBranchTooMany.vectorBranchFree               TRIVIAL  thrpt    5
       39494.511 ± 1322.217  ops/s
OneBranchTooMany.vectorBranchFree               HARDER  thrpt    5
    95436.486 ±  184.308  ops/s
OneBranchTooMany.vectorBranchFree               HARDEST  thrpt    5
   90152.077 ±  343.779  ops/s
OneBranchTooMany.vectorBranchy                     ZEROS  thrpt    5
        40461.448 ±    4.454  ops/s
OneBranchTooMany.vectorBranchy                     TRIVIAL  thrpt    5
         40367.820 ±  427.571  ops/s
OneBranchTooMany.vectorBranchy                     HARDER  thrpt    5
        5975.173 ±  241.141  ops/s
OneBranchTooMany.vectorBranchy                     HARDEST  thrpt    5
       6834.446 ±  195.880  ops/s

The byte array size is 160_000, which equals 5_000 128 bit ByteVectors.

ZEROS is when the array is all 0's
TRIVIAL is when the array is random but branch is always taken
HARDER is when there's a 1/100 chance the branch is not taken
HARDEST is when there's a 1/2 chance.

Please ignore the difference between scala & vector. I included scalar
only to see that, as expected, scalar performance is constant until
the branch becomes very hard to predict.

The branch-free and branchy implementations are about the same in the
ZEROS and TRIVIAL case. Somewhat shockingly, vectorBranchFree *gets 2x
faster* in the HARDER and HARDEST cases. Meanwhile vectorBranchy flogs
itself.

Here is a complete JMH code:

import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorSpecies;
import org.openjdk.jmh.annotations.*;

import java.util.Arrays;
import java.util.Random;

import static jdk.incubator.vector.VectorOperators.*;

@BenchmarkMode(Mode.Throughput)
@State(Scope.Benchmark)
@Measurement(time = 1, iterations = 5)
@Fork(
    value = 1,
    jvmArgsPrepend = {
        "--enable-preview",
        "--add-modules=jdk.incubator.vector",
//        "-XX:+UnlockDiagnosticVMOptions",
//        "-XX:+PrintIntrinsics",
//        "-XX:-TieredCompilation"
    }
)
public class OneBranchTooManyOnCharliesTree {

  private static final long SEED = 7644497507502289649L;
  private static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_128;

  @Param({"ZEROS", "TRIVIAL", "HARDER", "HARDEST"})
  String branchPrediction;

  byte[] buf;

  @Setup(Level.Trial)
  public void setup() {
    // 5_000 128 bit vectors
    buf = new byte[640_000];

    // 0/100 chance the bytes of a vector are negative
    if ("TRIVIAL".equals(branchPrediction)) {
      var r = new Random(SEED);
      for (int i = 0; i < buf.length; i += 128) {
        byte b = (byte) r.nextInt(128);
        Arrays.fill(buf, i, i + 128, b);
      }

    // 1/100 chance the bytes of a vector are negative
    // harder to predict than all 0's
    } else if ("HARDER".equals(branchPrediction)) {
      var r = new Random(SEED);
      for (int i = 0; i < buf.length; i += 128) {
        byte b = (byte) r.nextInt(128);
        if (r.nextInt(100) == 0) b = (byte) -b;
        Arrays.fill(buf, i, i + 128, b);
      }

    // 1/2 chance the bytes of a vector are negative
    // hardest / impossible to predict.
    } else if ("HARDEST".equals(branchPrediction)) {
      var r = new Random(SEED);
      for (int i = 0; i < buf.length; i += 128) {
        byte b = (byte) r.nextInt(128);
        if (r.nextBoolean()) b = (byte) -b;
        Arrays.fill(buf, i, i + 128, b);
      }
    }
  }

  @Benchmark
  public boolean scalar() {
    boolean isAllPositive = false;

    for (int i = 0; i < 5_000; i += 128) {
      boolean vectorHasNegatives = false;
      for (int j = 0; j < 128; j++) {
        if (buf[i + j] < 0) vectorHasNegatives = true;
      }
      isAllPositive = !vectorHasNegatives;
    }

    return isAllPositive;
  }

  @Benchmark
  public boolean vectorBranchFree() {
    var res = ByteVector.zero(SPECIES);
    boolean isAllPositive = false;

    int i = 0;
    for (; i < SPECIES.loopBound(buf.length); i += SPECIES.length()) {
      var input = ByteVector.fromArray(SPECIES, buf, i);
      isAllPositive = !input.test(IS_NEGATIVE).anyTrue();
      res = res.lanewise(OR, input);
    }

    return isAllPositive && res.test(IS_DEFAULT).allTrue();
  }

  @Benchmark
  public boolean vectorBranchy() {
    var res = ByteVector.zero(SPECIES);
    boolean isAllPositive = false;

    int i = 0;
    for (; i < SPECIES.loopBound(buf.length); i += SPECIES.length()) {
      var input = ByteVector.fromArray(SPECIES, buf, i);
      isAllPositive = !input.test(IS_NEGATIVE).anyTrue();
      if (isAllPositive) {
        res = res.lanewise(OR, input);
      }
    }

    return isAllPositive && res.test(IS_DEFAULT).allTrue();
  }
}

I have tried a few different times and with different SEEDs too.
Curious to know if you can reproduce.

Regards,

August

On Tue, Dec 22, 2020 at 3:19 PM August Nagro <augustnagro at gmail.com> wrote:
>
> Thank you Vladimir,
>
> I will try again with latest jdk 16.
>
>
> On Tue, Dec 22, 2020, 3:03 PM Vladimir Ivanov <vladimir.x.ivanov at oracle.com> wrote:
>>
>> Hi August,
>>
>> Thanks a lot for the benchmarks.
>>
>> I gave it a try with latest jdk16 and observed the following numbers:
>>
>> mvp1                12145580.543 ±  462197.858   ops/s
>> mvp1:·gc.alloc.rate       ≈ 10⁻⁴                MB/sec
>>
>> mvp2                11813978.510 ± 1063694.922   ops/s
>> mvp2:·gc.alloc.rate       ≈ 10⁻⁴                MB/sec
>>
>> mvp3                  171990.456 ±   41828.282   ops/s
>> mvp3:·gc.alloc.rate      566.613 ±     138.728  MB/sec
>>
>> The numbers for mvp1 and mvp2 are comparable and both benchmarks don't
>> suffer from boxing. Additionally, I took a look at the inlining log and
>> all the operations are inlined/intrinsified nicely.
>>
>> Regarding mvp1 vs mvp2 difference you see, I believe it is already fixed
>> in the mainline by JDK-8257165 [1] and JDK-8257057 [2], but hasn't been
>> merged into panama/vectorIntrinsics branch yet.
>>
>> Regarding mvp3, unfortunately, Vector::slice(int origin, Vector<E> v1)
>> overload is not intrinsified yet and the call eventually ends in
>> ByteVector::sliceTemplate() [3] which performs naive copy between the
>> arrays backing vectors. So, it additionally suffers from box allocation
>> overhead. Hopefully, it'll be fixed soon.
>>
>> Best regards,
>> Vladimir Ivanov
>>
>> [1] https://bugs.openjdk.java.net/browse/JDK-8257165
>> [2] https://bugs.openjdk.java.net/browse/JDK-8257057
>>
>> [3]
>> src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java:
>>
>>
>>      /*package-private*/
>>      final
>>      @ForceInline
>>      ByteVector sliceTemplate(int origin, Vector<Byte> v1) {
>>          ByteVector that = (ByteVector) v1;
>>          that.check(this);
>>          byte[] a0 = this.vec();
>>          byte[] a1 = that.vec();
>>          byte[] res = new byte[a0.length];
>>          int vlen = res.length;
>>          int firstPart = vlen - origin;
>>          System.arraycopy(a0, origin, res, 0, firstPart);
>>          System.arraycopy(a1, 0, res, firstPart, origin);
>>          return vectorFactory(res);
>>      }
>>
>> On 22.12.2020 13:55, August Nagro wrote:
>> > Ok, I started benchmarking in JMH piece by piece, and can share two
>> > findings so far:
>> >
>> > 1. A single branch in the loop decimates performance (way more than c++):
>> >
>> > 5745.155 Ops/sec:
>> > @Benchmark
>> > public boolean mvp() {
>> >    VectorSpecies<Byte> species = ByteVector.SPECIES_128;
>> >
>> >    var res = ByteVector.zero(species);
>> >    boolean hasNegs = false;
>> >
>> >    int i = 0;
>> >    for (; i < species.loopBound(buf.length); i += species.length()) {
>> >      var input = ByteVector.fromArray(species, buf, i);
>> >      hasNegs = !input.test(IS_NEGATIVE).anyTrue();
>> >      if (hasNegs) {
>> >        res = res.lanewise(OR, input);
>> >      }
>> >    }
>> >
>> >    return hasNegs && res.test(IS_DEFAULT).allTrue();
>> > }
>> >
>> > Without a branch, 75733.575 Ops/sec:
>> >
>> > @Benchmark
>> > public boolean mvp() {
>> >    VectorSpecies<Byte> species = ByteVector.SPECIES_128;
>> >
>> >    var res = ByteVector.zero(species);
>> >    boolean hasNegs = false;
>> >
>> >    int i = 0;
>> >    for (; i < species.loopBound(buf.length); i += species.length()) {
>> >      var input = ByteVector.fromArray(species, buf, i);
>> >      hasNegs = !input.test(IS_NEGATIVE).anyTrue();
>> >      res = res.lanewise(OR, input);
>> >    }
>> >
>> >    return hasNegs && res.test(IS_DEFAULT).allTrue();
>> > }
>> >
>> >
>> > 2. slice() is very slow, 2669.239 Ops/sec:
>> > @Benchmark
>> > public boolean mvp() {
>> >      VectorSpecies<Byte> species = ByteVector.SPECIES_128;
>> >      ByteVector x = ByteVector.zero(species);
>> >
>> >      int i = 0;
>> >      for (; i < species.loopBound(buf.length); i += species.length()) {
>> >        var input = ByteVector.fromArray(species, buf, i);
>> >        x = x.slice(species.length() - 2, input);
>> >      }
>> >
>> >      return x;
>> > }
>> >
>> >
>> > There were intrinsic failures for #1, but I must have been reading the
>> > output wrong, because the same `@ <number>` shows up later and is
>> > intensified.
>> >
>> > - August
>> >
>> > On Tue, Dec 22, 2020 at 1:04 AM August Nagro <augustnagro at gmail.com> wrote:
>> >>
>> >> Hello,
>> >>
>> >> I've been playing around with the vector api, and am trying to debug
>> >> some performance problems.
>> >>
>> >> I'm on Linux x86, and it seems like some of the ops on ByteVector128
>> >> are not being intrinsified. When I print the assembly with
>> >> -XX:+PrintAssembly, I noticed that the ymm registers are not being
>> >> used much / were close together. After -XX:+PrintIntrinsics I can see
>> >> that there are some failures among the successes:
>> >>
>> >> @ 245   jdk.internal.vm.vector.VectorSupport::binaryOp (36 bytes)
>> >> failed to inline (intrinsic)
>> >>
>> >> @ 52   jdk.internal.vm.vector.VectorSupport::compare (40 bytes) failed
>> >> to inline (intrinsic)
>> >>
>> >> ** missing constant: vclass=ConP etype=ConP vlen=ConI idx=Parm
>> >>                                @ 16
>> >> jdk.internal.vm.vector.VectorSupport::extract (35 bytes)   failed to
>> >> inline (intrinsic)
>> >>
>> >> ** missing constant: opr=RShiftI vclass=ConP etype=ConP vlen=ConI
>> >>                                    @ 106   java.lang.Object::getClass
>> >> (0 bytes)   (intrinsic)
>> >>                                    @ 134
>> >> jdk.internal.vm.vector.VectorSupport::broadcastInt (36 bytes)   failed
>> >> to inline (intrinsic)
>> >>
>> >>
>> >> Here is one hot code that may be a problem:
>> >> byteVector128.test(IS_NEGATIVE). Inspecting the source, I see an
>> >> IS_NEGATIVE test is passed to ByteVector::testTemplate, which calls
>> >> `bits.compare(LT, 0)`. Eventually it reaches the intensified
>> >> VectorSupport::compare.
>> >>
>> >> I have never looked at hotspot intrinsic code before so bear with me.
>> >> In vmIntrinsics.hpp, _VectorCompare is the name of the template. I
>> >> don't understand where to go from here. However, I did notice file
>> >> share/prims/vectorSupport.hpp, which is missing the BT_lt and other
>> >> comparison constants in VectorSupport.java.
>> >>
>> >> Am I on the right path here? And finally, is there a way to tell if
>> >> Vector boxing is occurring?
>> >>
>> >> Regards,
>> >>
>> >> August