Vector Intrinsics & Boxing Problems?

August Nagro augustnagro at gmail.com
Tue Dec 22 10:55:27 UTC 2020


Ok, I started benchmarking in JMH piece by piece, and can share two
findings so far:

1. A single branch in the loop decimates performance (far more than it
would in C++):

With the branch, 5745.155 Ops/sec:
@Benchmark
public boolean mvp() {
  VectorSpecies<Byte> species = ByteVector.SPECIES_128;

  var res = ByteVector.zero(species);
  boolean hasNegs = false;

  int i = 0;
  for (; i < species.loopBound(buf.length); i += species.length()) {
    var input = ByteVector.fromArray(species, buf, i);
    hasNegs = !input.test(IS_NEGATIVE).anyTrue();
    if (hasNegs) {
      res = res.lanewise(OR, input);
    }
  }

  return hasNegs && res.test(IS_DEFAULT).allTrue();
}

Without a branch, 75733.575 Ops/sec:

@Benchmark
public boolean mvp() {
  VectorSpecies<Byte> species = ByteVector.SPECIES_128;

  var res = ByteVector.zero(species);
  boolean hasNegs = false;

  int i = 0;
  for (; i < species.loopBound(buf.length); i += species.length()) {
    var input = ByteVector.fromArray(species, buf, i);
    hasNegs = !input.test(IS_NEGATIVE).anyTrue();
    res = res.lanewise(OR, input);
  }

  return hasNegs && res.test(IS_DEFAULT).allTrue();
}
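
For cross-checking what the loop above computes, here is a minimal scalar
sketch of the branch-free version. The names ScalarModel and mvpScalar are
made up for illustration. It assumes SPECIES_128 holds 16 byte lanes and that
loopBound rounds the length down to a multiple of the lane count. It keeps the
flag exactly as written in the benchmark, so hasNegs ends up true when the
last full chunk contains no negative byte, and only that last chunk decides
its final value.

```java
final class ScalarModel {
    // Assumption: ByteVector.SPECIES_128 has 16 byte lanes.
    static final int LANES = 16;

    // Hypothetical scalar stand-in for the branch-free benchmark body.
    static boolean mvpScalar(byte[] buf) {
        // Same role as species.loopBound(buf.length): round down to a multiple of LANES.
        int bound = buf.length - (buf.length % LANES);
        byte orAll = 0;          // all lanes of `res` folded into one byte
        boolean hasNegs = false; // overwritten per chunk, as in the benchmark

        for (int i = 0; i < bound; i += LANES) {
            boolean anyNegative = false;
            for (int lane = 0; lane < LANES; lane++) {
                byte b = buf[i + lane];
                anyNegative |= b < 0; // IS_NEGATIVE on one lane
                orAll |= b;           // res = res OR input
            }
            hasNegs = !anyNegative;   // mirrors !input.test(IS_NEGATIVE).anyTrue()
        }

        // res.test(IS_DEFAULT).allTrue() holds exactly when every OR-ed lane is zero,
        // which is the same as the fold of all lanes being zero.
        return hasNegs && orAll == 0;
    }

    public static void main(String[] args) {
        byte[] zeros = new byte[32];
        byte[] oneSet = new byte[32];
        oneSet[3] = 1;
        System.out.println("all zeros: " + mvpScalar(zeros));
        System.out.println("one byte set: " + mvpScalar(oneSet));
    }
}
```

Comparing the vector benchmark's result against this model on the same buf is
a cheap way to confirm that both benchmark variants still compute the same
thing before comparing their throughput.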


2. slice() is very slow, 2669.239 Ops/sec:
@Benchmark
public ByteVector mvp() {
    VectorSpecies<Byte> species = ByteVector.SPECIES_128;
    ByteVector x = ByteVector.zero(species);

    int i = 0;
    for (; i < species.loopBound(buf.length); i += species.length()) {
      var input = ByteVector.fromArray(species, buf, i);
      x = x.slice(species.length() - 2, input);
    }

    return x;
}
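
To make clear what each slice() call in that loop is asked to produce, here
is a small scalar sketch of the two-vector slice as I understand its
documented behavior: the result is a window of VLENGTH lanes taken from the
concatenation of x and the following vector, starting at the origin lane.
SliceModel and its method are hypothetical names, and plain byte arrays stand
in for the vectors.

```java
import java.util.Arrays;

final class SliceModel {
    // Scalar model of x.slice(origin, next): a window of x.length lanes over
    // the concatenation x ++ next, starting at lane `origin`.
    static byte[] slice(byte[] x, int origin, byte[] next) {
        int lanes = x.length;
        // This model only accepts equal lengths and an origin in 0..lanes.
        if (next.length != lanes || origin < 0 || origin > lanes) {
            throw new IllegalArgumentException("bad origin or mismatched lengths");
        }
        byte[] out = new byte[lanes];
        for (int i = 0; i < lanes; i++) {
            int src = origin + i;
            // Lanes before the end of x come from x, the rest from next.
            out[i] = src < lanes ? x[src] : next[src - lanes];
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] x = new byte[16];
        byte[] next = new byte[16];
        for (int i = 0; i < 16; i++) {
            x[i] = (byte) i;
            next[i] = (byte) (100 + i);
        }
        // Same origin as the benchmark: species.length() - 2 with 16 lanes.
        System.out.println(Arrays.toString(slice(x, 14, next)));
    }
}
```

With origin = species.length() - 2, each iteration keeps the last two lanes of
the previous x and fills the remaining lanes from the start of input.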


There were intrinsic failures for #1, but I must have been reading the
output wrong, because the same `@ <number>` shows up later and is
intrinsified.
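
Since the same call site can show up both as a failure and as a success,
tallying the log per site makes it easier to read. Below is a rough sketch of
such a tally. IntrinsicLogSummary is a hypothetical helper, and the pattern
assumes each entry sits on one line in the shape quoted in this thread
("@ <number>   <method> (<n> bytes)   ... (intrinsic)"); other HotSpot builds
may format these lines differently.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IntrinsicLogSummary {
    // Assumed entry shape, taken from the log lines quoted in this thread.
    private static final Pattern ENTRY = Pattern.compile(
        "@\\s+(\\d+)\\s+(\\S+)\\s+\\(\\d+ bytes\\)\\s+"
        + "(failed to inline \\(intrinsic\\)|\\(intrinsic\\))");

    // Maps "<number> <method>" to {failures, successes}, in first-seen order.
    static Map<String, int[]> summarize(List<String> lines) {
        Map<String, int[]> counts = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = ENTRY.matcher(line);
            if (!m.find()) {
                continue; // not an intrinsic entry, e.g. a "** missing constant" line
            }
            String site = m.group(1) + " " + m.group(2);
            int[] c = counts.computeIfAbsent(site, k -> new int[2]);
            if (m.group(3).startsWith("failed")) {
                c[0]++;
            } else {
                c[1]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> log = List.of(
            "@ 52   jdk.internal.vm.vector.VectorSupport::compare (40 bytes)   failed to inline (intrinsic)",
            "@ 52   jdk.internal.vm.vector.VectorSupport::compare (40 bytes)   (intrinsic)");
        summarize(log).forEach((site, c) ->
            System.out.println(site + " failed=" + c[0] + " ok=" + c[1]));
    }
}
```

A site that only ever appears with failures is the interesting one; a site
with at least one success was intrinsified in some later compilation.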

- August

On Tue, Dec 22, 2020 at 1:04 AM August Nagro <augustnagro at gmail.com> wrote:
>
> Hello,
>
> I've been playing around with the vector api, and am trying to debug
> some performance problems.
>
> I'm on Linux x86, and it seems like some of the ops on ByteVector128
> are not being intrinsified. When I printed the assembly with
> -XX:+PrintAssembly, I noticed that the ymm registers were not being
> used much / were close together. After adding -XX:+PrintIntrinsics I
> can see that there are some failures among the successes:
>
> @ 245   jdk.internal.vm.vector.VectorSupport::binaryOp (36 bytes)   failed to inline (intrinsic)
>
> @ 52   jdk.internal.vm.vector.VectorSupport::compare (40 bytes)   failed to inline (intrinsic)
>
> ** missing constant: vclass=ConP etype=ConP vlen=ConI idx=Parm
>                               @ 16   jdk.internal.vm.vector.VectorSupport::extract (35 bytes)   failed to inline (intrinsic)
>
> ** missing constant: opr=RShiftI vclass=ConP etype=ConP vlen=ConI
>                                   @ 106   java.lang.Object::getClass (0 bytes)   (intrinsic)
>                                   @ 134   jdk.internal.vm.vector.VectorSupport::broadcastInt (36 bytes)   failed to inline (intrinsic)
>
>
> Here is one hot code path that may be a problem:
> byteVector128.test(IS_NEGATIVE). Inspecting the source, I see an
> IS_NEGATIVE test is passed to ByteVector::testTemplate, which calls
> `bits.compare(LT, 0)`. Eventually it reaches the intrinsified
> VectorSupport::compare.
>
> I have never looked at HotSpot intrinsic code before, so bear with me.
> In vmIntrinsics.hpp, _VectorCompare is the name of the template. I
> don't understand where to go from here. However, I did notice that
> share/prims/vectorSupport.hpp is missing BT_lt and the other
> comparison constants that are defined in VectorSupport.java.
>
> Am I on the right path here? And finally, is there a way to tell if
> Vector boxing is occurring?
>
> Regards,
>
> August


More information about the panama-dev mailing list