Vector Intrinsics & Boxing Problems?

Wed Dec 23 16:02:05 UTC 2020

Hi August,

> The branch-free and branchy implementations are about the same in the
> ZEROS and TRIVIAL case. Somewhat shockingly, vectorBranchFree *gets 2x

Interesting observation. I spotted that isAllPositive is not used inside 
the loop. Maybe C2 can eliminate it from the loop body and that leads to 
the observed speedup.

> faster* in the HARDER and HARDEST cases. Meanwhile vectorBranchy flogs
> itself.

Good, I'm able to reproduce it.

Please, give the following patch a try:
   https://github.com/openjdk/jdk/compare/master...iwanowww:vector.phi

With the patch applied I don't see any allocations when running your 
benchmarks:
   vectorBranchFree    ZEROS  18578.763 ± 1373.876   ops/s
   vectorBranchFree  TRIVIAL  18710.676 ± 1147.767   ops/s
   vectorBranchFree   HARDER  38074.971 ± 4124.170   ops/s
   vectorBranchFree  HARDEST  36519.474 ± 1314.910   ops/s

   vectorBranchy       ZEROS  18760.688 ± 1043.267   ops/s
   vectorBranchy     TRIVIAL  17901.663 ± 1304.536   ops/s
   vectorBranchy      HARDER  17073.734 ± 1832.655   ops/s
   vectorBranchy     HARDEST  12601.917 ±  778.448   ops/s

Best regards,
Vladimir Ivanov

> 
> Here is a complete JMH code:
> 
> import jdk.incubator.vector.ByteVector;
> import jdk.incubator.vector.VectorSpecies;
> import org.openjdk.jmh.annotations.*;
> 
> import java.util.Arrays;
> import java.util.Random;
> 
> import static jdk.incubator.vector.VectorOperators.*;
> 
> @BenchmarkMode(Mode.Throughput)
> @State(Scope.Benchmark)
> @Measurement(time = 1, iterations = 5)
> @Fork(
>      value = 1,
>      jvmArgsPrepend = {
>          "--enable-preview",
>          "--add-modules=jdk.incubator.vector",
> //        "-XX:+UnlockDiagnosticVMOptions",
> //        "-XX:+PrintIntrinsics",
> //        "-XX:-TieredCompilation"
>      }
> )
> public class OneBranchTooManyOnCharliesTree {
> 
>    private static final long SEED = 7644497507502289649L;
>    private static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_128;
> 
>    @Param({"ZEROS", "TRIVIAL", "HARDER", "HARDEST"})
>    String branchPrediction;
> 
>    byte[] buf;
> 
>    @Setup(Level.Trial)
>    public void setup() {
>      // 5_000 128 bit vectors
>      buf = new byte[640_000];
> 
>      // 0/100 chance the bytes of a vector are negative
>      if ("TRIVIAL".equals(branchPrediction)) {
>        var r = new Random(SEED);
>        for (int i = 0; i < buf.length; i += 128) {
>          byte b = (byte) r.nextInt(128);
>          Arrays.fill(buf, i, i + 128, b);
>        }
> 
>      // 1/100 chance the bytes of a vector are negative
>      // harder to predict than all 0's
>      } else if ("HARDER".equals(branchPrediction)) {
>        var r = new Random(SEED);
>        for (int i = 0; i < buf.length; i += 128) {
>          byte b = (byte) r.nextInt(128);
>          if (r.nextInt(100) == 0) b = (byte) -b;
>          Arrays.fill(buf, i, i + 128, b);
>        }
> 
>      // 1/2 chance the bytes of a vector are negative
>      // hardest / impossible to predict.
>      } else if ("HARDEST".equals(branchPrediction)) {
>        var r = new Random(SEED);
>        for (int i = 0; i < buf.length; i += 128) {
>          byte b = (byte) r.nextInt(128);
>          if (r.nextBoolean()) b = (byte) -b;
>          Arrays.fill(buf, i, i + 128, b);
>        }
>      }
>    }
> 
>    @Benchmark
>    public boolean scalar() {
>      boolean isAllPositive = false;
> 
>      for (int i = 0; i < 5_000; i += 128) {
>        boolean vectorHasNegatives = false;
>        for (int j = 0; j < 128; j++) {
>          if (buf[i + j] < 0) vectorHasNegatives = true;
>        }
>        isAllPositive = !vectorHasNegatives;
>      }
> 
>      return isAllPositive;
>    }
> 
>    @Benchmark
>    public boolean vectorBranchFree() {
>      var res = ByteVector.zero(SPECIES);
>      boolean isAllPositive = false;
> 
>      int i = 0;
>      for (; i < SPECIES.loopBound(buf.length); i += SPECIES.length()) {
>        var input = ByteVector.fromArray(SPECIES, buf, i);
>        isAllPositive = !input.test(IS_NEGATIVE).anyTrue();
>        res = res.lanewise(OR, input);
>      }
> 
>      return isAllPositive && res.test(IS_DEFAULT).allTrue();
>    }
> 
>    @Benchmark
>    public boolean vectorBranchy() {
>      var res = ByteVector.zero(SPECIES);
>      boolean isAllPositive = false;
> 
>      int i = 0;
>      for (; i < SPECIES.loopBound(buf.length); i += SPECIES.length()) {
>        var input = ByteVector.fromArray(SPECIES, buf, i);
>        isAllPositive = !input.test(IS_NEGATIVE).anyTrue();
>        if (isAllPositive) {
>          res = res.lanewise(OR, input);
>        }
>      }
> 
>      return isAllPositive && res.test(IS_DEFAULT).allTrue();
>    }
> }
> 
> I have tried a few different times and with different SEEDs too.
> Curious to know if you can reproduce.
> 
> Regards,
> 
> August
> 
> On Tue, Dec 22, 2020 at 3:19 PM August Nagro <augustnagro at gmail.com> wrote:
>>
>> Thank you Vladimir,
>>
>> I will try again with latest jdk 16.
>>
>>
>> On Tue, Dec 22, 2020, 3:03 PM Vladimir Ivanov <vladimir.x.ivanov at oracle.com> wrote:
>>>
>>> Hi August,
>>>
>>> Thanks a lot for the benchmarks.
>>>
>>> I gave it a try with latest jdk16 and observed the following numbers:
>>>
>>> mvp1                12145580.543 ±  462197.858   ops/s
>>> mvp1:·gc.alloc.rate       ≈ 10⁻⁴                MB/sec
>>>
>>> mvp2                11813978.510 ± 1063694.922   ops/s
>>> mvp2:·gc.alloc.rate       ≈ 10⁻⁴                MB/sec
>>>
>>> mvp3                  171990.456 ±   41828.282   ops/s
>>> mvp3:·gc.alloc.rate      566.613 ±     138.728  MB/sec
>>>
>>> The numbers for mvp1 and mvp2 are comparable and both benchmarks don't
>>> suffer from boxing. Additionally, I took a look at the inlining log and
>>> all the operations are inlined/intrinsified nicely.
>>>
>>> Regarding mvp1 vs mvp2 difference you see, I believe it is already fixed
>>> in the mainline by JDK-8257165 [1] and JDK-8257057 [2], but hasn't been
>>> merged into panama/vectorIntrinsics branch yet.
>>>
>>> Regarding mvp3, unfortunately, Vector::slice(int origin, Vector<E> v1)
>>> overload is not intrinsified yet and the call eventually ends in
>>> ByteVector::sliceTemplate() [3] which performs naive copy between the
>>> arrays backing vectors. So, it additionally suffers from box allocation
>>> overhead. Hopefully, it'll be fixed soon.
>>>
>>> Best regards,
>>> Vladimir Ivanov
>>>
>>> [1] https://bugs.openjdk.java.net/browse/JDK-8257165
>>> [2] https://bugs.openjdk.java.net/browse/JDK-8257057
>>>
>>> [3]
>>> src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java:
>>>
>>>
>>>       /*package-private*/
>>>       final
>>>       @ForceInline
>>>       ByteVector sliceTemplate(int origin, Vector<Byte> v1) {
>>>           ByteVector that = (ByteVector) v1;
>>>           that.check(this);
>>>           byte[] a0 = this.vec();
>>>           byte[] a1 = that.vec();
>>>           byte[] res = new byte[a0.length];
>>>           int vlen = res.length;
>>>           int firstPart = vlen - origin;
>>>           System.arraycopy(a0, origin, res, 0, firstPart);
>>>           System.arraycopy(a1, 0, res, firstPart, origin);
>>>           return vectorFactory(res);
>>>       }
>>>
>>> On 22.12.2020 13:55, August Nagro wrote:
>>>> Ok, I started benchmarking in JMH piece by piece, and can share two
>>>> findings so far:
>>>>
>>>> 1. A single branch in the loop decimates performance (way more than c++):
>>>>
>>>> 5745.155 Ops/sec:
>>>> @Benchmark
>>>> public boolean mvp() {
>>>>     VectorSpecies<Byte> species = ByteVector.SPECIES_128;
>>>>
>>>>     var res = ByteVector.zero(species);
>>>>     boolean hasNegs = false;
>>>>
>>>>     int i = 0;
>>>>     for (; i < species.loopBound(buf.length); i += species.length()) {
>>>>       var input = ByteVector.fromArray(species, buf, i);
>>>>       hasNegs = !input.test(IS_NEGATIVE).anyTrue();
>>>>       if (hasNegs) {
>>>>         res = res.lanewise(OR, input);
>>>>       }
>>>>     }
>>>>
>>>>     return hasNegs && res.test(IS_DEFAULT).allTrue();
>>>> }
>>>>
>>>> Without a branch, 75733.575 Ops/sec:
>>>>
>>>> @Benchmark
>>>> public boolean mvp() {
>>>>     VectorSpecies<Byte> species = ByteVector.SPECIES_128;
>>>>
>>>>     var res = ByteVector.zero(species);
>>>>     boolean hasNegs = false;
>>>>
>>>>     int i = 0;
>>>>     for (; i < species.loopBound(buf.length); i += species.length()) {
>>>>       var input = ByteVector.fromArray(species, buf, i);
>>>>       hasNegs = !input.test(IS_NEGATIVE).anyTrue();
>>>>       res = res.lanewise(OR, input);
>>>>     }
>>>>
>>>>     return hasNegs && res.test(IS_DEFAULT).allTrue();
>>>> }
>>>>
>>>>
>>>> 2. slice() is very slow, 2669.239 Ops/sec:
>>>> @Benchmark
>>>> public boolean mvp() {
>>>>       VectorSpecies<Byte> species = ByteVector.SPECIES_128;
>>>>       ByteVector x = ByteVector.zero(species);
>>>>
>>>>       int i = 0;
>>>>       for (; i < species.loopBound(buf.length); i += species.length()) {
>>>>         var input = ByteVector.fromArray(species, buf, i);
>>>>         x = x.slice(species.length() - 2, input);
>>>>       }
>>>>
>>>>       return x;
>>>> }
>>>>
>>>>
>>>> There were intrinsic failures for #1, but I must have been reading the
>>>> output wrong, because the same `@ <number>` shows up later and is
>>>> intensified.
>>>>
>>>> - August
>>>>
>>>> On Tue, Dec 22, 2020 at 1:04 AM August Nagro <augustnagro at gmail.com> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> I've been playing around with the vector api, and am trying to debug
>>>>> some performance problems.
>>>>>
>>>>> I'm on Linux x86, and it seems like some of the ops on ByteVector128
>>>>> are not being intrinsified. When I print the assembly with
>>>>> -XX:+PrintAssembly, I noticed that the ymm registers are not being
>>>>> used much / were close together. After -XX:+PrintIntrinsics I can see
>>>>> that there are some failures among the successes:
>>>>>
>>>>> @ 245   jdk.internal.vm.vector.VectorSupport::binaryOp (36 bytes)
>>>>> failed to inline (intrinsic)
>>>>>
>>>>> @ 52   jdk.internal.vm.vector.VectorSupport::compare (40 bytes) failed
>>>>> to inline (intrinsic)
>>>>>
>>>>> ** missing constant: vclass=ConP etype=ConP vlen=ConI idx=Parm
>>>>>                                 @ 16
>>>>> jdk.internal.vm.vector.VectorSupport::extract (35 bytes)   failed to
>>>>> inline (intrinsic)
>>>>>
>>>>> ** missing constant: opr=RShiftI vclass=ConP etype=ConP vlen=ConI
>>>>>                                     @ 106   java.lang.Object::getClass
>>>>> (0 bytes)   (intrinsic)
>>>>>                                     @ 134
>>>>> jdk.internal.vm.vector.VectorSupport::broadcastInt (36 bytes)   failed
>>>>> to inline (intrinsic)
>>>>>
>>>>>
>>>>> Here is one hot code that may be a problem:
>>>>> byteVector128.test(IS_NEGATIVE). Inspecting the source, I see an
>>>>> IS_NEGATIVE test is passed to ByteVector::testTemplate, which calls
>>>>> `bits.compare(LT, 0)`. Eventually it reaches the intensified
>>>>> VectorSupport::compare.
>>>>>
>>>>> I have never looked at hotspot intrinsic code before so bear with me.
>>>>> In vmIntrinsics.hpp, _VectorCompare is the name of the template. I
>>>>> don't understand where to go from here. However, I did notice file
>>>>> share/prims/vectorSupport.hpp, which is missing the BT_lt and other
>>>>> comparison constants in VectorSupport.java.
>>>>>
>>>>> Am I on the right path here? And finally, is there a way to tell if
>>>>> Vector boxing is occurring?
>>>>>
>>>>> Regards,
>>>>>
>>>>> August