Vector Intrinsics & Boxing Problems?

Tue Dec 22 23:01:40 UTC 2020

Hi August,

Thanks a lot for the benchmarks.

I gave it a try with the latest jdk16 and observed the following numbers:

mvp1                12145580.543 ±  462197.858   ops/s
mvp1:·gc.alloc.rate       ≈ 10⁻⁴                MB/sec

mvp2                11813978.510 ± 1063694.922   ops/s
mvp2:·gc.alloc.rate       ≈ 10⁻⁴                MB/sec

mvp3                  171990.456 ±   41828.282   ops/s
mvp3:·gc.alloc.rate      566.613 ±     138.728  MB/sec

The numbers for mvp1 and mvp2 are comparable, and neither benchmark
suffers from boxing. Additionally, I took a look at the inlining log, and
all the operations are inlined/intrinsified nicely.
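
As a rough cross-check of the gc.alloc.rate figures outside JMH, per-thread
allocation can be sampled around a piece of code. The sketch below is only
an illustration: AllocationProbe, allocatedBytes and sink are hypothetical
names, and it assumes a HotSpot JVM where the platform ThreadMXBean can be
cast to com.sun.management.ThreadMXBean and allocation accounting is
enabled (the default).

```java
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of JMH or the Vector API.
final class AllocationProbe {
    // Workloads can publish objects here so the JIT cannot
    // optimize their allocation away.
    static volatile Object sink;

    // Returns the approximate number of bytes the current thread
    // allocated while running `work`. Assumes HotSpot's
    // com.sun.management.ThreadMXBean extension is available.
    static long allocatedBytes(Runnable work) {
        com.sun.management.ThreadMXBean threads =
            (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        long id = Thread.currentThread().getId();
        long before = threads.getThreadAllocatedBytes(id);
        work.run();
        return threads.getThreadAllocatedBytes(id) - before;
    }
}
```

A warmed-up loop whose vector operations are all intrinsified should report
a value close to zero, while one that boxes vectors should report a value
that grows with the iteration count. The figure is an approximation, so it
is better suited to telling "none" from "a lot" than to exact accounting.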

Regarding the mvp1 vs mvp2 difference you see, I believe it is already
fixed in mainline by JDK-8257165 [1] and JDK-8257057 [2], but those fixes
haven't been merged into the panama/vectorIntrinsics branch yet.

Regarding mvp3, unfortunately, the Vector::slice(int origin, Vector<E> v1)
overload is not intrinsified yet, and the call eventually ends up in
ByteVector::sliceTemplate() [3], which performs a naive copy between the
arrays backing the vectors. So it additionally suffers from box allocation
overhead. Hopefully, it'll be fixed soon.

Best regards,
Vladimir Ivanov

[1] https://bugs.openjdk.java.net/browse/JDK-8257165
[2] https://bugs.openjdk.java.net/browse/JDK-8257057

[3] src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java:

     /*package-private*/
     final
     @ForceInline
     ByteVector sliceTemplate(int origin, Vector<Byte> v1) {
         ByteVector that = (ByteVector) v1;
         that.check(this);
         byte[] a0 = this.vec();
         byte[] a1 = that.vec();
         byte[] res = new byte[a0.length];
         int vlen = res.length;
         int firstPart = vlen - origin;
         System.arraycopy(a0, origin, res, 0, firstPart);
         System.arraycopy(a1, 0, res, firstPart, origin);
         return vectorFactory(res);
     }
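
To make the lane movement concrete, the same two copies can be run on plain
arrays. This is only a sketch of the fallback path quoted above: SliceSketch
and its slice method are hypothetical names, and the argument checks stand
in for the validation the real API performs elsewhere.

```java
// Hypothetical stand-alone model of the array copy in sliceTemplate().
final class SliceSketch {
    // Returns lanes [origin, vlen) of a0 followed by lanes [0, origin) of a1.
    static byte[] slice(byte[] a0, byte[] a1, int origin) {
        if (a0.length != a1.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        if (origin < 0 || origin > a0.length) {
            throw new IndexOutOfBoundsException("origin out of range: " + origin);
        }
        byte[] res = new byte[a0.length];   // fresh array on every call
        int firstPart = a0.length - origin; // lanes taken from a0
        System.arraycopy(a0, origin, res, 0, firstPart);
        System.arraycopy(a1, 0, res, firstPart, origin);
        return res;
    }
}
```

For example, with a0 = {0, 1, 2, 3}, a1 = {10, 11, 12, 13} and origin = 2,
the result is {2, 3, 10, 11}. The new array created on each call is what
shows up as allocation when the operation is not intrinsified.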

On 22.12.2020 13:55, August Nagro wrote:
> Ok, I started benchmarking in JMH piece by piece, and can share two
> findings so far:
> 
> 1. A single branch in the loop decimates performance (way more than c++):
> 
> 5745.155 Ops/sec:
> @Benchmark
> public boolean mvp() {
>    VectorSpecies<Byte> species = ByteVector.SPECIES_128;
> 
>    var res = ByteVector.zero(species);
>    boolean hasNegs = false;
> 
>    int i = 0;
>    for (; i < species.loopBound(buf.length); i += species.length()) {
>      var input = ByteVector.fromArray(species, buf, i);
>      hasNegs = !input.test(IS_NEGATIVE).anyTrue();
>      if (hasNegs) {
>        res = res.lanewise(OR, input);
>      }
>    }
> 
>    return hasNegs && res.test(IS_DEFAULT).allTrue();
> }
> 
> Without a branch, 75733.575 Ops/sec:
> 
> @Benchmark
> public boolean mvp() {
>    VectorSpecies<Byte> species = ByteVector.SPECIES_128;
> 
>    var res = ByteVector.zero(species);
>    boolean hasNegs = false;
> 
>    int i = 0;
>    for (; i < species.loopBound(buf.length); i += species.length()) {
>      var input = ByteVector.fromArray(species, buf, i);
>      hasNegs = !input.test(IS_NEGATIVE).anyTrue();
>      res = res.lanewise(OR, input);
>    }
> 
>    return hasNegs && res.test(IS_DEFAULT).allTrue();
> }
> 
> 
> 2. slice() is very slow, 2669.239 Ops/sec:
> @Benchmark
> public ByteVector mvp() {
>      VectorSpecies<Byte> species = ByteVector.SPECIES_128;
>      ByteVector x = ByteVector.zero(species);
> 
>      int i = 0;
>      for (; i < species.loopBound(buf.length); i += species.length()) {
>        var input = ByteVector.fromArray(species, buf, i);
>        x = x.slice(species.length() - 2, input);
>      }
> 
>      return x;
> }
> 
> 
> There were intrinsic failures for #1, but I must have been reading the
> output wrong, because the same `@ <number>` shows up later and is
> intrinsified.
> 
> - August
> 
> On Tue, Dec 22, 2020 at 1:04 AM August Nagro <augustnagro at gmail.com> wrote:
>>
>> Hello,
>>
>> I've been playing around with the vector api, and am trying to debug
>> some performance problems.
>>
>> I'm on Linux x86, and it seems like some of the ops on ByteVector128
>> are not being intrinsified. When I print the assembly with
>> -XX:+PrintAssembly, I noticed that the ymm registers are not being
>> used much / were close together. After -XX:+PrintIntrinsics I can see
>> that there are some failures among the successes:
>>
>> @ 245   jdk.internal.vm.vector.VectorSupport::binaryOp (36 bytes)
>> failed to inline (intrinsic)
>>
>> @ 52   jdk.internal.vm.vector.VectorSupport::compare (40 bytes) failed
>> to inline (intrinsic)
>>
>> ** missing constant: vclass=ConP etype=ConP vlen=ConI idx=Parm
>>                                @ 16
>> jdk.internal.vm.vector.VectorSupport::extract (35 bytes)   failed to
>> inline (intrinsic)
>>
>> ** missing constant: opr=RShiftI vclass=ConP etype=ConP vlen=ConI
>>                                    @ 106   java.lang.Object::getClass
>> (0 bytes)   (intrinsic)
>>                                    @ 134
>> jdk.internal.vm.vector.VectorSupport::broadcastInt (36 bytes)   failed
>> to inline (intrinsic)
>>
>>
>> Here is one piece of hot code that may be a problem:
>> byteVector128.test(IS_NEGATIVE). Inspecting the source, I see an
>> IS_NEGATIVE test is passed to ByteVector::testTemplate, which calls
>> `bits.compare(LT, 0)`. Eventually it reaches the intrinsified
>> VectorSupport::compare.
>>
>> I have never looked at hotspot intrinsic code before so bear with me.
>> In vmIntrinsics.hpp, _VectorCompare is the name of the template. I
>> don't understand where to go from here. However, I did notice that
>> share/prims/vectorSupport.hpp is missing BT_lt and the other
>> comparison constants that are in VectorSupport.java.
>>
>> Am I on the right path here? And finally, is there a way to tell if
>> Vector boxing is occurring?
>>
>> Regards,
>>
>> August