Vector Intrinsics & Boxing Problems?
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Tue Dec 22 23:01:40 UTC 2020
Hi August,
Thanks a lot for the benchmarks.
I gave it a try with latest jdk16 and observed the following numbers:
mvp1 12145580.543 ± 462197.858 ops/s
mvp1:·gc.alloc.rate ≈ 10⁻⁴ MB/sec
mvp2 11813978.510 ± 1063694.922 ops/s
mvp2:·gc.alloc.rate ≈ 10⁻⁴ MB/sec
mvp3 171990.456 ± 41828.282 ops/s
mvp3:·gc.alloc.rate 566.613 ± 138.728 MB/sec
The numbers for mvp1 and mvp2 are comparable and both benchmarks don't
suffer from boxing. Additionally, I took a look at the inlining log and
all the operations are inlined/intrinsified nicely.
Regarding mvp1 vs mvp2 difference you see, I believe it is already fixed
in the mainline by JDK-8257165 [1] and JDK-8257057 [2], but hasn't been
merged into panama/vectorIntrinsics branch yet.
Regarding mvp3, unfortunately, Vector::slice(int origin, Vector<E> v1)
overload is not intrinsified yet and the call eventually ends in
ByteVector::sliceTemplate() [3] which performs naive copy between the
arrays backing vectors. So, it additionally suffers from box allocation
overhead. Hopefully, it'll be fixed soon.
Best regards,
Vladimir Ivanov
[1] https://bugs.openjdk.java.net/browse/JDK-8257165
[2] https://bugs.openjdk.java.net/browse/JDK-8257057
[3]
src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java:
/*package-private*/
final
@ForceInline
ByteVector sliceTemplate(int origin, Vector<Byte> v1) {
ByteVector that = (ByteVector) v1;
that.check(this);
byte[] a0 = this.vec();
byte[] a1 = that.vec();
byte[] res = new byte[a0.length];
int vlen = res.length;
int firstPart = vlen - origin;
System.arraycopy(a0, origin, res, 0, firstPart);
System.arraycopy(a1, 0, res, firstPart, origin);
return vectorFactory(res);
}
On 22.12.2020 13:55, August Nagro wrote:
> Ok, I started benchmarking in JMH piece by piece, and can share two
> findings so far:
>
> 1. A single branch in the loop decimates performance (way more than c++):
>
> 5745.155 Ops/sec:
> @Benchmark
> public boolean mvp() {
> VectorSpecies<Byte> species = ByteVector.SPECIES_128;
>
> var res = ByteVector.zero(species);
> boolean hasNegs = false;
>
> int i = 0;
> for (; i < species.loopBound(buf.length); i += species.length()) {
> var input = ByteVector.fromArray(species, buf, i);
> hasNegs = !input.test(IS_NEGATIVE).anyTrue();
> if (hasNegs) {
> res = res.lanewise(OR, input);
> }
> }
>
> return hasNegs && res.test(IS_DEFAULT).allTrue();
> }
>
> Without a branch, 75733.575 Ops/sec:
>
> @Benchmark
> public boolean mvp() {
> VectorSpecies<Byte> species = ByteVector.SPECIES_128;
>
> var res = ByteVector.zero(species);
> boolean hasNegs = false;
>
> int i = 0;
> for (; i < species.loopBound(buf.length); i += species.length()) {
> var input = ByteVector.fromArray(species, buf, i);
> hasNegs = !input.test(IS_NEGATIVE).anyTrue();
> res = res.lanewise(OR, input);
> }
>
> return hasNegs && res.test(IS_DEFAULT).allTrue();
> }
>
>
> 2. slice() is very slow, 2669.239 Ops/sec:
> @Benchmark
> public boolean mvp() {
> VectorSpecies<Byte> species = ByteVector.SPECIES_128;
> ByteVector x = ByteVector.zero(species);
>
> int i = 0;
> for (; i < species.loopBound(buf.length); i += species.length()) {
> var input = ByteVector.fromArray(species, buf, i);
> x = x.slice(species.length() - 2, input);
> }
>
> return x;
> }
>
>
> There were intrinsic failures for #1, but I must have been reading the
> output wrong, because the same `@ <number>` shows up later and is
> intensified.
>
> - August
>
> On Tue, Dec 22, 2020 at 1:04 AM August Nagro <augustnagro at gmail.com> wrote:
>>
>> Hello,
>>
>> I've been playing around with the vector api, and am trying to debug
>> some performance problems.
>>
>> I'm on Linux x86, and it seems like some of the ops on ByteVector128
>> are not being intrinsified. When I print the assembly with
>> -XX:+PrintAssembly, I noticed that the ymm registers are not being
>> used much / were close together. After -XX:+PrintIntrinsics I can see
>> that there are some failures among the successes:
>>
>> @ 245 jdk.internal.vm.vector.VectorSupport::binaryOp (36 bytes)
>> failed to inline (intrinsic)
>>
>> @ 52 jdk.internal.vm.vector.VectorSupport::compare (40 bytes) failed
>> to inline (intrinsic)
>>
>> ** missing constant: vclass=ConP etype=ConP vlen=ConI idx=Parm
>> @ 16
>> jdk.internal.vm.vector.VectorSupport::extract (35 bytes) failed to
>> inline (intrinsic)
>>
>> ** missing constant: opr=RShiftI vclass=ConP etype=ConP vlen=ConI
>> @ 106 java.lang.Object::getClass
>> (0 bytes) (intrinsic)
>> @ 134
>> jdk.internal.vm.vector.VectorSupport::broadcastInt (36 bytes) failed
>> to inline (intrinsic)
>>
>>
>> Here is one hot code that may be a problem:
>> byteVector128.test(IS_NEGATIVE). Inspecting the source, I see an
>> IS_NEGATIVE test is passed to ByteVector::testTemplate, which calls
>> `bits.compare(LT, 0)`. Eventually it reaches the intensified
>> VectorSupport::compare.
>>
>> I have never looked at hotspot intrinsic code before so bear with me.
>> In vmIntrinsics.hpp, _VectorCompare is the name of the template. I
>> don't understand where to go from here. However, I did notice file
>> share/prims/vectorSupport.hpp, which is missing the BT_lt and other
>> comparison constants in VectorSupport.java.
>>
>> Am I on the right path here? And finally, is there a way to tell if
>> Vector boxing is occurring?
>>
>> Regards,
>>
>> August
More information about the panama-dev
mailing list