Vector Intrinsics & Boxing Problems?
August Nagro
augustnagro at gmail.com
Tue Dec 22 10:55:27 UTC 2020
Ok, I started benchmarking in JMH piece by piece, and can share two
findings so far:
1. A single branch in the loop decimates performance (far more than it
would in C++). With the branch, 5745.155 Ops/sec:
@Benchmark
public boolean mvp() {
    VectorSpecies<Byte> species = ByteVector.SPECIES_128;
    var res = ByteVector.zero(species);
    boolean hasNegs = false;
    int i = 0;
    for (; i < species.loopBound(buf.length); i += species.length()) {
        var input = ByteVector.fromArray(species, buf, i);
        // note the negation: true when no lane of input is negative
        hasNegs = !input.test(IS_NEGATIVE).anyTrue();
        if (hasNegs) {
            res = res.lanewise(OR, input);
        }
    }
    return hasNegs && res.test(IS_DEFAULT).allTrue();
}
Without a branch, 75733.575 Ops/sec:
@Benchmark
public boolean mvp() {
    VectorSpecies<Byte> species = ByteVector.SPECIES_128;
    var res = ByteVector.zero(species);
    boolean hasNegs = false;
    int i = 0;
    for (; i < species.loopBound(buf.length); i += species.length()) {
        var input = ByteVector.fromArray(species, buf, i);
        // note the negation: true when no lane of input is negative
        hasNegs = !input.test(IS_NEGATIVE).anyTrue();
        res = res.lanewise(OR, input);
    }
    return hasNegs && res.test(IS_DEFAULT).allTrue();
}
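As a sanity check on the two variants above, here is a minimal scalar sketch of what each loop computes, with no Vector API involved. The class name ScalarMvpModel, the `branch` flag, and the fixed lane count of 16 (the lane count of ByteVector.SPECIES_128) are assumptions made for illustration only. It models IS_NEGATIVE as `lane < 0` and IS_DEFAULT as `lane == 0`.

```java
import java.util.Arrays;

public class ScalarMvpModel {
    // Assumed lane count: a 128-bit byte vector has 16 lanes.
    static final int LANES = 16;

    /** Scalar model of the benchmark loop; branch=true models the variant with the if. */
    public static boolean mvp(byte[] buf, boolean branch) {
        byte[] res = new byte[LANES];                   // models ByteVector.zero(species)
        boolean hasNegs = false;
        int bound = buf.length - (buf.length % LANES);  // models species.loopBound(buf.length)
        for (int i = 0; i < bound; i += LANES) {
            boolean anyNegative = false;                // models input.test(IS_NEGATIVE).anyTrue()
            for (int l = 0; l < LANES; l++) {
                if (buf[i + l] < 0) {
                    anyNegative = true;
                }
            }
            hasNegs = !anyNegative;                     // same negation as in the benchmark
            if (!branch || hasNegs) {
                for (int l = 0; l < LANES; l++) {
                    res[l] |= buf[i + l];               // models res.lanewise(OR, input)
                }
            }
        }
        boolean allZero = true;                         // models res.test(IS_DEFAULT).allTrue()
        for (byte b : res) {
            if (b != 0) {
                allZero = false;
            }
        }
        return hasNegs && allZero;
    }

    public static void main(String[] args) {
        // First 16 bytes negative, last 16 bytes zero.
        byte[] buf = new byte[2 * LANES];
        Arrays.fill(buf, 0, LANES, (byte) -1);
        System.out.println(mvp(buf, true));   // with the branch
        System.out.println(mvp(buf, false));  // without the branch
    }
}
```

On this input the model returns true with the branch and false without it, so the two benchmarks are not strictly equivalent: they only agree when the skipped vectors would not have changed the final result. That is worth keeping in mind when comparing their throughput.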
2. slice() is very slow, 2669.239 Ops/sec:
@Benchmark
public ByteVector mvp() {
    VectorSpecies<Byte> species = ByteVector.SPECIES_128;
    ByteVector x = ByteVector.zero(species);
    int i = 0;
    for (; i < species.loopBound(buf.length); i += species.length()) {
        var input = ByteVector.fromArray(species, buf, i);
        // last 2 lanes of x followed by the first 14 lanes of input
        x = x.slice(species.length() - 2, input);
    }
    return x;
}
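For reference, the lane movement that `x.slice(species.length() - 2, input)` performs can be sketched in scalar form: the result is a window of lanes taken from the concatenation of x followed by input, starting at the origin lane. The class ScalarSliceModel and its method are hypothetical names used only for this illustration. They are not part of the Vector API and say nothing about how the JIT compiles the real call.

```java
public class ScalarSliceModel {
    /**
     * Scalar model of a.slice(origin, b) for two vectors of equal length:
     * lane i of the result is lane (origin + i) of the concatenation a ++ b.
     */
    public static byte[] slice(byte[] a, int origin, byte[] b) {
        int n = a.length;
        byte[] out = new byte[n];
        for (int i = 0; i < n; i++) {
            int j = origin + i;
            out[i] = (j < n) ? a[j] : b[j - n];
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] x = new byte[16];
        byte[] input = new byte[16];
        for (int i = 0; i < 16; i++) {
            x[i] = (byte) i;             // 0..15
            input[i] = (byte) (100 + i); // 100..115
        }
        // Same origin as the benchmark: species.length() - 2 = 14.
        byte[] r = slice(x, 14, input);
        System.out.println(java.util.Arrays.toString(r));
    }
}
```

With these inputs the result starts with lanes 14 and 15 of x and continues with lanes 0 through 13 of input, which is the two-byte carry-over the benchmark loop is exercising.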
There were intrinsic failures for #1, but I must have been reading the
output wrong, because the same `@ <number>` shows up later and is
intrinsified.
- August
On Tue, Dec 22, 2020 at 1:04 AM August Nagro <augustnagro at gmail.com> wrote:
>
> Hello,
>
> I've been playing around with the Vector API, and am trying to debug
> some performance problems.
>
> I'm on Linux x86, and it seems like some of the ops on ByteVector128
> are not being intrinsified. When I print the assembly with
> -XX:+PrintAssembly, I noticed that the ymm registers are not being
> used much / were close together. After -XX:+PrintIntrinsics I can see
> that there are some failures among the successes:
>
> @ 245 jdk.internal.vm.vector.VectorSupport::binaryOp (36 bytes) failed to inline (intrinsic)
> @ 52 jdk.internal.vm.vector.VectorSupport::compare (40 bytes) failed to inline (intrinsic)
> ** missing constant: vclass=ConP etype=ConP vlen=ConI idx=Parm
> @ 16 jdk.internal.vm.vector.VectorSupport::extract (35 bytes) failed to inline (intrinsic)
> ** missing constant: opr=RShiftI vclass=ConP etype=ConP vlen=ConI
> @ 106 java.lang.Object::getClass (0 bytes) (intrinsic)
> @ 134 jdk.internal.vm.vector.VectorSupport::broadcastInt (36 bytes) failed to inline (intrinsic)
>
>
> Here is one hot piece of code that may be a problem:
> byteVector128.test(IS_NEGATIVE). Inspecting the source, I see that an
> IS_NEGATIVE test is passed to ByteVector::testTemplate, which calls
> `bits.compare(LT, 0)`. Eventually it reaches the intrinsified
> VectorSupport::compare.
>
> I have never looked at HotSpot intrinsic code before, so bear with me.
> In vmIntrinsics.hpp, _VectorCompare is the name of the template. I
> don't understand where to go from here. However, I did notice that
> share/prims/vectorSupport.hpp is missing BT_lt and the other
> comparison constants defined in VectorSupport.java.
>
> Am I on the right path here? And finally, is there a way to tell if
> Vector boxing is occurring?
>
> Regards,
>
> August
More information about the panama-dev
mailing list