Slow code due to many AbstractVector::check and AbstractShuffle::checkIndexes checks in C2
Paul Sandoz
paul.sandoz at oracle.com
Tue May 26 20:30:46 UTC 2020
Thanks, very helpful.
I modified the vector code to do this:
    @ForceInline
    public final VectorShuffle<E> checkIndexes() {
        if (VectorIntrinsics.VECTOR_ACCESS_OOB_CHECK == 0) {
            return this;
        }
        // FIXME: vectorize this
        for (int index : reorder()) {
            if (index < 0) {
                throw checkIndexFailed(index, length());
            }
        }
        return this;
    }
It’s definitely the bounds check for the shuffle that’s causing the issue.
If I run the benchmark for mul128LoopArr with bounds checks disabled then the hot inner loop (from the dtraceasm profiler) is:
4.12% ↗ 0x000000010e680cf0: shl $0x2,%r11d
0.92% │ 0x000000010e680cf4: mov %r11d,%r9d
2.25% │ 0x000000010e680cf7: add $0x4,%r9d
1.78% │ 0x000000010e680cfb: mov %r8d,%r11d
5.18% │↗ 0x000000010e680cfe: vmovdqu 0x10(%r10,%r9,4),%xmm3
3.31% ││ 0x000000010e680d05: vpermd %ymm3,%ymm11,%ymm4
2.27% ││ 0x000000010e680d0a: vpermd %ymm3,%ymm2,%ymm5
1.94% ││ 0x000000010e680d0f: vmulps %xmm7,%xmm4,%xmm4
4.94% ││ 0x000000010e680d13: vpermd %ymm3,%ymm1,%ymm6
2.98% ││ 0x000000010e680d18: vpermd %ymm3,%ymm0,%ymm3
2.67% ││ 0x000000010e680d1d: vfmadd231ps %xmm10,%xmm3,%xmm4
2.14% ││ 0x000000010e680d22: vfmadd231ps %xmm9,%xmm6,%xmm4
5.10% ││ 0x000000010e680d27: vfmadd231ps %xmm8,%xmm5,%xmm4
4.41% ││ 0x000000010e680d2c: vmovdqu %xmm4,0x10(%r10,%r9,4)
4.63% ││ 0x000000010e680d33: mov %r11d,%r8d
1.59% ││ 0x000000010e680d36: inc %r8d
4.06% ││ 0x000000010e680d39: nopl 0x0(%rax)
1.84% ││ 0x000000010e680d40: cmp $0x4,%r8d
╰│ 0x000000010e680d44: jl 0x000000010e680cf0
And it beats the scalar and JNI versions.
(Note: the vector load addressing is not optimal)
We “just” need to optimize the shuffle. I would expect that if we are using constant shuffles, the bounds checks can be completely elided.
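
For example, something like the following sketch (class and field names
are just illustrative; the API calls are from jdk.incubator.vector). The
shuffle indexes are compile-time constants and all in range, so the check
inside rearrange should constant-fold away:

    import jdk.incubator.vector.FloatVector;
    import jdk.incubator.vector.VectorShuffle;
    import jdk.incubator.vector.VectorSpecies;

    class ConstantShuffleSketch {
        static final VectorSpecies<Float> S = FloatVector.SPECIES_128;
        // Constant shuffle: every index is a known-in-range constant.
        static final VectorShuffle<Float> SPLAT_LANE_0 =
                VectorShuffle.fromValues(S, 0, 0, 0, 0);

        static void splatMul(float[] m, float[] v) {
            FloatVector col = FloatVector.fromArray(S, m, 0);
            // The index check in rearrange is redundant here and should
            // be elided once the shuffle is treated as a constant.
            FloatVector x = FloatVector.fromArray(S, v, 0)
                                       .rearrange(SPLAT_LANE_0);
            col.mul(x).intoArray(m, 0);
        }
    }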
Paul.
> On May 23, 2020, at 2:38 PM, John Rose <john.r.rose at oracle.com> wrote:
>
> On May 23, 2020, at 3:33 AM, Kai Burjack <kburjack at googlemail.com> wrote:
>>
>> Hi Paul, hi John,
>>
>> thanks for getting back to me about it!
>>
>> I've prepared a standard Maven JMH benchmark under:
>> https://github.com/JOML-CI/panama-vector-bench
>> The README.md contains my current results with as
>> much optimization as I could cram out of the code for my
>> test CPU.
>>
>> I always test from the current tip of the vectorIntrinsics branch of:
>> https://github.com/openjdk/panama-vector/tree/vectorIntrinsics
>> as it can be nicely shallow-cloned in a few seconds.
>>
>> The results I gave before were based on the
>> "[vector] Undo workaround fix" commit.
>>
>> It'd be nice if, at some point in the future, the vectorized algorithms
>> were faster than their scalar counterparts in those benchmarks.
>>
>> Thanks for looking into it!
>
> Thanks for the extra data. (Replying to panama-dev to get
> it logged.)
>
>> Would it be possible to simply expose the vector species,
>> like Float128Vector statically to user code so as not having to
>> call vspecies() and drag the actual species as runtime information
>> through the template code in FloatVector? That way, the JIT
>> would statically know that the user is really only working with
>> a particular vector species and can optimize for it?
>
> The JIT is smart and can do that already. If it fails to do
> so in a particular case, there may be a bug in the JIT,
> but we expect that any code path which uses just one
> kind of vector will “sniff out” the exact type of that vector
> and DTRT without the user knowing the name of that
> exact type.
>
> This expectation extends even to vector-species-polymorphic
> algorithms, as long as either (a) they are inlined or (b) they
> are used, dynamically, on only one species at a time. We
> are thinking about additional techniques which would lift
> even those restrictions, in the setting of further optimizations
> for streams, and eventually streams-over-vectors.
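>
> As a concrete sketch (illustrative only, written against the
> jdk.incubator.vector API), a species-polymorphic loop like this
> should specialize to the one species it actually sees:
>
>     import jdk.incubator.vector.FloatVector;
>     import jdk.incubator.vector.VectorOperators;
>     import jdk.incubator.vector.VectorSpecies;
>
>     class SpeciesPolymorphicSketch {
>         // Written against the abstract species; when inlined, or when
>         // only one species flows in dynamically, the JIT sees the
>         // exact vector type and specializes the loop body.
>         static float sum(VectorSpecies<Float> s, float[] a) {
>             FloatVector acc = FloatVector.zero(s);
>             int i = 0;
>             for (; i <= a.length - s.length(); i += s.length()) {
>                 acc = acc.add(FloatVector.fromArray(s, a, i));
>             }
>             float r = acc.reduceLanes(VectorOperators.ADD);
>             for (; i < a.length; i++) {
>                 r += a[i];  // scalar tail
>             }
>             return r;
>         }
>     }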
>
>> I am very sure there is a reason for the current design.
>
> Yep. One reason is complexity: We are willing to burn
> in 1+N type names (to cover N lane types) but not 1+N*(1+M)
> type names (to cover M shapes). Another reason is to
> encourage programmers to avoid static dependencies on
> particular species; this will (we think) lead to more portable
> code. Yet another reason, building on that, is that we don’t
> at this time know all of the shapes we will be supporting
> over time. The existing VectorShape enum reflects current
> hardware and software assumptions, and is very very likely
> to expand over time.
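>
> (Concretely: with N = 6 lane types and M = 5 shapes today, that
> works out to 7 public type names instead of 37.)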
>
> — John