Slow code due to many AbstractVector::check and AbstractShuffle::checkIndexes checks in C2
Kai Burjack
kburjack at googlemail.com
Tue May 26 20:50:42 UTC 2020
Thanks, Paul. That is good to hear!
I am looking forward to and will definitely test any kind of optimizations
being put into vectorIntrinsics.
Though I totally understand if you guys will probably focus more on
optimizing shape-agnostic vector algorithms that are easy to JIT-optimize
for future hardware without modifying user code, like wide copies or
lane-wise operations, for which shuffle/rearrange is probably not that
important.
Kai.
On Tue, May 26, 2020 at 22:30, Paul Sandoz <paul.sandoz at oracle.com> wrote:
> Thanks, very helpful.
>
> I modified the vector code to do this:
>
> @ForceInline
> public final VectorShuffle<E> checkIndexes() {
>     if (VectorIntrinsics.VECTOR_ACCESS_OOB_CHECK == 0) {
>         return this;
>     }
>     // FIXME: vectorize this
>     for (int index : reorder()) {
>         if (index < 0) {
>             throw checkIndexFailed(index, length());
>         }
>     }
>     return this;
> }
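>
> (VECTOR_ACCESS_OOB_CHECK is, as far as I know, read from the
> jdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK system property, so
> setting it to 0 disables the check entirely. As a sketch of the FIXME
> above, and assuming a 4-lane int species matching the shuffle length,
> the scalar loop could become a single lane-wise compare; this is an
> illustration, not the actual implementation:)
>
> // Hypothetical vectorized form of the check above; assumes the
> // shuffle has 4 lanes (IntVector.SPECIES_128) and reuses the
> // surrounding class's reorder(), length() and checkIndexFailed().
> int[] reorder = reorder();
> IntVector indexes = IntVector.fromArray(IntVector.SPECIES_128, reorder, 0);
> if (indexes.compare(VectorOperators.LT, 0).anyTrue()) {
>     // a MIN reduction recovers one of the offending (negative) indexes
>     throw checkIndexFailed(indexes.reduceLanes(VectorOperators.MIN), length());
> }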
>
> It’s definitely the bounds check for the shuffle that’s causing the issue.
>
> If I run the benchmark for mul128LoopArr with bounds checks disabled, then
> the hot inner loop (from the dtraceasm profiler) is:
>
> 4.12% ↗ 0x000000010e680cf0: shl $0x2,%r11d
> 0.92% │ 0x000000010e680cf4: mov %r11d,%r9d
> 2.25% │ 0x000000010e680cf7: add $0x4,%r9d
> 1.78% │ 0x000000010e680cfb: mov %r8d,%r11d
> 5.18% │↗ 0x000000010e680cfe: vmovdqu 0x10(%r10,%r9,4),%xmm3
> 3.31% ││ 0x000000010e680d05: vpermd %ymm3,%ymm11,%ymm4
> 2.27% ││ 0x000000010e680d0a: vpermd %ymm3,%ymm2,%ymm5
> 1.94% ││ 0x000000010e680d0f: vmulps %xmm7,%xmm4,%xmm4
> 4.94% ││ 0x000000010e680d13: vpermd %ymm3,%ymm1,%ymm6
> 2.98% ││ 0x000000010e680d18: vpermd %ymm3,%ymm0,%ymm3
> 2.67% ││ 0x000000010e680d1d: vfmadd231ps %xmm10,%xmm3,%xmm4
> 2.14% ││ 0x000000010e680d22: vfmadd231ps %xmm9,%xmm6,%xmm4
> 5.10% ││ 0x000000010e680d27: vfmadd231ps %xmm8,%xmm5,%xmm4
> 4.41% ││ 0x000000010e680d2c: vmovdqu %xmm4,0x10(%r10,%r9,4)
> 4.63% ││ 0x000000010e680d33: mov %r11d,%r8d
> 1.59% ││ 0x000000010e680d36: inc %r8d
> 4.06% ││ 0x000000010e680d39: nopl 0x0(%rax)
> 1.84% ││ 0x000000010e680d40: cmp $0x4,%r8d
> ╰│ 0x000000010e680d44: jl 0x000000010e680cf0
>
> And it beats the scalar and JNI versions.
> (Note: the vector load addressing is not optimal)
>
> We “just" need to optimize the shuffle. I would expect if we are using
> constant shuffles that the bounds checks can be completely elided.
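>
> (For context, the hot loop above is a 4x4 single-precision matrix
> multiply. A minimal sketch of such a kernel, assuming column-major
> float[16] matrices and the constant splat shuffles that the four
> vpermd instructions correspond to; an illustration, not the actual
> benchmark code:)
>
> import jdk.incubator.vector.*;
>
> static final VectorSpecies<Float> S = FloatVector.SPECIES_128;
> // Constant shuffles that broadcast lane k to all four lanes
> static final VectorShuffle<Float> B0 = VectorShuffle.fromValues(S, 0, 0, 0, 0);
> static final VectorShuffle<Float> B1 = VectorShuffle.fromValues(S, 1, 1, 1, 1);
> static final VectorShuffle<Float> B2 = VectorShuffle.fromValues(S, 2, 2, 2, 2);
> static final VectorShuffle<Float> B3 = VectorShuffle.fromValues(S, 3, 3, 3, 3);
>
> // b = a * b, both column-major 4x4 matrices
> static void mul4x4(float[] a, float[] b) {
>     FloatVector a0 = FloatVector.fromArray(S, a, 0);   // columns of a
>     FloatVector a1 = FloatVector.fromArray(S, a, 4);
>     FloatVector a2 = FloatVector.fromArray(S, a, 8);
>     FloatVector a3 = FloatVector.fromArray(S, a, 12);
>     for (int c = 0; c < 4; c++) {
>         FloatVector bc = FloatVector.fromArray(S, b, c * 4); // column c of b
>         FloatVector r  = a0.mul(bc.rearrange(B0));           // one mul ...
>         r = a1.fma(bc.rearrange(B1), r);                     // ... and three
>         r = a2.fma(bc.rearrange(B2), r);                     // fmas, as in the
>         r = a3.fma(bc.rearrange(B3), r);                     // listing above
>         r.intoArray(b, c * 4);
>     }
> }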
>
> Paul.
>
>
> > On May 23, 2020, at 2:38 PM, John Rose <john.r.rose at oracle.com> wrote:
> >
> > On May 23, 2020, at 3:33 AM, Kai Burjack <kburjack at googlemail.com> wrote:
> >>
> >> Hi Paul, hi John,
> >>
> >> thanks for getting back to me about it!
> >>
> >> I've prepared a standard Maven JMH benchmark under:
> >> https://github.com/JOML-CI/panama-vector-bench
> >> The README.md contains my current results with as
> >> much optimization as I could cram out of the code for my
> >> test CPU.
> >>
> >> I always test from the current tip of the vectorIntrinsics branch of:
> >> https://github.com/openjdk/panama-vector/tree/vectorIntrinsics
> >> as it can be nicely shallow-cloned in a few seconds.
> >>
> >> The results I gave before were based on the
> >> "[vector] Undo workaround fix" commit.
> >>
> >> It'd be nice if, at some point in the future, every vectorized algorithm
> >> in those benchmarks were faster than the scalar one.
> >>
> >> Thanks for looking into it!
> >
> > Thanks for the extra data. (Replying to panama-dev to get
> > it logged.)
> >
> >> Would it be possible to simply expose the vector species,
> >> like Float128Vector statically to user code so as not having to
> >> call vspecies() and drag the actual species as runtime information
> >> through the template code in FloatVector? That way, the JIT
> >> would statically know that the user is really only working with
> >> a particular vector species and can optimize for it?
> >
> > The JIT is smart and can do that already. If it fails to do
> > so in a particular case, there may be a bug in the JIT,
> > but we expect that any code path which uses just one
> > kind of vector will “sniff out” the exact type of that vector
> > and DTRT without the user knowing the name of that
> > exact type.
> >
> > This expectation extends even to vector-species-polymorphic
> > algorithms, as long as either (a) they are inlined or (b) they
> > are used, dynamically, on only one species at a time. We
> > are thinking about additional techniques which would lift
> > even those restrictions, in the setting of further optimizations
> > for streams, and eventually streams-over-vectors.
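> >
> > (To illustrate that expectation: a loop written against the preferred
> > species never names a concrete type such as Float128Vector, yet the
> > JIT can specialize it to the exact vector class in use. A minimal
> > sketch, assuming SPECIES_PREFERRED resolves to a single species at
> > runtime:)
> >
> > import jdk.incubator.vector.*;
> >
> > static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;
> >
> > // Species-agnostic scaling; no concrete vector type appears.
> > static void scale(float[] a, float factor) {
> >     int i = 0;
> >     int upper = SPECIES.loopBound(a.length);
> >     for (; i < upper; i += SPECIES.length()) {
> >         FloatVector.fromArray(SPECIES, a, i)
> >                    .mul(factor)
> >                    .intoArray(a, i);
> >     }
> >     for (; i < a.length; i++) {
> >         a[i] *= factor;   // scalar tail
> >     }
> > }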
> >
> >> I am very sure there is a reason for the current design.
> >
> > Yep. One reason is complexity: We are willing to burn
> > in 1+N type names (to cover N lane types) but not 1+N*(1+M)
> > type names (to cover M shapes). Another reason is to
> > encourage programmers to avoid static dependencies on
> > particular species; this will (we think) lead to more portable
> > code. Yet another reason, building on that, is that we don’t
> > at this time know all of the shapes we will be supporting
> > over time. The existing VectorShape enum reflects current
> > hardware and software assumptions, and is very very likely
> > to expand over time.
> >
> > — John
>
>