Slow code due to many AbstractVector::check and AbstractShuffle::checkIndexes checks in C2

Kai Burjack kburjack at googlemail.com
Thu Sep 3 14:05:54 UTC 2020


Hi Paul,

are there any plans to include this performance improvement?

So far, for my use case of 4x4 matrix-matrix and matrix-vector
multiplications and matrix inversions, all of which rely heavily on shuffling,
the current state of the Vector API (commit 80dc9b64b58e from Mon, 31 Aug
2020 15:48:11 -0700) provides no performance benefit over scalar code.
Tested with: https://github.com/JOML-CI/panama-vector-bench
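For reference, the scalar baseline that such a benchmark compares against is not shown in the mail; a minimal sketch of one (assuming a column-major 4x4 matrix stored in a float[16], the layout JOML uses — the class and method names here are illustrative, not the actual benchmark code) might look like:

```java
// Hypothetical scalar baseline: 4x4 matrix * 4-component vector,
// column-major layout (assumption). Each result component is a dot
// product of one matrix row with the input vector -- the work the
// shuffle-based Vector API version has to beat.
public class Mat4Scalar {
    // m is a column-major 4x4 matrix in a float[16]; v is a float[4].
    public static float[] transform(float[] m, float[] v) {
        float[] dest = new float[4];
        for (int row = 0; row < 4; row++) {
            dest[row] = m[row]      * v[0]   // column 0
                      + m[4 + row]  * v[1]   // column 1
                      + m[8 + row]  * v[2]   // column 2
                      + m[12 + row] * v[3];  // column 3
        }
        return dest;
    }
}
```

The vectorized variant replaces the four per-row multiplies with broadcast shuffles and fused multiply-adds over whole columns, which is why shuffle overhead dominates its profile.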

Thank you!

On Tue, May 26, 2020 at 22:30, Paul Sandoz <
paul.sandoz at oracle.com> wrote:

> Thanks, very helpful.
>
> I modified the vector code to do this:
>
> @ForceInline
> public final VectorShuffle<E> checkIndexes() {
>     if (VectorIntrinsics.VECTOR_ACCESS_OOB_CHECK == 0) {
>         return this;
>     }
>     // FIXME: vectorize this
>     for (int index : reorder()) {
>         if (index < 0) {
>             throw checkIndexFailed(index, length());
>         }
>     }
>     return this;
> }
>
> It’s definitely the bounds check for the shuffle that’s causing the issue.
>
> If I run the benchmark for mul128LoopArr with bounds checks disabled then
> the hot inner loop (from the dtraceasm profiler) is:
>
>   4.12%  ↗   0x000000010e680cf0:   shl    $0x2,%r11d
>   0.92%  │   0x000000010e680cf4:   mov    %r11d,%r9d
>   2.25%  │   0x000000010e680cf7:   add    $0x4,%r9d
>   1.78%  │   0x000000010e680cfb:   mov    %r8d,%r11d
>   5.18%  │↗  0x000000010e680cfe:   vmovdqu 0x10(%r10,%r9,4),%xmm3
>   3.31%  ││  0x000000010e680d05:   vpermd %ymm3,%ymm11,%ymm4
>   2.27%  ││  0x000000010e680d0a:   vpermd %ymm3,%ymm2,%ymm5
>   1.94%  ││  0x000000010e680d0f:   vmulps %xmm7,%xmm4,%xmm4
>   4.94%  ││  0x000000010e680d13:   vpermd %ymm3,%ymm1,%ymm6
>   2.98%  ││  0x000000010e680d18:   vpermd %ymm3,%ymm0,%ymm3
>   2.67%  ││  0x000000010e680d1d:   vfmadd231ps %xmm10,%xmm3,%xmm4
>   2.14%  ││  0x000000010e680d22:   vfmadd231ps %xmm9,%xmm6,%xmm4
>   5.10%  ││  0x000000010e680d27:   vfmadd231ps %xmm8,%xmm5,%xmm4
>   4.41%  ││  0x000000010e680d2c:   vmovdqu %xmm4,0x10(%r10,%r9,4)
>   4.63%  ││  0x000000010e680d33:   mov    %r11d,%r8d
>   1.59%  ││  0x000000010e680d36:   inc    %r8d
>   4.06%  ││  0x000000010e680d39:   nopl   0x0(%rax)
>   1.84%  ││  0x000000010e680d40:   cmp    $0x4,%r8d
>          ╰│  0x000000010e680d44:   jl     0x000000010e680cf0
>
> And it beats the scalar and JNI versions.
> (Note: the vector load addressing is not optimal)
>
> We “just” need to optimize the shuffle. I would expect that if we are using
> constant shuffles, the bounds checks can be completely elided.
>
> Paul.
>
>
> > On May 23, 2020, at 2:38 PM, John Rose <john.r.rose at oracle.com> wrote:
> >
> > On May 23, 2020, at 3:33 AM, Kai Burjack <kburjack at googlemail.com> wrote:
> >>
> >> Hi Paul, hi John,
> >>
> >> thanks for getting back to me about it!
> >>
> >> I've prepared a standard Maven JMH benchmark under:
> >> https://github.com/JOML-CI/panama-vector-bench
> >> The README.md contains my current results with as
> >> much optimization as I could cram out of the code for my
> >> test CPU.
> >>
> >> I always test from the current tip of the vectorIntrinsics branch of:
> >> https://github.com/openjdk/panama-vector/tree/vectorIntrinsics
> >> as it can be nicely shallow-cloned in a few seconds.
> >>
> >> The results I gave before were based on the
> >> "[vector] Undo workaround fix" commit.
> >>
> >> It'd be nice if, at some point in the future, every vectorized algorithm
> >> were faster than its scalar counterpart in those benchmarks.
> >>
> >> Thanks for looking into it!
> >
> > Thanks for the extra data.  (Replying to panama-dev to get
> > it logged.)
> >
> >> Would it be possible to simply expose the vector species,
> >> like Float128Vector statically to user code so as not having to
> >> call vspecies() and drag the actual species as runtime information
> >> through the template code in FloatVector? That way, the JIT
> >> would statically know that the user is really only working with
> >> a particular vector species and can optimize for it?
> >
> > The JIT is smart and can do that already.  If it fails to do
> > so in a particular case, there may be a bug in the JIT,
> > but we expect that any code path which uses just one
> > kind of vector will “sniff out” the exact type of that vector
> > and DTRT without the user knowing the name of that
> > exact type.
> >
> > This expectation extends even to vector-species-polymorphic
> > algorithms, as long as either (a) they are inlined or (b) they
> > are used, dynamically, on only one species at a time.  We
> > are thinking about additional techniques which would lift
> > even those restrictions, in the setting of further optimizations
> > for streams, and eventually streams-over-vectors.
> >
> >> I am very sure there is a reason for the current design.
> >
> > Yep.  One reason is complexity:  We are willing to burn
> > in 1+N type names (to cover N lane types) but not 1+N*(1+M)
> > type names (to cover M shapes).  Another reason is to
> > encourage programmers to avoid static dependencies on
> > particular species; this will (we think) lead to more portable
> > code.  Yet another reason, building on that, is that we don’t
> > at this time know all of the shapes we will be supporting
> > over time.  The existing VectorShape enum reflects current
> > hardware and software assumptions, and is very very likely
> > to expand over time.
> >
> > — John
>
>
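Paul's closing point — that constant shuffles should let the bounds checks be elided entirely — can be sketched roughly as follows. This is a hedged sketch, not the benchmark's actual code: it uses the jdk.incubator.vector API in the shape later shipped with JEP 338, which may differ in detail from the 2020 vectorIntrinsics branch, and the broadcast patterns are assumptions mirroring the vpermd instructions in the disassembly above.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorShuffle;
import jdk.incubator.vector.VectorSpecies;

// Sketch: hoist shuffles into static final constants so the JIT can see
// at compile time that every index lies in [0, 4) and fold away the
// checkIndexes() loop, leaving only the vpermd in the generated code.
public class ConstShuffle {
    static final VectorSpecies<Float> S128 = FloatVector.SPECIES_128;

    // Broadcast lane 0 / lane 1 across all four lanes -- the kind of
    // permutation a 4x4 matrix-vector multiply needs per column.
    static final VectorShuffle<Float> BCAST0 =
            VectorShuffle.fromValues(S128, 0, 0, 0, 0);
    static final VectorShuffle<Float> BCAST1 =
            VectorShuffle.fromValues(S128, 1, 1, 1, 1);

    // Illustrative use: sum of the lane-0 and lane-1 broadcasts of src.
    public static float[] demo(float[] src) {
        FloatVector v = FloatVector.fromArray(S128, src, 0);
        return v.rearrange(BCAST0).add(v.rearrange(BCAST1)).toArray();
    }
}
```

Because the shuffle indexes are compile-time constants here, the per-use index validation carries no dynamic information, so eliding it is purely a question of the JIT constant-folding through the shuffle object — which is exactly what the thread reports was not yet happening.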

