Slow code due to many AbstractVector::check and AbstractShuffle::checkIndexes checks in C2
Paul Sandoz
paul.sandoz at oracle.com
Tue May 26 20:30:46 UTC 2020
Thanks, very helpful.
I modified the vector code to do this:
    @ForceInline
    public final VectorShuffle<E> checkIndexes() {
        if (VectorIntrinsics.VECTOR_ACCESS_OOB_CHECK == 0) {
            return this;
        }
        // FIXME: vectorize this
        for (int index : reorder()) {
            if (index < 0) {
                throw checkIndexFailed(index, length());
            }
        }
        return this;
    }
It’s definitely the bounds check for the shuffle that’s causing the issue.
If I run the benchmark for mul128LoopArr with bounds checks disabled then the hot inner loop (from the dtraceasm profiler) is:
4.12% ↗ 0x000000010e680cf0: shl $0x2,%r11d
0.92% │ 0x000000010e680cf4: mov %r11d,%r9d
2.25% │ 0x000000010e680cf7: add $0x4,%r9d
1.78% │ 0x000000010e680cfb: mov %r8d,%r11d
5.18% │↗ 0x000000010e680cfe: vmovdqu 0x10(%r10,%r9,4),%xmm3
3.31% ││ 0x000000010e680d05: vpermd %ymm3,%ymm11,%ymm4
2.27% ││ 0x000000010e680d0a: vpermd %ymm3,%ymm2,%ymm5
1.94% ││ 0x000000010e680d0f: vmulps %xmm7,%xmm4,%xmm4
4.94% ││ 0x000000010e680d13: vpermd %ymm3,%ymm1,%ymm6
2.98% ││ 0x000000010e680d18: vpermd %ymm3,%ymm0,%ymm3
2.67% ││ 0x000000010e680d1d: vfmadd231ps %xmm10,%xmm3,%xmm4
2.14% ││ 0x000000010e680d22: vfmadd231ps %xmm9,%xmm6,%xmm4
5.10% ││ 0x000000010e680d27: vfmadd231ps %xmm8,%xmm5,%xmm4
4.41% ││ 0x000000010e680d2c: vmovdqu %xmm4,0x10(%r10,%r9,4)
4.63% ││ 0x000000010e680d33: mov %r11d,%r8d
1.59% ││ 0x000000010e680d36: inc %r8d
4.06% ││ 0x000000010e680d39: nopl 0x0(%rax)
1.84% ││ 0x000000010e680d40: cmp $0x4,%r8d
╰│ 0x000000010e680d44: jl 0x000000010e680cf0
And it beats the scalar and JNI versions.
(Note: the vector load addressing is not optimal)
We “just” need to optimize the shuffle. I would expect that if we are using constant shuffles, the bounds checks can be completely elided.
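
For example, something like the following sketch (class and field names
are just illustrative; the API calls are from jdk.incubator.vector). The
shuffle indexes are compile-time constants and all in range, so the check
inside rearrange should constant-fold away:

    import jdk.incubator.vector.FloatVector;
    import jdk.incubator.vector.VectorShuffle;
    import jdk.incubator.vector.VectorSpecies;

    class ConstantShuffleSketch {
        static final VectorSpecies<Float> S = FloatVector.SPECIES_128;
        // Constant shuffle: every index is a known-in-range constant.
        static final VectorShuffle<Float> SPLAT_LANE_0 =
                VectorShuffle.fromValues(S, 0, 0, 0, 0);

        static void splatMul(float[] m, float[] v) {
            FloatVector col = FloatVector.fromArray(S, m, 0);
            // The index check in rearrange is redundant here and should
            // be elided once the shuffle is treated as a constant.
            FloatVector x = FloatVector.fromArray(S, v, 0)
                                       .rearrange(SPLAT_LANE_0);
            col.mul(x).intoArray(m, 0);
        }
    }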
Paul.
> On May 23, 2020, at 2:38 PM, John Rose <john.r.rose at oracle.com> wrote:
>
> On May 23, 2020, at 3:33 AM, Kai Burjack <kburjack at googlemail.com> wrote:
>>
>> Hi Paul, hi John,
>>
>> thanks for getting back to me about it!
>>
>> I've prepared a standard Maven JMH benchmark under:
>> https://github.com/JOML-CI/panama-vector-bench
>> The README.md contains my current results with as
>> much optimization as I could cram out of the code for my
>> test CPU.
>>
>> I always test from the current tip of the vectorIntrinsics branch of:
>> https://github.com/openjdk/panama-vector/tree/vectorIntrinsics
>> as it can be nicely shallow-cloned in a few seconds.
>>
>> The results I gave before were based on the
>> "[vector] Undo workaround fix" commit.
>>
>> It'd be nice if, at some point in the future, the vectorized algorithms
>> were faster than their scalar counterparts in those benchmarks.
>>
>> Thanks for looking into it!
>
> Thanks for the extra data. (Replying to panama-dev to get
> it logged.)
>
>> Would it be possible to simply expose the vector species,
>> like Float128Vector statically to user code so as not having to
>> call vspecies() and drag the actual species as runtime information
>> through the template code in FloatVector? That way, the JIT
>> would statically know that the user is really only working with
>> a particular vector species and can optimize for it?
>
> The JIT is smart and can do that already. If it fails to do
> so in a particular case, there may be a bug in the JIT,
> but we expect that any code path which uses just one
> kind of vector will “sniff out” the exact type of that vector
> and DTRT without the user knowing the name of that
> exact type.
>
> This expectation extends even to vector-species-polymorphic
> algorithms, as long as either (a) they are inlined or (b) they
> are used, dynamically, on only one species at a time. We
> are thinking about additional techniques which would lift
> even those restrictions, in the setting of further optimizations
> for streams, and eventually streams-over-vectors.
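>
> As a concrete sketch (illustrative only, written against the
> jdk.incubator.vector API), a species-polymorphic loop like this
> should specialize to the one species it actually sees:
>
>     import jdk.incubator.vector.FloatVector;
>     import jdk.incubator.vector.VectorOperators;
>     import jdk.incubator.vector.VectorSpecies;
>
>     class SpeciesPolymorphicSketch {
>         // Written against the abstract species; when inlined, or when
>         // only one species flows in dynamically, the JIT sees the
>         // exact vector type and specializes the loop body.
>         static float sum(VectorSpecies<Float> s, float[] a) {
>             FloatVector acc = FloatVector.zero(s);
>             int i = 0;
>             for (; i <= a.length - s.length(); i += s.length()) {
>                 acc = acc.add(FloatVector.fromArray(s, a, i));
>             }
>             float r = acc.reduceLanes(VectorOperators.ADD);
>             for (; i < a.length; i++) {
>                 r += a[i];  // scalar tail
>             }
>             return r;
>         }
>     }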
>
>> I am very sure there is a reason for the current design.
>
> Yep. One reason is complexity: We are willing to burn
> in 1+N type names (to cover N lane types) but not 1+N*(1+M)
> type names (to cover M shapes). Another reason is to
> encourage programmers to avoid static dependencies on
> particular species; this will (we think) lead to more portable
> code. Yet another reason, building on that, is that we don’t
> at this time know all of the shapes we will be supporting
> over time. The existing VectorShape enum reflects current
> hardware and software assumptions, and is very very likely
> to expand over time.
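>
> (Concretely: with N = 6 lane types and M = 5 shapes today, that
> works out to 7 public type names instead of 37.)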
>
> — John