Slow code due to many AbstractVector::check and AbstractShuffle::checkIndexes checks in C2

Fri May 22 22:25:56 UTC 2020

Hi Kai,

What is the hg id of the tip of your cloned repo?
Can you share your benchmark code?

The checks in the FloatVector.fma operation should optimize fairly well, at least within loops. The shuffle's checkIndexes call is likely to have a cost; it's not optimized right now.

Paul.

> On May 22, 2020, at 3:02 AM, Kai Burjack <kburjack at googlemail.com> wrote:
> 
> I am very much looking forward to the Panama Vector API, which is
> currently being developed in the vectorIntrinsics branch, and I am
> currently playing with it for a Java matrix/vector library in order to
> speed up simple 4x4 matrix and vector multiplications, as has been done
> for many years in the .NET Core library using its SIMD intrinsics.
> 
> After having implemented an XMM and also a YMM register-based algorithm
> for 4x4 matrix multiplications, the issue currently is that all
> potential speedups are eliminated even on C2 due to various index
> checks. When looking at the disassembly in particular:
> 
> for FloatVector.rearrange:
> ; - jdk.internal.vm.vector.VectorSupport$VectorPayload::getPayload@-1 (line 98)
> ; - jdk.incubator.vector.AbstractShuffle::reorder@1 (line 75)
> ; - jdk.incubator.vector.AbstractShuffle::checkIndexes@1 (line 124)
> ; - jdk.incubator.vector.FloatVector::rearrangeTemplate@1 (line 1995)
> 
> and FloatVector.fma:
> ; - jdk.incubator.vector.AbstractVector::sameSpecies@8 (line 133)
> ; - jdk.incubator.vector.AbstractVector::check@2 (line 124)
> ; - jdk.incubator.vector.FloatVector::lanewiseTemplate@15 (line 814)
> ; - jdk.incubator.vector.Float256Vector::lanewise@4 (line 289)
> ; - jdk.incubator.vector.Float256Vector::lanewise@4 (line 41)
> ; - jdk.incubator.vector.FloatVector::fma@6 (line 2133)
> 
> generate costly checks in C2. So the generated C2 code contains many
> thousands of instructions and branches for what could be a simple
> sequence of mostly vmulps, vaddps, vfmadd231ps, vpermd (or vshufps) and
> vmovdqu instructions.
> 
> If I patch both methods above to avoid the index checks (in particular
> the very costly check in FloatVector.rearrange()), I get my code down
> from ~53ns/op to ~11ns/op (JMH-benchmarked).
> I know it's probably very early to ask about performance for what's
> probably not even a primary use case of Java (accelerating numeric
> algorithms for computer graphics applications), but I just want to let
> you know that there are people who care about it. :)
> 
> Anyways, thanks for the fantastic work on Panama so far!
> 
> Kind regards,
> Kai
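
For readers without the benchmark at hand, here is a minimal scalar sketch of the column-broadcast plus FMA formulation that a SIMD 4x4 matrix multiplication of this kind typically follows. It is not Kai's benchmark code and it deliberately does not use jdk.incubator.vector. The class name Mat4Sketch, the method name mul, and the column-major float[16] layout are assumptions made purely for illustration. Each step of the innermost loop stands in for one lane of a broadcast (presumably what rearrange does in the vectorized version) followed by a lanewise fma.

```java
public final class Mat4Sketch {
    private Mat4Sketch() {}

    /**
     * Returns a * b for 4x4 matrices stored column-major in float[16],
     * i.e. element (row, col) lives at index col * 4 + row.
     * (Layout is an assumption for this sketch.)
     */
    public static float[] mul(float[] a, float[] b) {
        float[] r = new float[16];
        for (int col = 0; col < 4; col++) {
            for (int k = 0; k < 4; k++) {
                // Scalar stand-in for broadcasting element (k, col) of b
                // across all four lanes.
                float s = b[col * 4 + k];
                for (int row = 0; row < 4; row++) {
                    // One lane of the fused multiply-add:
                    // r[:, col] += a[:, k] * s
                    r[col * 4 + row] =
                        Math.fma(a[k * 4 + row], s, r[col * 4 + row]);
                }
            }
        }
        return r;
    }
}
```

In a vectorized version, the innermost loop would collapse into a single broadcast and a single fused multiply-add per step, which is why per-operation species and index checks that are not hoisted or eliminated can dominate the cost of such a small kernel.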