IntVector.fromValues is not optimized away ?

Tue May 12 01:39:26 UTC 2020

On May 11, 2020, at 5:14 PM, Paul Sandoz <paul.sandoz at oracle.com> wrote:
> 
> I wonder if it's possible to teach the shared reduction code about operations using the identity value?

In general, I’d encourage us to put as much into shared code as
possible.  We have more vector hardware in our future; I’m thinking
of GPUs of course, and who knows what other CPUs or VPUs will be
important in 10 years.

BTW, this reminds me that in some cases reduction operations are
most naturally formulated with the type (scalar, vector) -> scalar, not just
(vector) -> scalar.  The two-argument form reduces to the one-argument
form when the input scalar is the identity value.  The two-argument
form is useful when several vectors are being rolled up together,
perhaps in a loop.  I think we may want (not now but later) to make
the building block be the two-argument reduction, not the simpler
one.
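
To make the two forms concrete, here is a minimal sketch in plain Java.
Arrays stand in for vectors, and the class and method names are made up
for illustration; none of this is Vector API code.

```java
// Hypothetical sketch: int[] stands in for a vector of lanes.
final class TwoArgReduction {
    // Two-argument form: (scalar, vector) -> scalar.
    static int reduceAdd(int acc, int[] lanes) {
        for (int lane : lanes) {
            acc += lane;
        }
        return acc;
    }

    // One-argument form: the two-argument form seeded with the
    // identity value (0 for addition).
    static int reduceAdd(int[] lanes) {
        return reduceAdd(0, lanes);
    }

    // Rolling several vectors up in a loop: the scalar accumulator is
    // threaded through each two-argument reduction.
    static int sumAll(int[][] vectors) {
        int acc = 0;
        for (int[] v : vectors) {
            acc = reduceAdd(acc, v);
        }
        return acc;
    }

    public static void main(String[] args) {
        int[][] vectors = { {1, 2, 3, 4}, {5, 6, 7, 8} };
        System.out.println(sumAll(vectors)); // prints 36
    }
}
```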

Also BTW, and independently, we might wish to make a shared
convention (in C2 and the Java code) that reductions are always
done in some particular order, when it matters.  If we do make
such a choice, we should choose a particular binary spanning tree,
since that, generally speaking, is how it’s done in hardware.
Disagreements between spanning tree orders can be removed
(if needed) by one-time permutations of the input.
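
As an illustration of the permutation point, the sketch below compares
two binary tree orders on float lanes.  It assumes a power-of-two lane
count, and the two tree shapes (adjacent pairs, and lane i with lane
i + n/2) are just examples I picked, not a claim about what any
particular hardware does.  For these two shapes, a bit reversal of the
lane index is the one-time permutation that makes them agree.

```java
// Hypothetical sketch: float[] stands in for a vector; length must be a power of two.
final class TreeOrders {
    // Tree A: combine adjacent lanes (0,1), (2,3), ... at each level.
    static float adjacentTree(float[] v) {
        float[] w = v.clone();
        for (int n = w.length; n > 1; n /= 2) {
            for (int i = 0; i < n / 2; i++) {
                w[i] = w[2 * i] + w[2 * i + 1];
            }
        }
        return w[0];
    }

    // Tree B: combine lane i with lane i + n/2 at each level.
    static float halvingTree(float[] v) {
        float[] w = v.clone();
        for (int h = w.length / 2; h >= 1; h /= 2) {
            for (int i = 0; i < h; i++) {
                w[i] = w[i] + w[i + h];
            }
        }
        return w[0];
    }

    // One-time input permutation: lane i receives the lane whose index
    // is i with its bits reversed.
    static float[] bitReverse(float[] v) {
        int bits = Integer.numberOfTrailingZeros(v.length);
        float[] out = new float[v.length];
        for (int i = 0; i < v.length; i++) {
            int j = (bits == 0) ? 0 : Integer.reverse(i) >>> (32 - bits);
            out[i] = v[j];
        }
        return out;
    }

    public static void main(String[] args) {
        float[] v = { 1e8f, 1f, -1e8f, 1f };
        System.out.println(adjacentTree(v));            // prints 0.0
        System.out.println(halvingTree(v));             // prints 2.0
        System.out.println(halvingTree(bitReverse(v))); // prints 0.0
    }
}
```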

It seems to me that the two observations work against each other,
since you can’t build such a good spanning tree on 1+2^lgN nodes
as you can on 2^lgN nodes.  This is one reason we need some time
(after the current release) to consider the proper order specification
for reductions in our portable API.

(BTW, the difference in order only matters with floating-point
operations that have NaNs and/or rounding errors.  So the problems
with order are limited to those, and whatever other non-associative
operations we might define in the future.)
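
A small sketch of that difference, again with arrays standing in for
vectors and with made-up names: int addition gives the same answer in
either order (even when it overflows), while float addition does not.
The lane count is assumed to be a power of two.

```java
// Hypothetical sketch comparing a sequential chain with a pairwise tree.
final class OrderSensitivity {
    static float sequential(float[] v) {
        float acc = 0f;
        for (float x : v) {
            acc += x;
        }
        return acc;
    }

    static float tree(float[] v) {
        float[] w = v.clone();
        for (int n = w.length; n > 1; n /= 2) {
            for (int i = 0; i < n / 2; i++) {
                w[i] = w[2 * i] + w[2 * i + 1];
            }
        }
        return w[0];
    }

    static int sequential(int[] v) {
        int acc = 0;
        for (int x : v) {
            acc += x;
        }
        return acc;
    }

    static int tree(int[] v) {
        int[] w = v.clone();
        for (int n = w.length; n > 1; n /= 2) {
            for (int i = 0; i < n / 2; i++) {
                w[i] = w[2 * i] + w[2 * i + 1];
            }
        }
        return w[0];
    }

    public static void main(String[] args) {
        float[] f = { 1e8f, 1f, -1e8f, 1f };
        System.out.println(sequential(f)); // prints 1.0
        System.out.println(tree(f));       // prints 0.0

        // Wrapping int addition is associative, so the order is invisible.
        int[] n = { Integer.MAX_VALUE, 1, -5, 7 };
        System.out.println(sequential(n) == tree(n)); // prints true
    }
}
```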

Two arguments in favor of reducing in N-1 sequential steps instead
of lgN steps of parallel operations:  It’s the simplest to specify, and
works best with the binary version of reduction.  One argument
against:  Honoring rounding and NaN behavior will slow FP reductions down.
Maybe there’s a “strictfp” move we can use to allow the JVM more
latitude for reordering reductions into trees, except in strict code.
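
To illustrate why the sequential order fits the two-argument form, here
is a rough sketch (made-up names, arrays as vectors, power-of-two lane
counts assumed).  With a sequential chain, rolling up chunks through a
scalar accumulator gives exactly the result of reducing all the lanes
at once.  With a tree inside each chunk, the chunked result need not
match a single tree over all the lanes.

```java
// Hypothetical sketch: chunked reduction through a scalar accumulator.
final class ChunkedReduction {
    // Two-argument sequential reduction: N-1 dependent adds per vector.
    static float seq(float acc, float[] v) {
        for (float x : v) {
            acc += x;
        }
        return acc;
    }

    // One-argument pairwise tree reduction: lgN levels.
    static float tree(float[] v) {
        float[] w = v.clone();
        for (int n = w.length; n > 1; n /= 2) {
            for (int i = 0; i < n / 2; i++) {
                w[i] = w[2 * i] + w[2 * i + 1];
            }
        }
        return w[0];
    }

    static float seqChunked(float[][] chunks) {
        float acc = 0f;
        for (float[] c : chunks) {
            acc = seq(acc, c);
        }
        return acc;
    }

    static float treeChunked(float[][] chunks) {
        float acc = 0f;
        for (float[] c : chunks) {
            acc = acc + tree(c);
        }
        return acc;
    }

    static float[] concat(float[][] chunks) {
        int total = 0;
        for (float[] c : chunks) {
            total += c.length;
        }
        float[] out = new float[total];
        int pos = 0;
        for (float[] c : chunks) {
            System.arraycopy(c, 0, out, pos, c.length);
            pos += c.length;
        }
        return out;
    }

    public static void main(String[] args) {
        float[][] chunks = { {1e8f, 0f}, {1f, 0f}, {-1e8f, 0f}, {1f, 0f} };
        float[] all = concat(chunks);
        System.out.println(seqChunked(chunks) == seq(0f, all)); // prints true
        System.out.println(treeChunked(chunks));                // prints 1.0
        System.out.println(tree(all));                          // prints 0.0
    }
}
```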

— John