[vector] RFR 8221816: IndexOutOfBoundsException for fromArray/intoArray with unset mask lanes - was: RE: IndexOutOfBoundsException with unset mask lanes

Fri May 31 18:27:00 UTC 2019

On May 31, 2019, at 5:32 AM, Joshua Zhu (Arm Technology China) <Joshua.Zhu at arm.com> wrote:
> 
> In my initial thought, in long term masked access should behave in the way like:

Thanks for thinking about this problem, Joshua.
I think it's important to keep chewing on.
I am sure that (eventually) we'll find a solution.

> Main loop (fast path in my patch which succeed in range check) will work same with current implementation.

Yes.  If this requirement is to be fully met, and the
source code has "masks all the way" through the loop,
it follows that we need robust strength reduction where
the JIT can see constant-true masks and convert them
into regular unmasked operations.  The constant-true
masks are an annoyance to the main loop and need to
be suppressed.  They are an artifact of the original
source code having just one kernel, which gets used
in two ways, (in effect) unmasked for the main loop,
and masked for loop boundaries (which could include
both pre- and post-loops).
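
To make the "one kernel, two uses" point concrete, here
is a minimal sketch.  It is a scalar model of vector lanes,
not the Vector API itself: LANES, rangeMask, and kernel
are hypothetical names chosen for illustration.  Every
iteration but the last sees an all-true mask, which is
exactly the mask the JIT would want to strength-reduce away.

```java
import java.util.Arrays;

// Hypothetical scalar model of vector lanes; not the Vector API itself.
class OneKernelSketch {
    static final int LANES = 4; // assumed vector width

    // Lanes l with i + l < length are set, the rest are unset.
    static boolean[] rangeMask(int i, int length) {
        boolean[] m = new boolean[LANES];
        for (int l = 0; l < LANES; l++) {
            m[l] = i + l < length;
        }
        return m;
    }

    // The one kernel: masked lane-wise add. Unset lanes touch no memory.
    static void kernel(int[] a, int[] b, int[] c, int i, boolean[] m) {
        for (int l = 0; l < LANES; l++) {
            if (m[l]) {
                c[i + l] = a[i + l] + b[i + l];
            }
        }
    }

    // "Masks all the way": the same kernel serves main loop and tail.
    static void add(int[] a, int[] b, int[] c) {
        for (int i = 0; i < a.length; i += LANES) {
            kernel(a, b, c, i, rangeMask(i, a.length));
        }
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5, 6};
        int[] b = {10, 20, 30, 40, 50, 60};
        int[] c = new int[a.length];
        add(a, b, c);
        System.out.println(Arrays.toString(c));
    }
}
```

With six elements and four lanes, the first iteration runs
with an all-true mask and the second with only two lanes set.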

One problem we are puzzling over here is how to design
the source-level API so that it's easy to write the loop
without masks cluttering up the logic, just so that the
edges can be masked.  I think this will involve equipping
some vector shapes with associated masks which are
automatically applied.  The JIT's logic currently doesn't
allow vector-boxes to be combined in this way (via
fields) but I think it might be reasonable to investigate,
even before we get inline value types from Valhalla.

In any case, whether the masks are "submerged"
inside new shapes or hidden in some other clever
way, the JIT will want to see these mostly-all-true
masks and do the above strength reduction.

The branch profiling stuff is a good tool to apply
here.  Basically we want "v.method(w, MATM(m))"
where "MATM" is a branch-predicting intrinsic
operator which says "this is a mostly all-true
mask".  On the fast path out of the operator,
the mask can be hardwired to a true constant,
and then strength reduction to an unmasked
instruction is just a few more steps.

Given such a "MATM" operator, we can then
split loops into fast and slow versions, based on
that operator's oracular advice.  The slow loop
version will probably run once and exit at the
end, although it might also be useful for a
pre-loop.
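
As a rough illustration of that split, here is a sketch
written by hand rather than by the JIT.  The names
mostlyAllTrue (standing in for "MATM"), unmaskedKernel,
and maskedKernel are hypothetical, and the lanes are
modeled with plain arrays.  The fast loop runs while the
operator reports an all-true mask; the slow loop picks up
whatever is left.

```java
import java.util.Arrays;

// Hypothetical model of a MATM-driven loop split; not a real JIT transform.
class MatmSplitSketch {
    static final int LANES = 4; // assumed vector width

    static boolean[] rangeMask(int i, int length) {
        boolean[] m = new boolean[LANES];
        for (int l = 0; l < LANES; l++) {
            m[l] = i + l < length;
        }
        return m;
    }

    // Stand-in for "MATM": true when every lane of the mask is set.
    static boolean mostlyAllTrue(boolean[] m) {
        for (boolean bit : m) {
            if (!bit) return false;
        }
        return true;
    }

    // Fast path: the mask is known to be all-true, so it is dropped.
    static void unmaskedKernel(int[] a, int[] b, int[] c, int i) {
        for (int l = 0; l < LANES; l++) {
            c[i + l] = a[i + l] + b[i + l];
        }
    }

    // Slow path: the mask is honored lane by lane.
    static void maskedKernel(int[] a, int[] b, int[] c, int i, boolean[] m) {
        for (int l = 0; l < LANES; l++) {
            if (m[l]) c[i + l] = a[i + l] + b[i + l];
        }
    }

    // Returns {fast iterations, slow iterations} for illustration.
    static int[] add(int[] a, int[] b, int[] c) {
        int fast = 0, slow = 0, i = 0;
        for (; i < a.length; i += LANES) {           // fast loop
            if (!mostlyAllTrue(rangeMask(i, a.length))) break;
            unmaskedKernel(a, b, c, i);
            fast++;
        }
        for (; i < a.length; i += LANES) {           // slow loop
            maskedKernel(a, b, c, i, rangeMask(i, a.length));
            slow++;
        }
        return new int[] {fast, slow};
    }

    public static void main(String[] args) {
        int[] a = new int[10], b = new int[10], c = new int[10];
        Arrays.fill(a, 1);
        Arrays.fill(b, 2);
        System.out.println(Arrays.toString(add(a, b, c)));
    }
}
```

In this model the slow loop runs at most once, as a
post-loop; a pre-loop for alignment would follow the same
pattern on the other side of the fast loop.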

I think I'm saying stuff here that all of us are already
thinking and discussing, but I wanted to get it out
clearly, just in case there's a point here that has
been overlooked.

> For slow path, masked operation will be intrinsified into corresponding instructions on mask-supported platform.
> Otherwise it will go scalar.

In some cases we might be able to get a two-step
fallback, first to a second set of vector instructions,
and then to scalar code.  The Java code wants to say
things like "use an aggressive masked form" but
back off to a default implementation which still produces
vectorized code, but with explicit blends or scatters or
whatever to emulate the masked semantics.  After that
it can go scalar, if the type profile is bad or the hardware
is not present.
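
A sketch of the last two steps of such a fallback, for a
masked load, might look like the following.  This is a
scalar model under stated assumptions, not the Vector API:
maskedLoad, loadThenBlend, and scalarLoad are hypothetical
names, and the first tier (a true hardware masked load)
is not modeled at all.

```java
import java.util.Arrays;

// Hypothetical model of a masked-load fallback; lanes are plain arrays.
class MaskedLoadFallbackSketch {
    static final int LANES = 4; // assumed vector width

    // Emulation tier: full unmasked load, then blend with zero.
    // Only legal when all LANES elements are in bounds.
    static int[] loadThenBlend(int[] a, int i, boolean[] m) {
        int[] v = Arrays.copyOfRange(a, i, i + LANES);
        for (int l = 0; l < LANES; l++) {
            if (!m[l]) v[l] = 0;
        }
        return v;
    }

    // Scalar tier: touch only the lanes whose mask bit is set.
    static int[] scalarLoad(int[] a, int i, boolean[] m) {
        int[] v = new int[LANES];
        for (int l = 0; l < LANES; l++) {
            if (m[l]) v[l] = a[i + l];
        }
        return v;
    }

    static int[] maskedLoad(int[] a, int i, boolean[] m) {
        if (i >= 0 && i + LANES <= a.length) {
            return loadThenBlend(a, i, m); // still "vectorized"
        }
        return scalarLoad(a, i, m);        // last resort
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5, 6};
        boolean[] tail = {true, true, false, false};
        System.out.println(Arrays.toString(maskedLoad(a, 4, tail)));
    }
}
```

The range check in maskedLoad is the part that matters:
unset lanes past the end of the array never reach the
full-width load, so they cannot raise an
IndexOutOfBoundsException.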

> But from review comments on my first patch [1], if mask operation is not directly supported by hardware, 
> scalar on slow path will cause problems if it can't prove access is always in-bounds and prune the slow path.

I think this is the sort of thing that motivates loop splitting.
If a slow path can't be pruned it can be pushed into the
bad slow loop.  If that bad slow loop is the post-loop, it's
only going to go slow on the last partial iteration of the
original loop.

> I think the problem is, without uncommon trap, VectorBox will be generated in fast path, for example, after blend for fromArray.

(Without uncommon trap, or loop splitting that creates
a slow version of the loop to catch bad stuff.)

> This results in Vector will be stored into memory instead of simd register. Is my understanding correct?

Vladimir would know for sure but this sounds right.

— John

P.S. What do you ARM experts think about having v.div(w)
throw ArithmeticException when any lane in w is zero,
for non-float types only?  I know ARM forces a non-exceptional
result there, but that's not Java-like at all.  In general, are
we comfortable with adding exception exits to vector operations,
like divide-by-zero and AIOOB (array index out of bounds)
on scatter/gather?
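
For reference, the Java-like behavior I have in mind for
integral lanes could be modeled as below.  This is only a
scalar sketch with a hypothetical div helper, not a
proposed implementation: it throws before producing any
result if any lane of the divisor is zero, which matches
what scalar int division already does in Java.

```java
import java.util.Arrays;

// Hypothetical lane-wise model of integral v.div(w); not the Vector API.
class LaneDivSketch {
    static int[] div(int[] v, int[] w) {
        // Exception exit: check every divisor lane up front.
        for (int l = 0; l < w.length; l++) {
            if (w[l] == 0) {
                throw new ArithmeticException("/ by zero in lane " + l);
            }
        }
        int[] r = new int[v.length];
        for (int l = 0; l < v.length; l++) {
            r[l] = v[l] / w[l];
        }
        return r;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
            div(new int[] {8, 9, 10, 11}, new int[] {2, 3, 5, 11})));
    }
}
```

Float and double lanes would not take this exit, since
Java floating-point division by zero yields an infinity
or NaN rather than an exception.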