RFR (M) 8222074: Enhance auto vectorization for x86

Tue Apr 9 22:24:13 UTC 2019

On Apr 9, 2019, at 3:02 PM, Vladimir Ivanov <vladimir.x.ivanov at oracle.com> wrote:
> 
> On 09/04/2019 12:53, John Rose wrote:
>> I agree that the AD file combinatorics need to be tamed
>> more aggressively in this way.
>> I have to wonder if this could be done during incubation,
>> as a cleanup of tech. debt, so we can get experience with
>> the API at the same time.
> 
> I believe it depends on how severe static footprint increase will be.
> 
> Last time I checked [1], Vector API-related changes (w/o SVML stubs) contributed ~3Mb/15% to libjvm.so on Linux.
> 
> It would be very helpful to have more detailed and up-to-date information on that.

I agree.

And not to pre-empt what the numbers actually say, but
I think a parametric, data-driven backend is worth considering,
either as a real design, or as an ideal model for what we need
to tame the complexity of vector opcodes.

Most vector operations are parameterized by vector size, lane size,
and per-lane operation, and those three parameters are largely
independent of each other.  This suggests to me that we could
remove intermediate layers of distinction by plumbing uOp and
bOp straight through to the backend, equipped with a numeric
code derived from the FUnOp/FBinOp instance, plus bit sizes
for the vector and the lane.

vsize : {64,128,256,…,MAX_VSIZE}
lsize : {8,16,32,64} and maybe {1,2,4,128,…}
op : { and, or, xor, … , iadd, isub, icmp, …,
         fadd, fsub, fcmp, …,
         crc32, clmul, aes_enc_step, … }

Maybe that's too monolithic to actually implement,
but I think it is an ideal for taming ISA complexity,
by breaking it into partially independent components.

And I *do* have an agenda here to make room for
the "snowflake" operations which don't make sense
for scalars but are reasonably portable.  Such operations
are slowly growing in number, and have important
applications.  So the set of opcodes is not as simple
one would think at first glance; it's not your grandpa's
calculator.

— John

P.S. Another degree of freedom is the presence of masking, and on
how to materialize the unmasked lanes, zero, destination, or
second source.  Also whether the kind of the mask:  Bitmask,
vector-mask, or index range for loops.  Not sure how to slice
all that, but it seems that identifying the degrees of freedom
helps us design a useful framework.

P.P.S. Another place where small-integer codes might shine (besides
opcodes) is with shuffle generation.  The shiftEL operation is
really a cross-lane shift, parameterized by a small integer between
zero and the number of lanes (minus one).  This is almost but
not quite a constant shuffle operation; it's a shuffle across a
preset small family of shuffles, and it could be refactored as
a selection from a small family of "shuffle ops" which either
materialize into shuffle constants or are pattern-matched into
a special hardware instruction.  And there are lots of other
shuffle ops that can be special-cased, if you (a) look at use
cases for special shuffles, then (b) figure out how to factor
and parameterize the subspace of relevant shuffles, and
then (c) go find relevant special instructions and write the
code generator with built-in strength reduction when the
relevant shuffle instances appear.