[aarch64-port-dev ] RFR(L): 8231441: AArch64: Initial SVE backend support

Mon Aug 24 13:40:53 UTC 2020

On 24/08/2020 10:16, Ningsheng Jian wrote:

> On 8/22/20 6:34 AM, Vladimir Ivanov wrote:

>> The ultimate goal was to move to vectors which represent full-width
>> hardware registers. After we were convinced that it will work well in AD
>> files, we encountered some inefficiencies with vector spills: depending
>> on actual hardware, smaller (than available) vectors may be used (e.g.,
>> integer computations on AVX-capable CPU). So, we stopped half-way and
>> left post-matching part intact: depending on actual vector value width,
>> appropriate operand (vecX/vecY/vecZ + legacy variants) is chosen.
>>
>> (I believe you may be in a similar situation on AArch64 with NEON vs SVE
>> where both 128-bit and wide SVE vectors may be used at runtime.)

Your problem here seems to be a worry about spilling more data than is
actually needed. As Ningsheng pointed out, the amount of data spilled is
determined by the actual length of the VecA registers, not by the
logical size of the VecA mask (256 bits) nor by the maximum possible
size of a VecA register on future architectures (2048 bits). So, no more
stack space will be used than is needed to preserve the live bits that
need preserving.
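
To make the three sizes concrete, here is a minimal sketch. The names are
hypothetical (they are not HotSpot identifiers), and it assumes 32-bit
mask slots and a vector length that is a multiple of 128 bits between 128
and 2048. Only the physical length feeds into the spill size:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names for illustration only; these are not HotSpot identifiers.
const uint32_t kSlotBits         = 32;    // one register-mask slot stands for 32 bits
const uint32_t kLogicalVecASlots = 8;     // nominal VecA mask: 8 slots
const uint32_t kMaxSveBits       = 2048;  // largest vector length SVE allows

// Logical size of the VecA mask; fixed, whatever the hardware provides.
inline uint32_t logical_mask_bits() {
    return kLogicalVecASlots * kSlotBits;
}

// Bytes needed to spill one VecA register, given the vector length
// actually implemented by the hardware (assumed to be detected at startup).
inline uint32_t spill_bytes(uint32_t physical_bits) {
    assert(physical_bits >= 128 && physical_bits <= kMaxSveBits);
    assert(physical_bits % 128 == 0);
    return physical_bits / 8;
}
```

So a 512-bit implementation spills 64 bytes per register, neither the 32
bytes the logical mask might suggest nor the 256 bytes of the
architectural maximum.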

>> Unfortunately, it extends the implementation in orthogonal direction
>> which looks too aarch64-specific to benefit other architectures and x86
>> in particular. I believe there's an alternative approach which can benefit
>> both aarch64 and x86, but it requires more experimentation.
>>
> 
> Since vecA and vecX (and others) are architecturally different vector
> registers, I think it's quite natural that we just introduced the new
> vector register type vecA, to represent what we need for corresponding
> hardware vector register. Please note that in vector length agnostic
> ISA, like Arm SVE and RISC-V vector extension [1], the vector registers
> are architecturally the same type of register despite the different
> hardware implementations.

Yes, I also see this as quite natural. Ningsheng's change extends the
implementation in the architecture-specific direction that is needed for
AArch64's vector model. The fact that this differs from x86_64 is not
unexpected.

>> If I were to start from scratch, I would choose between 3 options:
>>
>>     #1: reuse existing VecX/VecY/VecZ ideal registers and limit supported
>> vector sizes to 128-/256-/512-bit values.
>>
>>     #2: lift limitation on max size (to 1024/2048 bits), but ignore
>> non-power-of-2 sizes;
>>
>>     #3: introduce support for full range of vector register sizes
>> (128-/.../2048-bit with 128-bit step);
>>
>> I see 2 (mostly unrelated) limitations: maximum vector size and
>> non-power-of-2 sizes.

Yes, but this patch deals with both of those and I cannot see it causing
any problems for x86_64 nor do I see it adding any great complexity. The
extra shared paths deal with scalable vectors, which only occur on AArch64.
A scalable VecA register (and also eventually the scalable predicate
register) caters for all possible vector sizes via a single 'logical'
vector of size 8 slots (also eventually a single 'logical' predicate
register of size 1 slot). Catering for scalable registers in shared code
is localized and does not change handling of the existing, non-scalable
VecX/Y/Z registers.

>> My understanding is that you don't try to accurately represent SVE for
>> now, but lay some foundations for future work: you give up on
>> non-power-of-2 sized vectors, but still enable support for arbitrarily
>> sized vectors (addressing both limitations on maximum size and size
>> granularity) in RA (and it affects only spills). So, it is somewhere
>> between #2 and #3.

I have to disagree with your statement that this proposal doesn't
'accurately' represent SVE. Yes, the vector mask for this arbitrary-size
vector is modelled 'logically' using a nominal 8 slots. However, that is
merely to avoid wasting bits in the bit masks plus cpu time processing
them. The 'physical' vector length models the actual number of slots,
and includes the option to model a non-power of two. That 'physical'
size is used in all operations that manipulate VecA register contents.
So, although I grant that the code is /parameterized/, it is also 100%
accurate.

>> The ultimate goal is definitely #3, but how much more work will be
>> required to teach the JVM about non-power-of-2 vectors? As I see in the
>> patch, you don't have auto-vectorizer support yet, but Vector API will
>> provide access to whatever size hardware exposes. What do you expect on
>> hardware front in the near/mid-term future? Anything supporting vectors
>> larger than 512-bit? What about 384-bit vectors?

Do we need to know for sure such hardware is going to arrive in order to
allow for it now? If there were a significant cost to doing so I'd maybe
say yes but I don't really see one here. Most importantly, the changes
to the AArch64 register model and small changes to the shared
chaitin/reg mask code proposed here already work with the
auto-vectorizer if the VecA slots are any of the possible powers of 2
VecA sizes.

The extra work needed to profit from non-power-of-two vectors involves
upgrading the auto-vectorizer code. While this may be tricky I don't see
it as impossible. However, more importantly, even if such an upgrade
cannot be achieved then this proposal is still a very simple way to
allow for arbitrarily scalable SVE vectors that are a power of two size.
It also allows any architecture with a non-power of two to work with the
lowest power of two that fits. So, this is a very simple way to cater for
what may turn up.
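
As a rough sketch of that fallback: it reads "the power of two that fits"
as the largest power of two not exceeding the hardware length (so 384
bits becomes 256), and the function name is hypothetical rather than
anything in the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of the patch: pick the vector length the
// JVM would use when only power-of-two sizes are supported.
inline uint32_t usable_vector_bits(uint32_t hw_bits) {
    // SVE lengths are multiples of 128 bits, from 128 up to 2048.
    assert(hw_bits >= 128 && hw_bits <= 2048 && hw_bits % 128 == 0);
    uint32_t p = 128;
    while (p * 2 <= hw_bits) {
        p *= 2;  // grow while the next power of two still fits
    }
    return p;
}
```

A power-of-two length is returned unchanged; a 384-bit or 640-bit
implementation would be treated as 256-bit or 512-bit respectively.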

>> For larger vectors #2 (or a mix of #1 and #2) may be a good fit. My
>> understanding is that existing RA machinery should support 1024-bit
>> vectors well. So, unless 2048-bit vectors are needed, we could live with
>> the framework we have right now.

I'm not sure what you are proposing here but it sounds like introducing
extra vectors beyond VecX, VecY for larger powers of two i.e. VecZ,
VecZZ, VecZZZ ... and providing separate case processing for each of
them where the relevant case is selected conditional on the actual
vector size. Is that what you are proposing? I can't see any virtue in
multiplying case handling for each new power-of-two size that turns up
when all possible VecZ* power-of-two options can actually be handled as
one uniform case.

>> If hardware has non-power-of-2 vectors, but JVM doesn't support them,
>> then JVM can work with just power-of-2 portion of them (384-bit =>
>> 256-bit).

And, of course, the previous comment applies here /a fortiori/.

>> Giving up on #3 for now and starting with less ambitious goals (#1 or
>> #2) would reduce pressure on RA and give more time for additional
>> experiments to come with a better and more universal
>> support/representation of generic/size-agnostic vectors. And, in a
>> longer term, help reducing complexity and technical debt in the area.

Can you explain what you mean by 'reduce pressure on RA'? I'm also
unclear as to what you see as complex about this proposal.

>> Some more comments follow inline.
>>
>>>> Compared to x86 w/ AVX512, architectural state for vector registers is
>>>> 4x larger in the worst case (ignoring predicate registers for now).
>>>> Here are the relevant constants on x86:
>>>>
>>>> gensrc/adfiles/adGlobals_x86.hpp:
>>>>
>>>> // the number of reserved registers + machine registers.
>>>> #define REG_COUNT    545
>>>> ...
>>>> // Size of register-mask in ints
>>>> #define RM_SIZE 22
>>>>
>>>> My estimate is that for AArch64 with SVE support the constants will be:
>>>>
>>>>     REG_COUNT < 2500
>>>>     RM_SIZE < 100
>>>>
>>>> which don't look too bad.

I'm not sure what these numbers are meant to mean. The number of SVE
vector registers is the same as the number of NEON vector registers i.e.
32. The register mask size for VecA registers is 8 * 32 bits.
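
For comparison, a small sketch of the arithmetic. It assumes "8 * 32"
means 8 logical slots for each of the 32 registers, and that masks are
counted in 32-bit ints as RM_SIZE is; the names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names for illustration only.
const uint32_t kVectorRegs = 32;  // SVE has 32 vector registers, same as NEON
const uint32_t kWordBits   = 32;  // register masks are sized in 32-bit ints

// Mask bits needed to cover all vector registers at a given slot count.
inline uint32_t mask_bits(uint32_t slots_per_reg) {
    assert(slots_per_reg > 0);
    return kVectorRegs * slots_per_reg;
}

// The same quantity expressed in 32-bit ints, rounded up.
inline uint32_t mask_words(uint32_t slots_per_reg) {
    return (mask_bits(slots_per_reg) + kWordBits - 1) / kWordBits;
}
```

With 8 logical slots this comes to 256 mask bits (8 ints). Modelling
every 32-bit slot of a 2048-bit register (64 slots) would instead need
2048 mask bits (64 ints), which is presumably where an estimate in the
region of REG_COUNT < 2500 comes from.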

>>> Right, but I think most real hardware implementations will be no
>>> larger than 512 bits. Having a large bitmask array, with most
>>> bits useless, will be less efficient for regmask computation.
>>
>> Does it make sense to limit the maximum supported size to 512-bit then
>> (at least, initially)? In that case, the overhead won't be worse than it is
>> on x86 now.

Well, no. It doesn't make sense when all you need is a 'logical' 8 * 32
bit mask whatever the actual 'physical' register size is.

regards,

Andrew Dinn
-----------
Red Hat Distinguished Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill