[aarch64-port-dev ] RFR(L): 8231441: AArch64: Initial SVE backend support
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Tue Aug 25 13:18:12 UTC 2020
Hi Andrew,
I elaborated on some of the points in the thread with Ningsheng.
I put my responses in-line, but will try to avoid repeating myself too much.
>>> The ultimate goal was to move to vectors which represent full-width
>>> hardware registers. After we were convinced that it will work well in AD
>>> files, we encountered some inefficiencies with vector spills: depending
>>> on actual hardware, smaller (than available) vectors may be used (e.g.,
>>> integer computations on AVX-capable CPU). So, we stopped half-way and
>>> left post-matching part intact: depending on actual vector value width,
>>> appropriate operand (vecX/vecY/vecZ + legacy variants) is chosen.
>>>
>>> (I believe you may be in a similar situation on AArch64 with NEON vs SVE
>>> where both 128-bit and wide SVE vectors may be used at runtime.)
>
> Your problem here seems to be a worry about spilling more data than is
> actually needed. As Ningsheng pointed out the amount of data spilled is
> determined by the actual length of the VecA registers, not by the
> logical size of the VecA mask (256 bits) nor by the maximum possible
> size of a VecA register on future architectures (2048 bits). So, no more
> stack space will be used than is needed to preserve the live bits that
> need preserving.
I described the experience with doing a similar exercise on x86:
migrating away from [leg]vec[SDXYZ] operands to a uniform size-agnostic
representation (legVec/vec). The only problem with abandoning Op_VecX et
al was the need to track the size of vector values in RA.
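To make that concrete, here is a rough sketch of what I mean by tracking
the size in RA (the names below are purely illustrative and do not match
the actual HotSpot sources): the live range carries the number of 32-bit
slots the value occupies, and the spill/reload code consults that instead
of deriving the size from a fixed-width operand.

  // Illustrative sketch only; names don't match the real sources.
  class LiveRange {
    unsigned _num_slots;           // 32-bit slots occupied by the value
  public:
    unsigned num_slots() const     { return _num_slots; }
    void set_num_slots(unsigned n) { _num_slots = n; }
  };

  // Spill code sizes the stack area from the live range itself rather
  // than from a fixed vecX/vecY/vecZ operand:
  unsigned spill_slots(const LiveRange& lrg) {
    return lrg.num_slots();        // e.g. 4 for 128-bit, 16 for 512-bit
  }

With something like that in place the matcher can hand out a single
size-agnostic operand while RA still spills exactly as many slots as the
value needs.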
>>> Unfortunately, it extends the implementation in orthogonal direction
>>> which looks too aarch64-specific to benefit other architectures and x86
>>> particular. I believe there's an alternative approach which can benefit
>>> both aarch64 and x86, but it requires more experimentation.
>>>
>>
>> Since vecA and vecX (and others) are architecturally different vector
>> registers, I think it's quite natural that we just introduced the new
>> vector register type vecA, to represent what we need for corresponding
>> hardware vector register. Please note that in vector length agnostic
>> ISA, like Arm SVE and RISC-V vector extension [1], the vector registers
>> are architecturally the same type of register despite the different
>> hardware implementations.
>
> Yes, I also see this as quite natural. Ningsheng's change extends the
> implementation in the architecture-specific direction that is needed for
> AArch64's vector model. The fact that this differs from x86_64 is not
> unexpected.
And still, C2 can model them in a similar way. Moreover, the recent
changes on x86 I described bring x86 very close to SVE. (I elaborated on
that in the previous response to Ningsheng.)
>>> If I were to start from scratch, I would choose between 3 options:
>>>
>>> #1: reuse existing VecX/VecY/VecZ ideal registers and limit supported
>>> vector sizes to 128-/256-/512-bit values.
>>>
>>> #2: lift limitation on max size (to 1024/2048 bits), but ignore
>>> non-power-of-2 sizes;
>>>
>>> #3: introduce support for full range of vector register sizes
>>> (128-/.../2048-bit with 128-bit step);
>>>
>>> I see 2 (mostly unrelated) limitations: maximum vector size and
>>> non-power-of-2 sizes.
>
> Yes, but this patch deals with both of those and I cannot see it causing
> any problems for x86_64, nor do I see it adding any great complexity. The
> extra shared paths deal with scalable vectors which only occur on AArch64.
> A scalable VecA register (and also eventually the scalable predicate
> register) caters for all possible vector sizes via a single 'logical'
> vector of size 8 slots (also eventually a single 'logical' predicate
> register of size 1 slot). Catering for scalable registers in shared code
> is localized and does not change handling of the existing, non-scalable
> VecX/Y/Z registers.
Code needed for vector support in C2 has been growing in size over the
years and now it comprises a noticeable part of the compiler. And it got
there through relatively small incremental and localized changes.
I agree that the proposed solution demonstrates a very clever way to
overcome some of the limitations imposed by the existing implementation.
But it is still a workaround which only emphasizes the architectural
limitations. And it's not specific to AArch64 with SVE: x86 stretches it
hard as well (though in a slightly different direction), which FTR forced
the recent migration to "generic vectors".
So, instead of proceeding with incremental changes and accumulating
complexity (and technical debt along the way), I suggest looking into
reworking vector support and making it relevant to modern hardware
(both x86 and AArch64).
>>> My understanding is that you don't try to accurately represent SVE for
>>> now, but lay some foundations for future work: you give up on
>>> non-power-of-2 sized vectors, but still enable support for arbitrarily
>>> sized vectors (addressing both limitations on maximum size and size
>>> granularity) in RA (and it affects only spills). So, it is somewhere
>>> between #2 and #3.
>
> I have to disagree with your statement that this proposal doesn't
> 'accurately' represent SVE. Yes, the vector mask for this arbitrary-size
> vector is modelled 'logically' using a nominal 8 slots. However, that is
> merely to avoid wasting bits in the bit masks plus cpu time processing
> them. The 'physical' vector length models the actual number of slots,
> and includes the option to model a non-power of two. That 'physical'
> size is used in all operations that manipulate VecA register contents.
> So, although I grant that the code is /parameterized/, it is also 100%
> accurate.
My point is: the proposed solution makes a number of simplifying
assumptions which make it much easier to support SVE (e.g., VecA
represents a full-width vector, which completely ignores the implicit
predication provided by the ISA).
>>> The ultimate goal is definitely #3, but how much more work will be
>>> required to teach the JVM about non-power-of-2 vectors? As I see in the
>>> patch, you don't have auto-vectorizer support yet, but Vector API will
>>> provide access to whatever size hardware exposes. What do you expect on
>>> hardware front in the near/mid-term future? Anything supporting vectors
>>> larger than 512-bit? What about 384-bit vectors?
>
> Do we need to know for sure such hardware is going to arrive in order to
> allow for it now? If there were a significant cost to doing so I'd maybe
> say yes but I don't really see one here. Most importantly, the changes
> to the AArch64 register model and small changes to the shared
> chaitin/reg mask code proposed here already work with the
> auto-vectorizer if the VecA slots are any of the possible powers of 2
> VecA sizes.
>
> The extra work needed to profit from non-power-of-two vectors involves
> upgrading the auto-vectorizer code. While this may be tricky, I don't see
> it as impossible. However, more importantly, even if such an upgrade
> cannot be achieved then this proposal is still a very simple way to
> allow for arbitrarily scalable SVE vectors that are a power of two size.
> It also allows any architecture with a non-power of two to work with the
> lowest power of two that fits. So, this is a very simple way to cater for
> what may turn up.
If it makes options #1/#2 viable, then there's no need to change shared
code at all. Choosing between no code changes and low risk / small code
changes which won't be used in practice, I'm strongly in favor of the
former.
>>> For larger vectors #2 (or a mix of #1 and #2) may be a good fit. My
>>> understanding that existing RA machinery should support 1024-bit vectors
>>> well. So, unless 2048-bit vectors are needed, we could live with the
>>> framework we have right now.
>
> I'm not sure what you are proposing here but it sounds like introducing
> extra vectors beyond VecX, VecY for larger powers of two i.e. VecZ,
> vecZZ, VecZZZ ... and providing separate case processing for each of
> them where the relevant case is selected conditional on the actual
> vector size. Is that what you are proposing? I can't see any virtue in
> multiplying case handling for each new power-of-two size that turns up
> when all possible VecZ* power-of-two options can actually be handled as
> one uniform case.
Option #1 doesn't require anything more than Vec[SDXYZ].
Option #2 assumes one more operand & ideal register for 1024-bit
vectors. As Ningsheng pointed out, without introducing length-agnostic
vectors, supporting 2048-bit vectors requires changes in RegMask to
accommodate values spanning 64 slots.
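For reference, the slot arithmetic behind that (one RegMask slot covers
32 bits):

  // 128-bit  (VecX) :  128 / 32 =  4 slots
  // 512-bit  (VecZ) :  512 / 32 = 16 slots
  // 1024-bit        : 1024 / 32 = 32 slots
  // 2048-bit        : 2048 / 32 = 64 slots  <- one value spanning 64 mask bits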
>>> Giving up on #3 for now and starting with less ambitious goals (#1 or
>>> #2) would reduce pressure on RA and give more time for additional
>>> experiments to come with a better and more universal
>>> support/representation of generic/size-agnostic vectors. And, in a
>>> longer term, help reducing complexity and technical debt in the area.
>
> Can you explain what you mean by 'reduce pressure on RA'? I'm also
> unclear as to what you see as complex about this proposal.
IMO vector support already introduces significant complexity in C2.
Adding platform-specific features will only increase it. So, I'm in
favor of reworking the support rather than applying band-aids to relax
some of its inherent limitations.
>>> Some more comments follow inline.
>>>
>>>>> Compared to x86 w/ AVX512, architectural state for vector registers is
>>>>> 4x larger in the worst case (ignoring predicate registers for now).
>>>>> Here are the relevant constants on x86:
>>>>>
>>>>> gensrc/adfiles/adGlobals_x86.hpp:
>>>>>
>>>>> // the number of reserved registers + machine registers.
>>>>> #define REG_COUNT 545
>>>>> ...
>>>>> // Size of register-mask in ints
>>>>> #define RM_SIZE 22
>>>>>
>>>>> My estimate is that for AArch64 with SVE support the constants will be:
>>>>>
>>>>> REG_COUNT < 2500
>>>>> RM_SIZE < 100
>>>>>
>>>>> which don't look too bad.
>
> I'm not sure what these numbers are meant to mean. The number of SVE
> vector registers is the same as the number of NEON vector registers i.e.
> 32. The register mask size for VecA registers is 8 * 32 bits.
I attempted to estimate the sizes of the relevant structures if VecA
were modelled the same way as VecX et al.
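Roughly, the back-of-the-envelope version of that estimate, assuming
2048-bit registers and 32-bit slots:

  // 32 vector registers * (2048 / 32) slots          = 2048 slots
  // + general-purpose, predicate, flag, special regs => REG_COUNT < 2500
  // register mask in 32-bit ints: ~2500 / 32 ~= 79   => RM_SIZE < 100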
>>>> Right, but given that most real hardware implementations will be no
>>>> larger than 512 bits, I think. Having a large bitmask array, with most
>>>> bits useless, will be less efficient for regmask computation.
>>>
>>> Does it make sense to limit the maximum supported size to 512-bit then
>>> (at least, initially)? In that case, the overhead won't be worse than
>>> it is on x86 now.
>
> Well, no. It doesn't make sense when all you need is a 'logical' 8 * 32
> bit mask whatever the actual 'physical' register size is.
I asked that question in a different context trying to get a sense of
other simplifying assumptions which could be made in the initial
implementation.
But then you should definitely prefer a 1-slot design for vector
registers ;-)
Best regards,
Vladimir Ivanov