[aarch64-port-dev ] RFR(L): 8231441: AArch64: Initial SVE backend support

Ningsheng Jian ningsheng.jian at arm.com
Tue Aug 25 10:07:30 UTC 2020


Hi Vladimir,

On 8/24/20 8:03 PM, Vladimir Ivanov wrote:
> Hi Ningsheng,
> 
>>> What I see in the patch is that you try to attack the problem from the
>>> opposite side: you introduce a new concept of a size-agnostic vector
>>> register on the RA side and then use it directly during matching: vecA
>>> is used in aarch64_sve.ad while aarch64.ad relies on vecD/vecX.
>>>
>>> Unfortunately, it extends the implementation in an orthogonal direction
>>> which looks too aarch64-specific to benefit other architectures, x86 in
>>> particular. I believe there's an alternative approach which can benefit
>>> both aarch64 and x86, but it requires more experimentation.
>>>
>>
>> Since vecA and vecX (and the others) are architecturally different
>> vector registers, I think it's quite natural that we introduced the new
>> vector register type vecA to represent what we need for the
>> corresponding hardware vector registers. Please note that in a
>> vector-length-agnostic ISA, like Arm SVE and the RISC-V vector
>> extension [1], the vector registers are architecturally the same type
>> of register despite the different hardware implementations.
> 
> FTR vecX et al don't represent hardware registers, they represent vector 
> values of predefined size. (For example, vecS, vecD, and vecX map to the 
> very same set of 128-bit vector registers on x86.)
> 
> My point is: in terms of existing concepts what you are adding is not 
> "yet another flavor of vector". It's a new full-fledged concept (which 
> is manifested as special cases across the JVM) and you end up with 2 
> different representations of vectors.
> 
> I agree that the hardware is quite different, but I don't see that it 
> makes much of a difference in the context of the JVM, and the 
> abstractions used to hide it are similar.
> 
> For example, as of now, most of the x86-specific code in C2 works just 
> fine with full-width hardware vectors which are oblivious of their sizes 
> until RA kicks in. And the SVE patch you propose completely omits the 
> implicit predication the hardware provides, which makes it similar to 
> AVX512 (modulo the wider range of vector sizes supported).
> 
> So, even though the hardware abstractions being used aren't actually 
> *that* different, vecA piles on complexity and introduces a separate way 
> to achieve similar results (but slightly differently). And that's what 
> bothers me. I'd like to see more unification instead, which should 
> reduce complexity and provide an opportunity to address long-standing 
> technical debt (and the 5 flavors of ideal registers for vectors are 
> part of it IMO).
> 

I can understand that a unified solution covering different architectures 
and vector sizes is preferable. Do you have any initial ideas on how to 
achieve that?
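
For context, the five "flavors" in question map a vector's size to an 
ideal register opcode, roughly as in this simplified sketch (modeled on 
Matcher::vector_ideal_reg in spirit; the names and code here are 
illustrative, not taken from the actual sources):

  // Simplified sketch (not the actual HotSpot code): a vector's size
  // in bytes selects one of five ideal vector register flavors.
  enum IdealVecReg { VecBad = 0, VecS, VecD, VecX, VecY, VecZ };

  static IdealVecReg ideal_reg_for_vector(int size_in_bytes) {
    switch (size_in_bytes) {
      case  4: return VecS;   //  32-bit vector value
      case  8: return VecD;   //  64-bit vector value
      case 16: return VecX;   // 128-bit vector value
      case 32: return VecY;   // 256-bit vector value
      case 64: return VecZ;   // 512-bit vector value
      default: return VecBad; // unsupported size
    }
  }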

> So far, I see 2 main directions for RA work:
> 
>    (a) support vectors of arbitrary size:
>      (1) helps push the upper limit on the size (1024-bit)
>      (2) handle non-power-of-2 sizes
> 
>    (b) optimize RA implementation for large values
> 
> Anything else?
> 

Yes, and it's not just vectors. The SVE predicate registers have a 
scalable size (vector_size/8) as well. The proposed approach also 
supports register allocation for predicate registers well (though that 
part is not in this patch).
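
To make the scaling concrete, here is an illustrative sketch (not code 
from the patch): an SVE predicate register carries one bit per vector 
byte, so its size follows the hardware vector length.

  // Illustrative only: SVE predicate registers hold one bit per
  // vector byte, so their size scales as vector_size / 8.
  static int predicate_size_in_bytes(int vector_size_in_bytes) {
    return vector_size_in_bytes / 8;  // e.g. 32-byte (256-bit) vector -> 4-byte predicate
  }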

> Speaking of (a), in particular, I don't see why a possible solution for 
> it should not supersede vecX et al altogether.
> 
> Also, I may be wrong, but I don't see clear evidence that there's a 
> pressing need to have all of that fixed right from the beginning. 
> (That's why I put the #1 and #2 options on the table.) Starting with 
> #1/#2 would untie initial SVE support from the exploratory work needed 
> to choose the most appropriate solution for (a) and (b).
> 

Starting from partial SVE register support might be acceptable for an 
initial patch (Andrew may not agree :-)), but I think we would end up 
with more follow-up work, given that the proposed approach already 
supports SVE well in terms of both (a) and (b). If there's no other 
solution, would it be possible to go with the currently proposed method? 
The register allocation part of the patch is actually not big IMO, so it 
would not be difficult to back out those changes if we find a better 
solution for arbitrary vector/predicate sizes in the future.
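
To sketch the shape of those register allocation changes (names 
simplified from the patch, assuming one register slot is 32 bits): the 
register mask reserves a fixed footprint per scalable register, while 
spill code sizes stack slots from the real hardware vector length 
detected at startup.

  // Rough sketch, names simplified from the patch:
  const int SlotsPerVecA = 8;  // fixed mask footprint: 8 * 32 bits = 256 bits

  // Stack slots actually used by a spill follow the real hardware
  // vector length (a multiple of one 32-bit slot).
  static int scalable_reg_slots(int hw_vector_length_in_bytes) {
    return hw_vector_length_in_bytes / 4;  // e.g. 48-byte (384-bit) vector -> 12 slots
  }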

>>> If I were to start from scratch, I would choose between 3 options:
>>>
>>>     #1: reuse existing VecX/VecY/VecZ ideal registers and limit 
>>> supported
>>> vector sizes to 128-/256-/512-bit values.
>>>
>>>     #2: lift limitation on max size (to 1024/2048 bits), but ignore
>>> non-power-of-2 sizes;
>>>
>>>     #3: introduce support for full range of vector register sizes
>>> (128-/.../2048-bit with 128-bit step);
>>>
>>> I see 2 (mostly unrelated) limitations: maximum vector size and
>>> non-power-of-2 sizes.
>>>
>>> My understanding is that you don't try to accurately represent SVE for
>>> now, but lay some foundations for future work: you give up on
>>> non-power-of-2 sized vectors, but still enable support for arbitrarily
>>> sized vectors (addressing both limitations on maximum size and size
>>> granularity) in RA (and it affects only spills). So, it is somewhere
>>> between #2 and #3.
>>>
>>> The ultimate goal is definitely #3, but how much more work will be
>>> required to teach the JVM about non-power-of-2 vectors? As I see in the
>>> patch, you don't have auto-vectorizer support yet, but Vector API will
>>> provide access to whatever size hardware exposes. What do you expect on
>>> hardware front in the near/mid-term future? Anything supporting vectors
>>> larger than 512-bit? What about 384-bit vectors?
>>>
>>
>> I think our patch is already at #3. :-) We do not give up on
>> non-power-of-2 sized vectors; instead, we support them well in this
>> patch, while still using the current regmask framework. (Actually, I
>> think the only limitation on the vector size is that it must be a
>> multiple of 32 bits - the number of bits per register slot.)
> 
>> I am not sure about other Arm partners' hardware implementations in 
>> the mid-term future, as a CPU implementer is free to choose any maximum 
>> vector size as long as it follows the SVE architecture specification. 
>> But we did test the patch with the Vector API on different 
>> SVE-supported vector sizes on an emulator, e.g. 384, 768, 1024, 2048 
>> etc. The register allocator, including spill/unspill, works well for 
>> those different sizes with the Vector API. (Thanks to your great work 
>> on the Vector API. :-))
>>
>> We currently limit the vector size to a power of 2 in 
>> vm_version_aarch64.cpp, as suggested by Andrew Dinn, because the 
>> current SLP vectorizer only supports power-of-2 vectors. With the 
>> Vector API in, I think that restriction can be removed. We are also 
>> working on a new vectorizer to support predication/masks, which should 
>> not have the power-of-2 limitation.
> 
> [...]
> 
>> Yes, we can make the JVM support a portion of the vector registers, at 
>> least for SVE. My concern is that the performance wouldn't be as good 
>> as using the full available vector width.
> 
> To be clear: I called it "somewhere between #2 and #3" solely because 
> the auto-vectorizer bails out on non-power-of-2 sizes. And even though 
> the Vector API will handle such cases just fine, IMO having 
> auto-vectorizer support is required before calling #3 complete.
> 
> In that respect, choosing a smaller vector size the auto-vectorizer 
> supports is preferable to picking the full-width vectors and turning 
> off the auto-vectorizer (even though the Vector API will support them).
> 
> It can be turned into heuristic (by default, pick only power-of-2 sizes; 
> let users explicitly specify non-power-of-2 sizes), but speaking of 
> priorities, IMO auto-vectorizer support is more important.
> 

I agree that auto-vectorizer support is more important, and we are 
working on that.
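
For instance, the heuristic could look roughly like the following sketch 
(purely illustrative, not code from the patch): default to the largest 
power-of-two width not exceeding the hardware vector length, while the 
Vector API can still see the full width.

  // Purely illustrative: default vector width for the auto-vectorizer
  // is the largest power of two <= the hardware vector length.
  static int default_vector_length(int hw_vector_length_in_bytes) {
    int len = 1;
    while ((len << 1) <= hw_vector_length_in_bytes) {
      len <<= 1;
    }
    return len;  // e.g. 48 bytes (384-bit) -> 32 bytes (256-bit)
  }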

>>> Giving up on #3 for now and starting with less ambitious goals (#1 or
>>> #2) would reduce pressure on RA and give more time for additional
>>> experiments to come up with better and more universal
>>> support/representation of generic/size-agnostic vectors. And, in the
>>> longer term, it would help reduce complexity and technical debt in the
>>> area.
>>>
>>> Some more comments follow inline.
>>>
>>>>> Compared to x86 w/ AVX512, architectural state for vector registers is
>>>>> 4x larger in the worst case (ignoring predicate registers for now).
>>>>> Here are the relevant constants on x86:
>>>>>
>>>>> gensrc/adfiles/adGlobals_x86.hpp:
>>>>>
>>>>> // the number of reserved registers + machine registers.
>>>>> #define REG_COUNT    545
>>>>> ...
>>>>> // Size of register-mask in ints
>>>>> #define RM_SIZE 22
>>>>>
>>>>> My estimate is that for AArch64 with SVE support the constants will 
>>>>> be:
>>>>>
>>>>>     REG_COUNT < 2500
>>>>>     RM_SIZE < 100
>>>>>
>>>>> which don't look too bad.
>>>>>
>>>>
>>>> Right, but given that most real hardware implementations will be no
>>>> larger than 512 bits, I think having a large bitmask array, with most
>>>> bits unused, will be less efficient for regmask computation.
>>>
>>> Does it make sense to limit the maximum supported size to 512-bit then
>>> (at least, initially)? In that case, the overhead won't be worse than
>>> it is on x86 now.
>>>
>>
>> Technically, this may be possible, though I haven't tried it. My 
>> concerns are:
>>
>> 1) Larger regmask arrays would be less efficient (we only use 256 bits 
>> - 8 slots - for SVE in this patch), though it won't be worse than x86.
>>
>> 2) Given that the current patch already supports larger sizes and 
>> non-power-of-2 sizes well with a relatively small diff, if we want to 
>> support other sizes soon, there may be some more work to roll back the 
>> ad file changes.
>>
>>>>> Also, I don't see any changes related to stack management. So, I
>>>>> assume it continues to be managed in slots. Any problems there? As I
>>>>> understand, wide SVE registers are caller-save, so there may be many
>>>>> spills of huge vectors around a call. (Probably, not possible with C2
>>>>> auto-vectorizer as it is now, but Vector API will expose it.)
>>>>>
>>>>
>>>> Yes, the stack is still managed in slots, but it will be allocated
>>>> with the real vector register length instead of 'virtual' slots for
>>>> VecA. See the usages of scalable_reg_slots(), e.g. in
>>>> chaitin.cpp:1587. We have also applied the patch to the Vector API,
>>>> and did see a lot of vector spills, with the expected correct results.
>>>
>>> I'm curious whether similar problems may arise for spills. Considering
>>> wide vector registers are caller-saved, it's possible to have lots of
>>> 256-byte values to end up on stack (especially, with Vector API). Any
>>> concerns with that?
>>>
>>
>> No, we don't need such big (256-byte) slots for a smaller vector 
>> register. The spill slots are the same size as the real vector length, 
>> e.g. 48 bytes for a 384-bit vector. For alignment, we currently use 
>> SlotsPerVecA (8 slots, i.e. 32 bytes or 256 bits), and skipped slots 
>> can still be allocated to other arguments; that is still smaller than 
>> the AVX512 alignment (64 bytes, 512 bits). We can tweak the patch to 
>> use a smaller value if we think the alignment is too large. (Yes, we 
>> should always try to avoid spills of wide vectors, especially with the 
>> Vector API, to avoid performance pitfalls.)
> 
> Thanks for the clarifications.
> 
> Do you envision any new problems or limitations being hit when spilling 
> a large number of huge vectors (2048-bit) on the stack?
> 

I haven't seen any so far.
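
The arithmetic for that worst case is straightforward (illustrative 
only):

  // Illustrative arithmetic: one 32-bit slot holds 4 bytes, so a
  // 2048-bit (256-byte) vector spill occupies 64 slots; n live
  // vectors across a call need n * 256 bytes of stack.
  static int spill_slots_for_vector(int vector_bits) {
    return (vector_bits / 8) / 4;  // 2048 bits -> 256 bytes -> 64 slots
  }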

> Best regards,
> Vladimir Ivanov
> 
>>>>> Have you noticed any performance problems? If that's the case, then
>>>>> AVX512 support on x86 would benefit from similar optimization as well.
>>>>>
>>>>
>>>> Do you mean register allocation performance problems? I did not
>>>> notice any before. Do you have any suggestions on how to measure that?
>>>
>>> I'd try to run some applications/benchmarks with -XX:+CITime to get a
>>> sense how much RA may be affected.
>>>
>>
>> Thanks! I will give it a try.
>>
>> [1] https://github.com/riscv/riscv-v-spec/releases/tag/0.9
>>

Thanks,
Ningsheng

