[aarch64-port-dev ] RFR(L): 8231441: AArch64: Initial SVE backend support

Wed Aug 26 09:31:41 UTC 2020

Hi Vladimir,

On 8/25/20 8:12 PM, Vladimir Ivanov wrote:
> 
[...]
> 
> So, it's enough to use a single "virtual" slot to model XMM, YMM, and 
> ZMM registers all at once unless RA supports packing multiple smaller 
> vector values into a single register (separately managing lower and 
> upper parts of the register; e.g., YMM = XMM(hi):XMM(lo) ). Though 
> currently RA does support it, there is no code which utilizes that and 
> no plans to do that in the future.
> 
> I believe the situation on AArch64 with NEON and SVE is similar. (And 
> scalable vectors make it harder to support packing in RA.)
> 

Right.

>    (2) vector width matters only for spills/refills and reg2reg moves.
> 
> Matcher does type capturing, so all vector mach nodes keep precise type 
> of the value they produce. On x86 it is heavily used later in code 
> emission phase, but RA still relies on ideal registers (Op_VecX et al). 
> I don't see why RA can't be migrated from ideal registers to types 
> (TypeVect) to determine vector size when performing spilling.
> 
> From the aforementioned observations, I conclude there should be a way to 
> declare a single ideal vector register (Op_Vec) which represents 
> full-width vector supported by the hardware and use captured vector 
> types (TypeVect instances) to guide RA and code generation. And that's 
> the state where I'd like to see vector support in C2 be moving to.
> 

That may be true. I think we can move forward step-by-step for easy 
maintenance.
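
To make the idea above concrete, here is a minimal sketch of deriving the
spill size from a captured vector type instead of a fixed-width ideal
register. All names in it (IdealVecReg, VectType, bytes_from_ideal_reg,
bytes_from_type) are hypothetical stand-ins for illustration and are not the
real HotSpot declarations. It only shows that the type-based size agrees with
the fixed-width one where both exist, and still works for widths that have no
dedicated ideal register.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for illustration only; these are NOT the real
// HotSpot declarations of Op_VecX et al. or TypeVect.
enum class IdealVecReg { VecS, VecD, VecX, VecY, VecZ };

struct VectType {
  uint32_t elem_bytes;  // size of one lane in bytes
  uint32_t lanes;       // number of lanes
};

// Size implied by a fixed-width ideal register (32 to 512 bits).
inline uint32_t bytes_from_ideal_reg(IdealVecReg r) {
  switch (r) {
    case IdealVecReg::VecS: return 4;
    case IdealVecReg::VecD: return 8;
    case IdealVecReg::VecX: return 16;
    case IdealVecReg::VecY: return 32;
    case IdealVecReg::VecZ: return 64;
  }
  return 0;  // not reached for valid enumerators
}

// Size derived from the captured type; no per-width register kind needed.
constexpr uint32_t bytes_from_type(VectType t) {
  return t.elem_bytes * t.lanes;
}

inline void check_size_examples() {
  // 8 lanes of 4-byte ints: same answer as a 256-bit ideal register.
  assert(bytes_from_type(VectType{4, 8}) ==
         bytes_from_ideal_reg(IdealVecReg::VecY));
  // 256 byte lanes (2048 bits): no fixed-width ideal register covers this.
  assert(bytes_from_type(VectType{1, 256}) == 256);
}
```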

> Regarding predicate registers, I haven't thought too much about them, so 
> I don't have a strong opinion about whether they should be a separate 
> entity (Op_RegVMask in your patch) or just treated as a vector of bits 
> (Op_Vec).
> 
>>> So far, I see 2 main directions for RA work:
>>>
>>>    (a) support vectors of arbitrary size:
>>>      (1) helps push the upper limit on the size (1024-bit)
>>>      (2) handle non-power-of-2 sizes
>>>
>>>    (b) optimize RA implementation for large values
>>>
>>> Anything else?
>>>
>>
>> Yes, and it's not just vectors. SVE predicate registers have a scalable 
>> size (vector_size/8) as well. We also have predicate register 
>> allocation working well with the proposed approach (not in this patch).
> 
> Though with AVX512 support predicate register support was left aside, I 
> agree that predicate registers should be taken into account from the 
> very beginning. (And glad to hear you are already working on supporting 
> them!)
>

As that's one of the main features of SVE, we have to do that. :-) With 
initial SVE support in, our further work on that could be easier.
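
As a small illustration of the vector_size/8 relation mentioned above, here
is a sketch that computes the predicate size for a given SVE vector length.
The function names are made up for this example. The assumed valid vector
lengths are multiples of 128 bits from 128 up to 2048.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper names, for illustration only.

// SVE vector lengths are assumed to be multiples of 128 bits, 128..2048.
constexpr bool is_valid_sve_vector_bits(uint32_t bits) {
  return bits >= 128 && bits <= 2048 && bits % 128 == 0;
}

// One predicate bit per vector byte, i.e. vector_size / 8.
constexpr uint32_t predicate_bits(uint32_t vector_bits) {
  return vector_bits / 8;
}

inline void check_predicate_examples() {
  assert(is_valid_sve_vector_bits(512));
  assert(!is_valid_sve_vector_bits(100));
  assert(predicate_bits(128) == 16);    // smallest SVE vector
  assert(predicate_bits(2048) == 256);  // largest SVE vector
}
```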

> Also, I believe options #1/#2 may be extended to cover predicate 
> registers as well without too much effort.
> 
>>> Speaking of (a), in particular, I don't see why a possible solution for 
>>> it should not supersede vecX et al altogether.
>>>
>>> Also, I may be wrong, but I don't see clear evidence there's a 
>>> pressing need to have all of that fixed right from the beginning. 
>>> (That's why I put #1 and #2 options on the table.) Starting with 
>>> #1/#2 would untie initial SVE support from the exploratory work 
>>> needed to choose the most appropriate solution for (a) and (b).
>>>
>>
>> Starting from partial SVE register support might be acceptable for 
>> the initial patch (Andrew may not agree :-)), but I think we may end up 
>> with more follow-up work, given that our proposed approach already 
>> supports SVE well in terms of (a) and (b). If there's no other 
>> solution, would it be possible to use the currently proposed method? It's 
>> not difficult to back out our changes in the register allocation part if 
>> we find a better solution to support arbitrary vector/predicate 
>> sizes in the future, as the patch there is actually not big IMO.
> 
> Unfortunately, temporary solutions usually end up as permanent ones 
> since there's much less motivation to replace them (and harder to 
> justify the effort) after initial pressure is relieved.
> 
> I'm OK with the proposed patch if we agree it's a stopgap/temporary 
> solution to the immediate problems you face with initial SVE support and 
> are ready to commit resources into replacing it.
> 

Yes, we will continue to maintain and improve it. Our idea might be 
Arm-biased :), so we will need collaboration and suggestions from the 
community.

> That's why I think it's the right time to discuss general direction, 
> work on a plan, and use it to guide the coordinated effort to improve 
> vector support in C2.
> 
> Also, considering it a stopgap solution means we should strive for 
> the simplest solution and that's another reason I put #1/#2 options on 
> the table to consider.
> [...]
> 
>>> Any new problems/hitting some limitations envisioned when spilling 
>>> large number of huge vectors (2048-bit) on stack?
>>>
>>
>> I haven't seen any so far.
> 
> Ok, good to know.
> 
> I was curious whether stack representation should also move away from 
> 32-bit slots to a more compact representation.
> 

I think that's possible, if we could also have the alignment handled.
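
For reference, here is a rough sketch of the arithmetic involved with the
current 32-bit slot granularity. The helper names and the power-of-two
alignment rule are assumptions made for this example, not what the patch
does. It shows that a 2048-bit vector takes 64 slots, while a 16-bit
predicate (for a 128-bit vector) still rounds up to one full slot.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers, for illustration only.
constexpr uint32_t kSlotBits = 32;  // one stack slot is 32 bits

// Number of 32-bit slots needed to hold a value of the given bit width.
constexpr uint32_t slots_for_bits(uint32_t bits) {
  return (bits + kSlotBits - 1) / kSlotBits;
}

// Round a byte offset up to a power-of-two alignment (assumed policy).
constexpr uint32_t align_up(uint32_t offset, uint32_t alignment) {
  return (offset + alignment - 1) & ~(alignment - 1);
}

inline void check_slot_examples() {
  assert(slots_for_bits(2048) == 64);  // largest SVE vector
  assert(slots_for_bits(16) == 1);     // small predicate wastes half a slot
  assert(align_up(20, 16) == 32);      // next 16-byte boundary
  assert(align_up(32, 16) == 32);      // already aligned
}
```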

Thanks,
Ningsheng