[aarch64-port-dev ] RFR(L): 8231441: AArch64: Initial SVE backend support

Tue Aug 25 12:12:38 UTC 2020

> I can understand that a total solution for different archs and vector 
> sizes is preferable. Do you have any initial idea how to achieve that?

I have only ideas right now (unfortunately) :-)

So far, my observations from working on refactoring vector support on 
x86 with Intel folks are the following:

   (1) full-width register representation is good enough;

Though on x86 all vector registers are accurately modeled (register 
masks properly track sizes and aliasing), it turns out that what matters 
in practice is aliasing.

So, it's enough to use a single "virtual" slot to model XMM, YMM, and 
ZMM registers all at once unless RA supports packing multiple smaller 
vector values into a single register (separately managing lower and 
upper parts of the register; e.g., YMM = XMM(hi):XMM(lo)). Though RA 
currently does support it, there is no code that utilizes it and there 
are no plans to add any in the future.

I believe the situation on AArch64 with NEON and SVE is similar. (And 
scalable vectors make it harder to support packing in RA.)

   (2) vector width matters only for spills/refills and reg2reg moves.

Matcher does type capturing, so all vector mach nodes keep the precise 
type of the value they produce. On x86 it is heavily used later in the 
code emission phase, but RA still relies on ideal registers (Op_VecX et 
al). I don't see why RA can't be migrated from ideal registers to types 
(TypeVect) to determine vector size when performing spilling.

From the aforementioned observations, I conclude there should be a way 
to declare a single ideal vector register (Op_Vec) which represents the 
full-width vector supported by the hardware and use captured vector 
types (TypeVect instances) to guide RA and code generation. And that's 
the state I'd like to see vector support in C2 move to.
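
A minimal sketch of that idea, with hypothetical names throughout 
(VectType is only a stand-in for a captured TypeVect; none of this is 
actual HotSpot code). The spill size is derived from the value's type 
rather than from a per-width ideal register:

```cpp
#include <cstddef>

// Hypothetical stand-in for a captured vector type (TypeVect in C2):
// element size in bytes and number of lanes.
struct VectType {
    std::size_t elem_bytes;
    std::size_t lanes;
};

// Bytes RA would have to spill/refill for a value of this type,
// independent of how wide the underlying hardware register is.
constexpr std::size_t spill_bytes(VectType t) {
    return t.elem_bytes * t.lanes;
}

// A single full-width register class can hold any value that fits
// into the hardware vector width (given here in bytes).
constexpr bool fits_in_register(VectType t, std::size_t hw_vector_bytes) {
    return spill_bytes(t) <= hw_vector_bytes;
}
```

With this shape, a 4 x 32-bit value occupying a 256-bit register would 
still spill only 16 bytes, which is the point of observation (2).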

Regarding predicate registers, I haven't thought too much about them, so 
I don't have a strong opinion about whether they should be a separate 
entity (Op_RegVMask in your patch) or just treated as a vector of bits 
(Op_Vec).

>> So far, I see 2 main directions for RA work:
>>
>>    (a) support vectors of arbitrary size:
>>      (1) helps push the upper limit on the size (1024-bit)
>>      (2) handle non-power-of-2 sizes
>>
>>    (b) optimize RA implementation for large values
>>
>> Anything else?
>>
> 
> Yes, and it's not just vectors. The SVE predicate register has a 
> scalable size (vector_size/8) as well. We also support predicate 
> register allocation well with the proposed approach (not in this patch).

Though predicate register support was left aside when AVX512 support 
was added, I agree that predicate registers should be taken into account 
from the very beginning. (And glad to hear you are already working on 
supporting them!)

Also, I believe options #1/#2 may be extended to cover predicate 
registers as well without too much effort.
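
For reference, the size relation quoted above (predicate size = 
vector_size/8, i.e., one predicate bit per vector byte) can be written 
down as a small sketch. The function names are made up for illustration:

```cpp
#include <cstddef>

// One predicate bit per byte of vector data,
// so predicate bits = vector bits / 8.
constexpr std::size_t predicate_bits(std::size_t vector_bits) {
    return vector_bits / 8;
}

// Same quantity in bytes, e.g. for sizing a spill area.
constexpr std::size_t predicate_bytes(std::size_t vector_bits) {
    return predicate_bits(vector_bits) / 8;
}
```

So a 128-bit vector pairs with a 16-bit predicate and a 2048-bit vector 
with a 256-bit one, which means predicate sizes scale along with vector 
sizes.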

>> Speaking of (a), in particular, I don't see why possible solution for 
>> it should not supersede vecX et al altogether.
>>
>> Also, I may be wrong, but I don't see clear evidence that there's a 
>> pressing need to have all of that fixed right from the beginning. 
>> (That's why I put #1 and #2 options on the table.) Starting with #1/#2 
>> would untie initial SVE support from the exploratory work needed to 
>> choose the most appropriate solution for (a) and (b).
>>
> 
> Starting from partial SVE register support might be acceptable for 
> the initial patch (Andrew may not agree :-)), but I think we may end up 
> with more follow-up work, given that our proposed approach already 
> supports SVE well in terms of (a) and (b). If there's no other solution, 
> would it be possible to use the currently proposed method? It's not 
> difficult to back out our changes in the register allocation part if we 
> find a better solution to support arbitrary vector/predicate sizes in 
> the future, as the patch there is actually not big IMO.

Unfortunately, temporary solutions usually end up as permanent ones 
since there's much less motivation to replace them (and harder to 
justify the effort) after initial pressure is relieved.

I'm OK with the proposed patch if we agree it's a stopgap/temporary 
solution to the immediate problems you face with initial SVE support and 
you are ready to commit resources to replacing it.

That's why I think it's the right time to discuss general direction, 
work on a plan, and use it to guide the coordinated effort to improve 
vector support in C2.

Also, considering it a stopgap solution means we should strive for the 
simplest solution, and that's another reason I put the #1/#2 options on 
the table to consider.

[...]

>> Do you envision any new problems or limitations when spilling a 
>> large number of huge vectors (2048-bit) to the stack?
>>
> 
> I haven't seen any so far.

Ok, good to know.

I was curious whether stack representation should also move away from 
32-bit slots to a more compact representation.
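
To put numbers on that, here is a rough sketch of how many 32-bit slots 
a value of a given size occupies. The names (kSlotBits, stack_slots) are 
hypothetical, not HotSpot identifiers:

```cpp
#include <cstddef>

// Width of one stack slot in bits, as in the 32-bit slot scheme
// mentioned above.
constexpr std::size_t kSlotBits = 32;

// Number of slots needed for a value of the given size, rounded up
// to a whole slot.
constexpr std::size_t stack_slots(std::size_t value_bits) {
    return (value_bits + kSlotBits - 1) / kSlotBits;
}
```

Under this scheme a 128-bit vector takes 4 slots, while a single 
2048-bit vector takes 64.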

Best regards,
Vladimir Ivanov