[aarch64-port-dev ] RFR(L): 8231441: AArch64: Initial SVE backend support

Mon Aug 24 09:16:07 UTC 2020

Hi Vladimir,

Thanks for your valuable input!

On 8/22/20 6:34 AM, Vladimir Ivanov wrote:
> Thanks for clarifications, Ningsheng.
> 
> Let me share my thoughts on the topic and I'll start with summarizing
> the experience of migrating x86 code to generic vectors.
> 
> JVM has quite a bit of special logic to support vectors. It hasn't
> exhausted the complexity budget yet, but it's quite close to the limit
> (as you probably noticed). While extending x86 backend to support Vector
> API, we pushed it over the limit and had to address some of the issues.
> 
> The ultimate goal was to move to vectors which represent full-width
> hardware registers. After we were convinced that it will work well in AD
> files, we encountered some inefficiencies with vector spills: depending
> on actual hardware, smaller (than available) vectors may be used (e.g.,
> integer computations on AVX-capable CPU). So, we stopped half-way and
> left post-matching part intact: depending on actual vector value width,
> appropriate operand (vecX/vecY/vecZ + legacy variants) is chosen.
> 
> (I believe you may be in a similar situation on AArch64 with NEON vs SVE
> where both 128-bit and wide SVE vectors may be used at runtime.)
> 

Thanks for sharing the background.

> Now back to the patch.
> 
> What I see in the patch is that you try to attack the problem from the
> opposite side: you introduce new concept of a size-agnostic vector
> register on RA side and then directly use it during matching: vecA is
> used in aarch64_sve.ad and aarch64.ad relies on vecD/vecX.
> 
> Unfortunately, it extends the implementation in orthogonal direction
> which looks too aarch64-specific to benefit other architectures and x86
> in particular. I believe there's an alternative approach which can benefit
> both aarch64 and x86, but it requires more experimentation.
> 

Since vecA and vecX (and the others) are architecturally different
vector registers, I think it is quite natural to introduce a new vector
register type, vecA, to represent the corresponding hardware vector
register. Please note that in a vector-length-agnostic ISA, such as Arm
SVE or the RISC-V vector extension [1], the vector registers are
architecturally the same type of register regardless of the hardware
implementation.

> If I were to start from scratch, I would choose between 3 options:
> 
>     #1: reuse existing VecX/VecY/VecZ ideal registers and limit supported
> vector sizes to 128-/256-/512-bit values.
> 
>     #2: lift limitation on max size (to 1024/2048 bits), but ignore
> non-power-of-2 sizes;
> 
>     #3: introduce support for full range of vector register sizes
> (128-/.../2048-bit with 128-bit step);
> 
> I see 2 (mostly unrelated) limitations: maximum vector size and
> non-power-of-2 sizes.
> 
> My understanding is that you don't try to accurately represent SVE for
> now, but lay some foundations for future work: you give up on
> non-power-of-2 sized vectors, but still enable support for arbitrarily
> sized vectors (addressing both limitations on maximum size and size
> granularity) in RA (and it affects only spills). So, it is somewhere
> between #2 and #3.
> 
> The ultimate goal is definitely #3, but how much more work will be
> required to teach the JVM about non-power-of-2 vectors? As I see in the
> patch, you don't have auto-vectorizer support yet, but Vector API will
> provide access to whatever size hardware exposes. What do you expect on
> hardware front in the near/mid-term future? Anything supporting vectors
> larger than 512-bit? What about 384-bit vectors?
> 

I think our patch is already at #3. :-) We do not give up on
non-power-of-2 sized vectors; instead, this patch supports them well,
and it still uses the current regmask framework. (Actually, I think the
only limitation on the vector size is that it must be a multiple of 32
bits, i.e. the number of bits in one reg slot.)
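
To illustrate the slot arithmetic, here is a minimal sketch. It is not
code from the patch, and the helper names are hypothetical. The only
assumption is the one stated above: one reg slot covers 32 bits.

```cpp
#include <cassert>
#include <cstdint>

// Assumption from the mail: one register-mask slot covers 32 bits.
const uint32_t kBitsPerSlot = 32;

// Hypothetical helper: true if the vector length is a whole number of slots.
inline bool is_slot_representable(uint32_t vector_bits) {
  return vector_bits != 0 && vector_bits % kBitsPerSlot == 0;
}

// Hypothetical helper: number of 32-bit slots a vector of this length needs.
inline uint32_t slots_for_vector(uint32_t vector_bits) {
  assert(is_slot_representable(vector_bits));
  return vector_bits / kBitsPerSlot;
}
```

With this, a 384-bit vector takes 12 slots and a 2048-bit vector takes
64 slots, so non-power-of-2 lengths need no special handling.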

I am not sure about other Arm partners' hardware implementations in the
mid-term future, as a CPU implementer is free to choose any maximum
vector size as long as it follows the SVE architecture specification.
But we did test the patch with Vector API on different SVE-supported
vector sizes on an emulator, e.g. 384, 768, 1024 and 2048 bits. The
register allocator, including spill/unspill, works well on those
different sizes with Vector API. (Thanks to your great work on Vector
API. :-))

We currently limit the vector size to powers of 2 in
vm_version_aarch64.cpp, as suggested by Andrew Dinn, because the current
SLP vectorizer only supports power-of-2 vectors. With Vector API in, I
think this restriction can be removed. We are also working on a new
vectorizer that supports predication/masking, which should not have the
power-of-2 limitation.

> I don't have a good understanding where SVE/SVE2-capable hardware is
> moving and would benefit a lot from your insights about what to expect.
> 
> If 256-/512-bit vectors end up as the only option, then #1 should fit
> them well.
> 
> For larger vectors #2 (or a mix of #1 and #2) may be a good fit. My
> understanding is that existing RA machinery should support 1024-bit vectors
> well. So, unless 2048-bit vectors are needed, we could live with the
> framework we have right now.
> 
> If hardware has non-power-of-2 vectors, but JVM doesn't support them,
> then JVM can work with just power-of-2 portion of them (384-bit => 256-bit).
> 

Yes, we can make the JVM use only a portion of the vector, at least for
SVE. My concern is that the performance wouldn't be as good as with the
full available vector width.
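
To put a rough number on that concern, here is a small sketch of the
"power-of-2 portion" idea (384-bit => 256-bit). The helper names are
hypothetical and this is not how vm_version_aarch64.cpp is written; it
only shows the arithmetic.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: largest power of two that is <= bits.
inline uint32_t round_down_to_power_of_2(uint32_t bits) {
  assert(bits > 0);
  uint32_t result = 1;
  while (result <= bits / 2) {
    result *= 2;
  }
  return result;
}

// Hypothetical helper: share of the hardware vector width (in percent,
// rounded down) that stays usable when only the power-of-2 portion is used.
inline uint32_t usable_percent(uint32_t hw_bits) {
  return round_down_to_power_of_2(hw_bits) * 100 / hw_bits;
}
```

For 384-bit or 768-bit hardware only about two thirds of each vector
would be used, while power-of-2 lengths such as 512 or 2048 bits lose
nothing.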

> Giving up on #3 for now and starting with less ambitious goals (#1 or
> #2) would reduce pressure on RA and give more time for additional
> experiments to come with a better and more universal
> support/representation of generic/size-agnostic vectors. And, in a
> longer term, help reducing complexity and technical debt in the area.
> 
> Some more comments follow inline.
> 
>>> Compared to x86 w/ AVX512, architectural state for vector registers is
>>> 4x larger in the worst case (ignoring predicate registers for now).
>>> Here are the relevant constants on x86:
>>>
>>> gensrc/adfiles/adGlobals_x86.hpp:
>>>
>>> // the number of reserved registers + machine registers.
>>> #define REG_COUNT    545
>>> ...
>>> // Size of register-mask in ints
>>> #define RM_SIZE 22
>>>
>>> My estimate is that for AArch64 with SVE support the constants will be:
>>>
>>>     REG_COUNT < 2500
>>>     RM_SIZE < 100
>>>
>>> which don't look too bad.
>>>
>>
>> Right, but I think most real hardware implementations will be no
>> larger than 512 bits. Having a large bitmask array, with most
>> bits useless, will be less efficient for regmask computation.
> 
> Does it make sense to limit the maximum supported size to 512-bit then
> (at least, initially)? In that case, the overhead won't be worse than
> it is on x86 now.
> 

Technically, this may be possible, though I haven't tried it. My
concerns are:

1) A larger regmask array would be less efficient (we only use 256 bits,
i.e. 8 slots, for SVE in this patch), though it won't be worse than x86.

2) Given that the current patch already supports larger sizes and
non-power-of-2 sizes well with a relatively small diff, if we want to
support other sizes soon, there may be some extra work to roll back the
ad file changes.
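
For concern 1), a back-of-the-envelope sketch of the vector part of the
regmask may help. It assumes 32 vector registers (Z0-Z31 on SVE), 32-bit
slots, and a mask stored as an array of 32-bit ints. The function names
are hypothetical, and the numbers cover vector registers only, not the
whole REG_COUNT.

```cpp
#include <cassert>
#include <cstdint>

// Assumptions: 32 vector registers, mask kept as an array of 32-bit words.
const uint32_t kNumVectorRegs = 32;
const uint32_t kBitsPerMaskWord = 32;

// Hypothetical helper: mask bits needed for all vector registers when each
// register is modeled with slots_per_reg slots (one mask bit per slot).
inline uint32_t vector_mask_bits(uint32_t slots_per_reg) {
  return kNumVectorRegs * slots_per_reg;
}

// Hypothetical helper: the same quantity in 32-bit mask words, rounded up.
inline uint32_t vector_mask_words(uint32_t slots_per_reg) {
  assert(slots_per_reg > 0);
  return (vector_mask_bits(slots_per_reg) + kBitsPerMaskWord - 1)
         / kBitsPerMaskWord;
}
```

Modeling each register with 8 slots needs 8 mask words for the vector
registers, 16 slots (512-bit) needs 16 words, and 64 slots (2048-bit)
needs 64 words.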

>>> Also, I don't see any changes related to stack management. So, I
>>> assume it continues to be managed in slots. Any problems there? As I
>>> understand, wide SVE registers are caller-save, so there may be many
>>> spills of huge vectors around a call. (Probably, not possible with C2
>>> auto-vectorizer as it is now, but Vector API will expose it.)
>>>
>>
>> Yes, the stack is still managed in slots, but it will be allocated with
>> real vector register length instead of 'virtual' slots for VecA. See the
>> usages of scalable_reg_slots(), e.g. in chaitin.cpp:1587. We have also
>> applied the patch to vector api, and did find a lot of vector spills
>> with expected correct results.
> 
> I'm curious whether similar problems may arise for spills. Considering
> wide vector registers are caller-saved, it's possible to have lots of
> 256-byte values to end up on stack (especially, with Vector API). Any
> concerns with that?
> 

No, we don't need such big (256-byte) slots for a smaller vector
register. The spill slots have the same size as the real vector length,
e.g. 48 bytes for a 384-bit vector. Even for alignment, we currently
use SlotsPerVecA (8 slots, i.e. 32 bytes or 256 bits), and skipped
slots can still be allocated to other args. This is still smaller than
AVX512 (64 bytes, 512 bits). We can tweak the patch to choose a smaller
value if we think the alignment is too large. (Yes, we should always
try to avoid spills of wide vectors, especially with Vector API, to
avoid performance pitfalls.)
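
The sizes above can be checked with a small sketch. The helper names are
hypothetical, and the round-up alignment shown is only my simplified
assumption of how an 8-slot alignment would be applied, not the exact
code in chaitin.cpp.

```cpp
#include <cassert>
#include <cstdint>

// Assumptions from the mail: a stack slot is 4 bytes (32 bits) and the
// alignment unit is SlotsPerVecA = 8 slots (32 bytes).
const uint32_t kBytesPerSlot = 4;
const uint32_t kSlotsPerVecA = 8;

// Hypothetical helper: bytes needed to spill one vector of the real length.
inline uint32_t spill_bytes(uint32_t vector_bits) {
  assert(vector_bits % 32 == 0);
  return vector_bits / 8;
}

// Hypothetical helper: the same size expressed in stack slots.
inline uint32_t spill_slots(uint32_t vector_bits) {
  return spill_bytes(vector_bits) / kBytesPerSlot;
}

// Hypothetical helper: round a slot offset up to the next multiple of
// kSlotsPerVecA; the skipped slots stay available for other values.
inline uint32_t align_slot_offset(uint32_t slot_offset) {
  return (slot_offset + kSlotsPerVecA - 1) / kSlotsPerVecA * kSlotsPerVecA;
}
```

A 384-bit vector thus spills as 48 bytes (12 slots), and only a
2048-bit vector would need the full 256 bytes.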

>>> Have you noticed any performance problems? If that's the case, then
>>> AVX512 support on x86 would benefit from similar optimization as well.
>>>
>>
>> Do you mean register allocation performance problems? I did not notice
>> that before. Do you have any suggestion on how to measure that?
> 
> I'd try to run some applications/benchmarks with -XX:+CITime to get a
> sense how much RA may be affected.
> 

Thanks! I will give it a try.

[1] https://github.com/riscv/riscv-v-spec/releases/tag/0.9

Thanks,
Ningsheng