[aarch64-port-dev ] RFR(L): 8231441: AArch64: Initial SVE backend support

Vladimir Ivanov vladimir.x.ivanov at oracle.com
Mon Aug 24 12:03:47 UTC 2020


Hi Ningsheng,

>> What I see in the patch is that you try to attack the problem from the
>> opposite side: you introduce a new concept of a size-agnostic vector
>> register on the RA side and then use it directly during matching: vecA
>> is used in aarch64_sve.ad while aarch64.ad relies on vecD/vecX.
>>
>> Unfortunately, it extends the implementation in an orthogonal direction
>> which looks too aarch64-specific to benefit other architectures, x86 in
>> particular. I believe there's an alternative approach which can benefit
>> both aarch64 and x86, but it requires more experimentation.
>>
> 
> Since vecA and vecX (and others) are architecturally different vector 
> registers, I think it's quite natural that we introduced the new vector 
> register type vecA to represent what we need for the corresponding 
> hardware vector register. Please note that in a vector-length-agnostic 
> ISA, like Arm SVE and the RISC-V vector extension [1], the vector 
> registers are architecturally the same type of register despite the 
> different hardware implementations.

FTR vecX et al. don't represent hardware registers; they represent vector 
values of predefined size. (For example, vecS, vecD, and vecX map to the 
very same set of 128-bit vector registers on x86.)
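
For illustration, here is roughly how C2 picks the ideal register type 
for a vector value today - purely from the value's size in bytes (a 
simplified, self-contained sketch; the real hook lives in the 
per-platform .ad files, and the enum below merely mirrors HotSpot's 
Op_VecS..Op_VecZ ideal opcodes):

   #include <cassert>

   enum IdealVecReg { VecS, VecD, VecX, VecY, VecZ };

   IdealVecReg vector_ideal_reg(int size_in_bytes) {
     switch (size_in_bytes) {
       case  4: return VecS;   //  32-bit vector value
       case  8: return VecD;   //  64-bit vector value
       case 16: return VecX;   // 128-bit vector value
       case 32: return VecY;   // 256-bit vector value
       case 64: return VecZ;   // 512-bit vector value
     }
     assert(false && "unsupported vector size");
     return VecS;              // unreachable
   }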

My point is: in terms of existing concepts, what you are adding is not 
"yet another flavor of vector". It's a new full-fledged concept (which 
manifests as special cases across the JVM), and you end up with 2 
different representations of vectors.

I agree that the hardware is quite different, but I don't see that it 
makes much of a difference in the context of the JVM, and the 
abstractions used to hide it are similar.

For example, as of now, most of the x86-specific code in C2 works just 
fine with full-width hardware vectors which are oblivious of their sizes 
until RA kicks in. And the SVE patch you propose completely omits the 
implicit predication the hardware provides, which makes it similar to 
AVX512 (modulo the wider range of vector sizes supported).

So, even though the hardware abstractions being used aren't actually 
*that* different, vecA piles on complexity and introduces a separate way 
to achieve similar results (but slightly differently). And that's what 
bothers me. I'd like to see more unification instead, which should bring 
a reduction in complexity and an opportunity to address long-standing 
technical debt (and the 5 flavors of ideal registers for vectors are 
part of it, IMO).

So far, I see 2 main directions for RA work:

   (a) support vectors of arbitrary size:
     (1) helps push the upper limit on the size (1024-bit)
     (2) handles non-power-of-2 sizes

   (b) optimize RA implementation for large values

Anything else?

Speaking of (a) in particular, I don't see why a possible solution for 
it shouldn't supersede vecX et al. altogether.

Also, I may be wrong, but I don't see clear evidence that there's a 
pressing need to have all of that fixed right from the beginning. 
(That's why I put the #1 and #2 options on the table.) Starting with 
#1/#2 would untie initial SVE support from the exploratory work needed 
to choose the most appropriate solution for (a) and (b).

>> If I were to start from scratch, I would choose between 3 options:
>>
>>     #1: reuse existing VecX/VecY/VecZ ideal registers and limit supported
>> vector sizes to 128-/256-/512-bit values.
>>
>>     #2: lift limitation on max size (to 1024/2048 bits), but ignore
>> non-power-of-2 sizes;
>>
>>     #3: introduce support for full range of vector register sizes
>> (128-/.../2048-bit with 128-bit step);
>>
>> I see 2 (mostly unrelated) limitations: maximum vector size and
>> non-power-of-2 sizes.
>>
>> My understanding is that you don't try to accurately represent SVE for
>> now, but lay some foundations for future work: you give up on
>> non-power-of-2 sized vectors, but still enable support for arbitrarily
>> sized vectors (addressing both limitations on maximum size and size
>> granularity) in RA (and it affects only spills). So, it is somewhere
>> between #2 and #3.
>>
>> The ultimate goal is definitely #3, but how much more work will be
>> required to teach the JVM about non-power-of-2 vectors? As I see in the
>> patch, you don't have auto-vectorizer support yet, but the Vector API
>> will provide access to whatever size the hardware exposes. What do you
>> expect on the hardware front in the near/mid-term future? Anything
>> supporting vectors larger than 512-bit? What about 384-bit vectors?
>>
> 
> I think our patch is now at #3. :-) We do not give up on non-power-of-2 
> sized vectors; instead, we support them well in this patch, and we are 
> still using the current regmask framework. (Actually, I think the only 
> limitation on the vector size is that it should be a multiple of 32 bits 
> - the bits per 1 reg slot.)
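
(A quick aside on that 32-bit granularity: C2 describes registers and 
the stack in 32-bit slots, so the arithmetic is simply the following - 
a self-contained sketch of my own, not code from the patch.)

   #include <cassert>

   int slots_for_vector(int vector_bits) {
     assert(vector_bits % 32 == 0 && "vector must fill whole 32-bit slots");
     return vector_bits / 32;  // e.g. a 384-bit SVE vector occupies 12 slots
   }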

> I am not sure about other Arm partners' hardware implementations in the 
> mid-term future, as CPU implementers are free to choose any max vector 
> size as long as it follows the SVE architecture specification. But we 
> did test the patch with the Vector API on different SVE vector sizes on 
> an emulator, e.g. 384, 768, 1024, 2048, etc. The register allocator, 
> including spill/unspill, works well on those different sizes with the 
> Vector API. (Thanks to your great work on the Vector API. :-))
> 
> We currently limit the vector size to a power of 2 in 
> vm_version_aarch64.cpp, as suggested by Andrew Dinn, because the current 
> SLP vectorizer only supports power-of-2 vectors. With the Vector API in, 
> I think such a restriction can be removed. And we are also working on a 
> new vectorizer to support predication/masking, which should not have the 
> power-of-2 limitation.

[...]

> Yes, we can make the JVM support a portion of the vectors, at least for 
> SVE. My concern is that the performance wouldn't be as good as with the 
> full available vector width.

To be clear: I called it "somewhere between #2 and #3" solely because 
the auto-vectorizer bails out on non-power-of-2 sizes. And even though 
the Vector API will work with such cases just fine, IMO having 
auto-vectorizer support is required before calling #3 complete.

In that respect, choosing a smaller vector size the auto-vectorizer 
supports is preferable to picking the full-width vectors and turning off 
the auto-vectorizer (even though the Vector API will support them).

It can be turned into a heuristic (by default, pick only power-of-2 
sizes; let users explicitly specify non-power-of-2 sizes), but speaking 
of priorities, IMO auto-vectorizer support is more important.
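
For illustration, a minimal sketch of such a heuristic (a hypothetical 
helper of mine, not code from the patch): clamp the auto-vectorizer to 
the largest power-of-2 width the hardware allows, unless the user 
explicitly opts into non-power-of-2 sizes.

   // Hypothetical heuristic (illustrative only): round the hardware's
   // maximum vector width down to a power of 2 for the auto-vectorizer.
   int default_vectorizer_width(int hw_width_bytes, bool allow_non_pow2) {
     if (allow_non_pow2) {
       return hw_width_bytes;  // e.g. 48 bytes on a 384-bit SVE part
     }
     int w = 1;
     while (w * 2 <= hw_width_bytes) {
       w *= 2;                 // largest power of 2 <= hw width
     }
     return w;                 // e.g. 48 bytes -> 32 bytes (256-bit)
   }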

>> Giving up on #3 for now and starting with less ambitious goals (#1 or
>> #2) would reduce pressure on RA and give more time for additional
>> experiments to come up with a better and more universal
>> support/representation of generic/size-agnostic vectors. And, in the
>> longer term, help reduce complexity and technical debt in the area.
>>
>> Some more comments follow inline.
>>
>>>> Compared to x86 w/ AVX512, architectural state for vector registers is
>>>> 4x larger in the worst case (ignoring predicate registers for now).
>>>> Here are the relevant constants on x86:
>>>>
>>>> gensrc/adfiles/adGlobals_x86.hpp:
>>>>
>>>> // the number of reserved registers + machine registers.
>>>> #define REG_COUNT    545
>>>> ...
>>>> // Size of register-mask in ints
>>>> #define RM_SIZE 22
>>>>
>>>> My estimate is that for AArch64 with SVE support the constants will be:
>>>>
>>>>     REG_COUNT < 2500
>>>>     RM_SIZE < 100
>>>>
>>>> which don't look too bad.
>>>>
>>>
>>> Right, but most real hardware implementations will be no larger than
>>> 512 bits, I think. Having a large bitmask array, with most bits
>>> useless, will be less efficient for regmask computation.
>>
>> Does it make sense to limit the maximum supported size to 512-bit then
>> (at least, initially)? In that case, the overhead won't be worse than
>> it is on x86 now.
>>
> 
> Technically, this may be possible, though I haven't tried. My concerns are:
> 
> 1) Larger regmask arrays would be less efficient (we only use 256 bits 
> - 8 slots - for SVE in this patch), though it won't be worse than x86.
> 
> 2) Given that the current patch already supports larger sizes and 
> non-power-of-2 sizes well with a relatively small diff, if we want to 
> support other sizes soon, there may be some more work to roll back the 
> ad file changes.
> 
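
(As a sanity check on the earlier REG_COUNT/RM_SIZE estimates, here is 
the rough arithmetic behind them - my own back-of-the-envelope figures, 
not values generated into adGlobals: once every 32-bit slot of every 
register gets its own name, the 32 SVE registers at 2048 bits dominate 
the count.)

   const int sve_slots  = 32 * (2048 / 32);        // 2048 slot-registers for v0..v31
   const int other_regs = 400;                     // generous allowance for GPRs, flags, ...
   const int reg_count  = sve_slots + other_regs;  // ~2450, i.e. "< 2500"
   const int rm_size    = (reg_count + 31) / 32;   // ~77 ints before padding, i.e. "< 100"
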
>>>> Also, I don't see any changes related to stack management. So, I
>>>> assume it continues to be managed in slots. Any problems there? As I
>>>> understand, wide SVE registers are caller-save, so there may be many
>>>> spills of huge vectors around a call. (Probably, not possible with C2
>>>> auto-vectorizer as it is now, but Vector API will expose it.)
>>>>
>>>
>>> Yes, the stack is still managed in slots, but it will be allocated with
>>> the real vector register length instead of 'virtual' slots for VecA.
>>> See the usages of scalable_reg_slots(), e.g. in chaitin.cpp:1587. We
>>> have also applied the patch to the Vector API, and did find a lot of
>>> vector spills with the expected correct results.
>>
>> I'm curious whether similar problems may arise for spills. Considering
>> wide vector registers are caller-saved, it's possible for lots of
>> 256-byte values to end up on the stack (especially with the Vector
>> API). Any concerns with that?
>>
> 
> No, we don't need to have such big (256-byte) slots for a smaller vector 
> register. The spill slots are the same size as the real vector length, 
> e.g. 48 bytes for a 384-bit vector. For alignment, we currently choose 
> SlotsPerVecA (8 slots, i.e. 32 bytes, 256 bits), which is still smaller 
> than AVX512 (64 bytes, 512 bits), and skipped slots can still be 
> allocated to other args. We can tweak the patch to choose another, 
> smaller value if we think the alignment is too large. (Yes, we should 
> always try to avoid spills for wide vectors, especially with the Vector 
> API, to avoid performance pitfalls.)

Thanks for the clarifications.

Do you envision any new problems or limitations when spilling a large 
number of huge (2048-bit) vectors on the stack?
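
For scale, the back-of-the-envelope I have in mind (a hypothetical worst 
case with illustrative numbers, not measurements):

   const int slot_bits   = 32;                      // C2 stack slot granularity
   const int vector_bits = 2048;                    // maximum SVE width
   const int slots_each  = vector_bits / slot_bits; // 64 slots per spilled vector
   const int bytes_each  = vector_bits / 8;         // 256 bytes per spilled vector
   // e.g. 32 such vectors live across a call: 32 * 256 = 8 KB of stack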

Best regards,
Vladimir Ivanov

>>>> Have you noticed any performance problems? If that's the case, then
>>>> AVX512 support on x86 would benefit from a similar optimization as well.
>>>>
>>>
>>> Do you mean register allocation performance problems? I did not notice
>>> that before. Do you have any suggestions on how to measure that?
>>
>> I'd try to run some applications/benchmarks with -XX:+CITime to get a
>> sense of how much RA may be affected.
>>
> 
> Thanks! I will give it a try.
> 
> [1] https://github.com/riscv/riscv-v-spec/releases/tag/0.9
> 
> Thanks,
> Ningsheng
> 

