[aarch64-port-dev ] RFR(L): 8231441: AArch64: Initial SVE backend support

Fri Aug 21 07:56:19 UTC 2020

Hi Vladimir,

Thanks a lot for looking at this!

On 8/20/20 8:29 PM, Vladimir Ivanov wrote:
> Hi Ningsheng,
> 
>> http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2020-July/039289.html 
> 
> 
> Impressive work, Ningsheng!
> 
>> http://cr.openjdk.java.net/~njian/8231441/README-RFR.txt
> 
> "Since the bottom 128 bits are shared with the NEON, we extend current
> register mask definition of V0-V31 registers. Currently, c2 uses one bit
> mask for a 32-bit register slot, so to define at most 2048 bits we will
> need to add 64 slots in AD file. That's a really large number, and will
> also break current regmask assumption."
> 
> Can you, please, elaborate on the last point? What RegMask assumptions 
> are broken for 2048-bit vectors? I'm looking at [1] and try to 
> understand the motivation for the changes in shared code.

The current regmask is an array of ints, and each bit stands for one 
32-bit slot, so a single array element covers at most 32*32=1024 bits 
of register state. A 2048-bit register would span two elements, so some 
regmask handling functions, e.g. clear_to_sets() for alignment, would 
need to be re-examined. We may even want to support non-power-of-two 
physical register sizes, which could be a lot more work.
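
To make the arithmetic concrete, here is a rough sketch. It is not 
HotSpot code: the constant and function names are made up for 
illustration, and it only assumes 32-bit mask words with one mask bit 
per 32-bit slot, as described above.

```cpp
// Illustrative names only; these are not identifiers from HotSpot.
constexpr int kBitsPerSlot     = 32;  // one mask bit stands for a 32-bit slot
constexpr int kBitsPerMaskWord = 32;  // the mask is stored as an array of ints

// Number of mask bits (slots) needed to describe one vector register.
constexpr int slots_for(int vector_bits) {
  return vector_bits / kBitsPerSlot;
}

// Number of mask words covered by one register's slots, assuming the
// register starts on a word boundary.
constexpr int mask_words_for(int vector_bits) {
  return (slots_for(vector_bits) + kBitsPerMaskWord - 1) / kBitsPerMaskWord;
}

static_assert(slots_for(128) == 4, "NEON register: 4 slots");
static_assert(slots_for(1024) == 32, "1024 bits fill exactly one mask word");
static_assert(slots_for(2048) == 64, "2048 bits need 64 slots");
static_assert(mask_words_for(2048) == 2, "one register spans two mask words");
```

So anything that currently works on one int at a time would have to 
cope with a single register crossing a word boundary.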

> 
> Compared to x86 w/ AVX512, architectural state for vector registers is 
> 4x larger in the worst case (ignoring predicate registers for now). Here 
> are the relevant constants on x86:
> 
> gensrc/adfiles/adGlobals_x86.hpp:
> 
> // the number of reserved registers + machine registers.
> #define REG_COUNT    545
> ...
> // Size of register-mask in ints
> #define RM_SIZE 22
> 
> My estimate is that for AArch64 with SVE support the constants will be:
> 
>    REG_COUNT < 2500
>    RM_SIZE < 100
> 
> which don't look too bad.
> 

Right, but I think most real hardware implementations will be no wider 
than 512 bits. A large bitmask array with most of its bits unused would 
make regmask computation less efficient.
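
As a back-of-the-envelope comparison, the sketch below counts only the 
mask words needed for the 32 vector registers V0-V31 at different 
maximum vector lengths. The names are hypothetical, and general-purpose, 
predicate and stack slots are ignored, so these are not the real 
RM_SIZE values.

```cpp
// Hypothetical helper names; only V0-V31 are counted here.
constexpr int kVectorRegs      = 32;  // V0-V31
constexpr int kBitsPerSlot     = 32;  // one mask bit per 32-bit slot
constexpr int kBitsPerMaskWord = 32;  // mask stored as an array of ints

// Mask bits needed if every vector register is defined at max_vector_bits.
constexpr int vector_mask_bits(int max_vector_bits) {
  return kVectorRegs * (max_vector_bits / kBitsPerSlot);
}

// The same quantity expressed in 32-bit mask words, rounded up.
constexpr int vector_mask_words(int max_vector_bits) {
  return (vector_mask_bits(max_vector_bits) + kBitsPerMaskWord - 1)
         / kBitsPerMaskWord;
}

static_assert(vector_mask_words(128) == 4, "NEON only");
static_assert(vector_mask_words(512) == 16, "512-bit vectors");
static_assert(vector_mask_words(2048) == 64, "architectural maximum");
```

With the full 2048-bit definition, a 512-bit implementation would use 
only 16 of those 64 words.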

> Also, I don't see any changes related to stack management. So, I assume 
> it continues to be managed in slots. Any problems there? As I 
> understand, wide SVE registers are caller-save, so there may be many 
> spills of huge vectors around a call. (Probably, not possible with C2 
> auto-vectorizer as it is now, but Vector API will expose it.)
> 

Yes, the stack is still managed in slots, but space is allocated using 
the real vector register length instead of the 'virtual' slots for VecA. 
See the usages of scalable_reg_slots(), e.g. in chaitin.cpp:1587. We 
have also applied the patch to the Vector API and did see a lot of 
vector spills, with correct results as expected.
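
The idea can be sketched roughly as follows. This is not the code in the 
webrev: the helper names and the back-to-back spill layout are 
assumptions for illustration, and only the 32-bit slot size comes from 
the discussion above.

```cpp
// Hypothetical helpers, not the HotSpot implementation.
constexpr int kBytesPerSlot = 4;  // a stack slot is 32 bits wide

// Slots needed to spill one scalable vector, given the vector length
// detected on the running hardware (in bytes).
constexpr int spill_slots_for_vector(int vector_length_bytes) {
  return vector_length_bytes / kBytesPerSlot;
}

// Byte offset of the index-th vector when spills are laid out back to
// back (an assumed layout, chosen only to keep the example simple).
constexpr int spill_offset_bytes(int index, int vector_length_bytes) {
  return index * spill_slots_for_vector(vector_length_bytes) * kBytesPerSlot;
}

static_assert(spill_slots_for_vector(32) == 8, "256-bit hardware");
static_assert(spill_slots_for_vector(64) == 16, "512-bit hardware");
static_assert(spill_offset_bytes(2, 64) == 128, "third 512-bit spill");
```

The point is that the spill size follows the actual hardware vector 
length, not the number of mask bits used to describe VecA.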

> Have you noticed any performance problems? If that's the case, then 
> AVX512 support on x86 would benefit from similar optimization as well.
> 

Do you mean register allocation performance problems? I have not 
noticed any so far. Do you have any suggestions on how to measure that?

> FTR there was a similar exercise [2] on x86 to abstract away exact sizes 
> of vector registers, but it didn't have to worry about RA since all the 
> operands were already available. Also, vectors of all different sizes 
> may be used. So, it makes it hard to compare.
> 

I've also noticed that. That's excellent work indeed. It could save a 
lot of backend match rules for different vector register sizes. That 
was one of our concerns when we started to work on SVE RA, in case we 
had to define regmasks for every SVE vector register size. And yes, our 
current approach also solves that problem. :-)

> Best regards,
> Vladimir Ivanov
> 
> [1] http://cr.openjdk.java.net/~njian/8231441/webrev.03-ra/
> 
> [2] https://bugs.openjdk.java.net/browse/JDK-8230015
> 

Thanks,
Ningsheng