[aarch64-port-dev ] RFR(L): 8231441: AArch64: Initial SVE backend support

Fri Aug 21 22:34:40 UTC 2020

Thanks for clarifications, Ningsheng.

Let me share my thoughts on the topic. I'll start by summarizing
the experience of migrating x86 code to generic vectors.

The JVM has quite a bit of special logic to support vectors. It hasn't
exhausted the complexity budget yet, but it's quite close to the limit
(as you probably noticed). While extending the x86 backend to support
the Vector API, we pushed it over the limit and had to address some of
the issues.

The ultimate goal was to move to vectors which represent full-width
hardware registers. After we were convinced that it would work well in
AD files, we encountered some inefficiencies with vector spills:
depending on the actual hardware, smaller (than available) vectors may
be used (e.g., integer computations on an AVX-capable CPU). So, we
stopped half-way and left the post-matching part intact: depending on
the actual vector value width, the appropriate operand (vecX/vecY/vecZ +
legacy variants) is chosen.

(I believe you may be in a similar situation on AArch64 with NEON vs SVE 
where both 128-bit and wide SVE vectors may be used at runtime.)

Now back to the patch.

What I see in the patch is that you try to attack the problem from the
opposite side: you introduce a new concept of a size-agnostic vector
register on the RA side and then use it directly during matching: vecA
is used in aarch64_sve.ad, while aarch64.ad relies on vecD/vecX.

Unfortunately, it extends the implementation in an orthogonal direction
which looks too aarch64-specific to benefit other architectures, and x86
in particular. I believe there's an alternative approach which can
benefit both aarch64 and x86, but it requires more experimentation.

If I were to start from scratch, I would choose between 3 options:

   #1: reuse existing VecX/VecY/VecZ ideal registers and limit supported
vector sizes to 128-/256-/512-bit values;

   #2: lift the limitation on max size (to 1024/2048 bits), but ignore
non-power-of-2 sizes;

   #3: introduce support for the full range of vector register sizes
(128-/.../2048-bit with a 128-bit step).

I see 2 (mostly unrelated) limitations: maximum vector size and 
non-power-of-2 sizes.
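
To make the three options concrete, here is a rough sketch of which
vector sizes each one would admit. The names VectorSizePolicy and
is_supported() are hypothetical and exist only for illustration; nothing
like this is in the patch or in HotSpot.

```cpp
// Sketch only: VectorSizePolicy and is_supported() are hypothetical names.
enum class VectorSizePolicy {
  LegacyIdealRegs,     // option #1: 128-/256-/512-bit (VecX/VecY/VecZ)
  PowerOfTwoUpTo2048,  // option #2: powers of two from 128 to 2048 bits
  AnyMultipleOf128     // option #3: 128 to 2048 bits in 128-bit steps
};

constexpr bool is_power_of_two(unsigned x) {
  return x != 0 && (x & (x - 1)) == 0;
}

// True if a vector of `bits` bits is representable under the given policy.
constexpr bool is_supported(VectorSizePolicy policy, unsigned bits) {
  return bits >= 128 && bits <= 2048 && bits % 128 == 0 &&
         (policy == VectorSizePolicy::AnyMultipleOf128 ||
          (is_power_of_two(bits) &&
           (policy == VectorSizePolicy::PowerOfTwoUpTo2048 || bits <= 512)));
}

// 384-bit vectors are only representable under option #3.
static_assert(!is_supported(VectorSizePolicy::LegacyIdealRegs, 384), "");
static_assert(!is_supported(VectorSizePolicy::PowerOfTwoUpTo2048, 384), "");
static_assert(is_supported(VectorSizePolicy::AnyMultipleOf128, 384), "");
```

The maximum-size limit and the power-of-2 limit show up as two
independent conditions, which is why they can be lifted separately.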

My understanding is that you don't try to accurately represent SVE for 
now, but lay some foundations for future work: you give up on 
non-power-of-2 sized vectors, but still enable support for arbitrarily 
sized vectors (addressing both limitations on maximum size and size 
granularity) in RA (and it affects only spills). So, it is somewhere 
between #2 and #3.

The ultimate goal is definitely #3, but how much more work will be
required to teach the JVM about non-power-of-2 vectors? As I see in the
patch, you don't have auto-vectorizer support yet, but the Vector API
will provide access to whatever size the hardware exposes. What do you
expect on the hardware front in the near/mid-term future? Anything
supporting vectors larger than 512-bit? What about 384-bit vectors?

I don't have a good understanding of where SVE/SVE2-capable hardware is
heading and would benefit a lot from your insights about what to expect.

If 256-/512-bit vectors end up as the only option, then #1 should fit 
them well.

For larger vectors #2 (or a mix of #1 and #2) may be a good fit. My
understanding is that the existing RA machinery should support 1024-bit
vectors well. So, unless 2048-bit vectors are needed, we could live with
the framework we have right now.

If the hardware has non-power-of-2 vectors, but the JVM doesn't support
them, then the JVM can work with just the power-of-2 portion of them
(384-bit => 256-bit).
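
As a minimal sketch of that fallback: the JVM would round the hardware
vector length down to the nearest power of two. usable_vector_bits() is
a hypothetical helper written for this mail, not an existing HotSpot
function.

```cpp
// Hypothetical helper (not a HotSpot function): the largest power of two
// that does not exceed hw_bits, found by clearing the lowest set bit until
// only one bit remains. Returns 0 for an input of 0.
constexpr unsigned usable_vector_bits(unsigned hw_bits) {
  return (hw_bits & (hw_bits - 1)) == 0
             ? hw_bits
             : usable_vector_bits(hw_bits & (hw_bits - 1));
}

static_assert(usable_vector_bits(384) == 256, "384-bit hardware => 256-bit");
static_assert(usable_vector_bits(512) == 512, "powers of two are unchanged");
```

The unused portion grows with the vector length: under this scheme a
1920-bit implementation would be treated as 1024-bit.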

Giving up on #3 for now and starting with less ambitious goals (#1 or
#2) would reduce pressure on RA and give more time for additional
experiments to come up with a better and more universal
support/representation of generic/size-agnostic vectors. And, in the
longer term, it would help reduce complexity and technical debt in the
area.

Some more comments follow inline.

>> Compared to x86 w/ AVX512, architectural state for vector registers is 
>> 4x larger in the worst case (ignoring predicate registers for now). 
>> Here are the relevant constants on x86:
>>
>> gensrc/adfiles/adGlobals_x86.hpp:
>>
>> // the number of reserved registers + machine registers.
>> #define REG_COUNT    545
>> ...
>> // Size of register-mask in ints
>> #define RM_SIZE 22
>>
>> My estimate is that for AArch64 with SVE support the constants will be:
>>
>>    REG_COUNT < 2500
>>    RM_SIZE < 100
>>
>> which don't look too bad.
>>
> 
> Right, but given that most real hardware implementations will be no 
> larger than 512 bits, I think. Having a large bitmask array, with most 
> bits useless, will be less efficient for regmask computation.

Does it make sense to limit the maximum supported size to 512-bit then
(at least, initially)? In that case, the overhead won't be worse than it
is on x86 now.
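
For reference, a back-of-the-envelope sketch of how the register mask
grows with the maximum vector size. It assumes one mask bit per 32-bit
register slot and 32 vector registers. The rm_size() formula is my
assumption about how the mask size is derived from the register count; I
picked it because it reproduces RM_SIZE == 22 for REG_COUNT == 545, so
treat it as an approximation rather than the actual ADLC logic.

```cpp
// Assumption: one register-mask bit per 32-bit register slot.
constexpr int kBitsPerSlot = 32;

// Mask bits needed to cover num_regs vector registers of max_bits bits each.
constexpr int vector_slots(int num_regs, int max_bits) {
  return num_regs * (max_bits / kBitsPerSlot);
}

// Assumption: 32-bit words needed for the registers, plus a few extra words
// for stack arguments, rounded up to an even word count. Chosen because it
// reproduces RM_SIZE == 22 for REG_COUNT == 545.
constexpr int rm_size(int reg_count) {
  return (((reg_count + 31) >> 5) + 3 + 1) & ~1;
}

static_assert(rm_size(545) == 22, "matches the x86 constants quoted above");
static_assert(vector_slots(32, 2048) == 4 * vector_slots(32, 512),
              "2048-bit registers need 4x the mask bits of 512-bit ones");
static_assert(rm_size(2500) < 100, "consistent with the SVE estimate");
```

With a 512-bit cap, the vector registers contribute the same 512 mask
bits on AArch64 as they do on x86 with AVX512.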

>> Also, I don't see any changes related to stack management. So, I 
>> assume it continues to be managed in slots. Any problems there? As I 
>> understand, wide SVE registers are caller-save, so there may be many 
>> spills of huge vectors around a call. (Probably, not possible with C2 
>> auto-vectorizer as it is now, but Vector API will expose it.)
>>
> 
> Yes, the stack is still managed in slots, but it will be allocated with 
> real vector register length instead of 'virtual' slots for VecA. See the 
> usages of scalable_reg_slots(), e.g. in chaitin.cpp:1587. We have also 
> applied the patch to vector api, and did find a lot of vector spills 
> with expected correct results.

I'm curious whether similar problems may arise for spills. Considering
that wide vector registers are caller-saved, it's possible for lots of
256-byte values to end up on the stack (especially with the Vector API).
Any concerns with that?
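
To put a rough number on it, here is a small sketch of the stack
footprint, assuming 4-byte stack slots and ignoring alignment padding.
The helper names are hypothetical, and 32 live vectors is just an
illustrative worst case (every vector register live across a call).

```cpp
// Assumption: stack space is counted in 32-bit (4-byte) slots.
constexpr int kSlotBytes = 4;

// Stack slots needed to spill one vector of vector_bits bits.
constexpr int spill_slots(int vector_bits) {
  return vector_bits / 8 / kSlotBytes;
}

// Bytes of stack needed to spill live_vectors vectors around a call.
constexpr int spill_bytes(int live_vectors, int vector_bits) {
  return live_vectors * spill_slots(vector_bits) * kSlotBytes;
}

static_assert(spill_slots(2048) == 64, "one 256-byte vector is 64 slots");
static_assert(spill_bytes(32, 2048) == 8192, "32 live 2048-bit vectors");
static_assert(spill_bytes(32, 512) == 2048, "same case with 512-bit vectors");
```

So the worst case is 8 KB of spill area per call site with 2048-bit
vectors, versus 2 KB with 512-bit ones.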

>> Have you noticed any performance problems? If that's the case, then 
>> AVX512 support on x86 would benefit from similar optimization as well.
>>
> 
> Do you mean register allocation performance problems? I did not notice 
> that before. Do you have any suggestion on how to measure that?

I'd try to run some applications/benchmarks with -XX:+CITime to get a
sense of how much RA may be affected.

Best regards,
Vladimir Ivanov