RFR: 8370473: C2: Better Aligment of Vector Spill Slots [v4]
Richard Reingruber
rrich at openjdk.org
Fri Nov 21 11:03:58 UTC 2025
On Thu, 20 Nov 2025 10:21:34 GMT, Richard Reingruber <rrich at openjdk.org> wrote:
>> With this change, C2 will allocate spill slots for vectors at sp offsets aligned to the size of the vectors. The maximum alignment is StackAlignmentInBytes.
>>
>> It also updates comments that were never updated to describe how register allocation works for values larger than 64 bits.
>>
>> The change helps produce better spill code on AARCH64 and PPC64, where an additional add instruction is emitted if the sp offset of a vector spill or unspill is not aligned.
>>
>> The change is more a cleanup than an optimization. In most cases the sp offsets will already be properly aligned.
>> Unaligned offsets can only be generated when there are incoming stack arguments, and even then alignment padding is only added if vector registers larger than 64 bits are used.
>>
>> So the costs are effectively zero, especially because the extra padding won't enlarge the frame: only virtual registers are allocated, and these are mapped to the caller frame (see `pad0` in the [diagram](https://github.com/openjdk/jdk/blob/92e380c59c2498b1bc94e26658b07b383deae59a/src/hotspot/cpu/aarch64/aarch64.ad#L3829)).
>>
>> There is a risk, though, that with the extra virtual registers allocated for `pad0` the limit on the number of registers a `RegMask` can represent is reached (this occurs with excessive spilling). If that happens, the compilation would fail. It could then be retried with a smaller alignment for vector spilling. I haven't implemented that as I thought the risk was negligible.
>>
>> Note that it is the sp offset of the accesses, not the effective address, that should be aligned. So one could even argue that the maximum alignment could be higher than StackAlignmentInBytes.
>>
>> ##### Testing with fastdebug builds on AARCH64 and PPC64:
>>
>> hotspot_vector_1
>> hotspot_vector_2
>> jdk_vector
>> jdk_vector_sanity
>>
>> ##### The change passed our CI testing:
>> Tiers 1-4 of hotspot and jdk, all of langtools and jaxp, the Renaissance Suite and SAP-specific tests.
>> Testing was done on the main platforms and also on Linux/PPC64le and AIX.
>>
>> C2 compilation of `jdk.internal.vm.vector.VectorSupport::rearrangeOp` has unaligned spill offsets. It is covered by the following tests:
>>
>> compiler/vectorapi/VectorRearrangeTest.java
>> jdk/incubator/vector/Byte128VectorLoadStoreTests.java
>> jdk/incubator/vector/Double256VectorLoadStoreTests.java
>> jdk/incubator/vector/Float128VectorTests.java
>> jdk/incubator/vector/Long256VectorLoadStoreTests.java
>> jdk/incubator/vector/Short128VectorLoadStoreTests.java
>> jdk/incubator/vector/Vector64ConversionTests.java
>
> Richard Reingruber has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 10 additional commits since the last revision:
>
> - Merge branch 'master'
> - Exclude IR check on riscv with rvv
> - Enhance comment
> - Fix OptoAssembly for Power 8
> - PPC: OptoAssembly for vector spilling
> - Assert aligned sp offsets in vector spilling
> - Delete TMP and !UseNewCode
> - Align Matcher::_new_SP for better vector spilling
> - TMP: trace unaligned vector spilling
> - Add test
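The alignment rule from the quoted PR description can be modeled in a few lines. This is a hedged sketch, not HotSpot code: `spill_alignment`, `align_up` and `needs_extra_add` are hypothetical helpers, `STACK_ALIGNMENT_IN_BYTES = 16` is an assumed stand-in for the platform's StackAlignmentInBytes, and `needs_extra_add` assumes an addressing mode whose immediate offset must be a multiple of the access size (for example the scaled unsigned-offset form of AArch64 `LDR`/`STR` for a 128-bit Q register), ignoring immediate range limits.

```python
# Assumed value; the real StackAlignmentInBytes is platform specific.
STACK_ALIGNMENT_IN_BYTES = 16


def spill_alignment(vec_bytes: int) -> int:
    """Alignment of a vector spill slot: the vector size, capped at the stack alignment."""
    return min(vec_bytes, STACK_ALIGNMENT_IN_BYTES)


def align_up(offset: int, alignment: int) -> int:
    """Round offset up to the next multiple of alignment (a power of two)."""
    return (offset + alignment - 1) & ~(alignment - 1)


def needs_extra_add(sp_offset: int, vec_bytes: int) -> bool:
    """Hypothetical ISA model: the load/store immediate must be a multiple of
    the access size; otherwise the address is formed with an extra add.
    Range limits of the immediate are ignored."""
    return sp_offset % vec_bytes != 0


if __name__ == "__main__":
    first_free = 24  # example: first free sp offset in bytes
    for vec_bytes in (8, 16, 32):
        off = align_up(first_free, spill_alignment(vec_bytes))
        print(f"{vec_bytes:2}-byte vector -> sp offset {off}, "
              f"extra add: {needs_extra_add(off, vec_bytes)}")
```

In this model a 16-byte vector spilled at offset 24 would need the extra add, while the aligned offset 32 would not. Because of the cap, a 32-byte vector still only gets 16-byte alignment, which is the limit the description suggests could be raised.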
I'd like to give a small example to show that this PR helps reduce the frame size rather than increase it.
Example:
- VectorX v1, v2 are spilled
- register sets are aligned to the set size, here SlotsPerVecX = 4
- simplification: no outgoing args and no out-preserve area
Baseline: _new_SP is always aligned to SlotsPerLong = 2
    Slots
                         1
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9  ... 99
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ ... +-+
    | | | | | | |un-|       |       |               |
    | | | | | | |usd|  v1   |  v2   |    locks      |
    | | | | | | |   |       |       |               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ ... +-+
                ^                                   ^
                |                                   |
             _new_SP                             _old_SP
                |<----------- own frame ----------->|
Spill area: slots 6 - 15 = 10 slots
Frame size (_new_SP to _old_SP, see the [diagram](https://github.com/openjdk/jdk/blob/92e380c59c2498b1bc94e26658b07b383deae59a/src/hotspot/cpu/aarch64/aarch64.ad#L3837-L3856)) = 100 - 6 = 94 slots
Slots 6 and 7 are unused because v1 and v2 are aligned to their size. Nevertheless they are part of the frame.
PR: _new_SP is aligned to SlotsPerVecX = 4 because there are spills of that size
    Slots
                         1
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9  ... 99
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ ... +-+
    | | | | | | |un-|       |       |               |
    | | | | | | |usd|  v1   |  v2   |    locks      |
    | | | | | | |   |       |       |               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ ... +-+
                    ^                               ^
                    |                               |
                 _new_SP                         _old_SP
                    |<--------- own frame --------->|
Spill area: slots 8 - 15 = 8 slots
Frame size (_new_SP to _old_SP, see the [diagram](https://github.com/openjdk/jdk/blob/92e380c59c2498b1bc94e26658b07b383deae59a/src/hotspot/cpu/aarch64/aarch64.ad#L3837-L3856)) = 100 - 8 = 92 slots
Slots 6 and 7 are still only used for alignment but they are not part of the frame.
With this PR the resulting frame is smaller.
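The slot arithmetic of the two diagrams can be replayed with a small model. It is a sketch under the example's simplifications, not the Matcher's implementation: `frame_layout` and `align_up` are hypothetical helpers, the first slot available to the frame (6) and _old_SP (100) are read off the diagrams, and spills are assumed to be placed upward from _new_SP, each at the next slot aligned to its own size.

```python
SLOTS_PER_LONG = 2  # baseline alignment of _new_SP
SLOTS_PER_VECX = 4  # a 128-bit VecX occupies four 32-bit stack slots


def align_up(slot: int, alignment: int) -> int:
    """Round a slot number up to the next multiple of alignment (ceiling division)."""
    return -(-slot // alignment) * alignment


def frame_layout(first_slot, old_sp, sp_alignment, spill_sizes):
    """Hypothetical model: align _new_SP, then place each spill at the next
    slot aligned to its own size.
    Returns (new_sp, spill_area_slots, frame_size_slots)."""
    new_sp = align_up(first_slot, sp_alignment)
    next_free = new_sp
    for size in spill_sizes:
        start = align_up(next_free, size)  # may leave an alignment hole
        next_free = start + size
    return new_sp, next_free - new_sp, old_sp - new_sp


spills = [SLOTS_PER_VECX, SLOTS_PER_VECX]  # v1, v2
print("baseline:", frame_layout(6, 100, SLOTS_PER_LONG, spills))
print("this PR: ", frame_layout(6, 100, SLOTS_PER_VECX, spills))
```

The model yields (6, 10, 94) for the baseline and (8, 8, 92) with the PR. In both cases v1 starts at slot 8; aligning _new_SP only moves the padding slots 6 and 7 below _new_SP, so the frame shrinks by two slots.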
-------------
PR Comment: https://git.openjdk.org/jdk/pull/27969#issuecomment-3562513261
More information about the hotspot-compiler-dev mailing list