RFR: 8351627: C2 AArch64 ROR/ROL: assert((1 << ((T>>1)+3)) > shift) failed: Invalid Shift value [v2]

Jatin Bhateja jbhateja at openjdk.org
Tue Mar 25 07:25:10 UTC 2025


On Tue, 18 Mar 2025 03:51:55 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:

>> The following assertion fails on AArch64:
>> 
>> 
>>   Internal Error (jdk/src/hotspot/cpu/aarch64/assembler_aarch64.hpp:2991), pid=3822987, tid=3823007
>>   assert((1 << ((T>>1)+3)) > shift) failed: Invalid Shift value
>> 
>> 
>> with a simple Vector API case:
>> 
>> public static IntVector test() {
>>     IntVector iv = IntVector.zero(IntVector.SPECIES_128);
>>     return iv.lanewise(VectorOperators.ROR, iv);
>> }
>> 
>> 
>> On AArch64, vector `ROR/ROL` (rotate right/left) operations are implemented with a combination of shifts. Please see the pattern for `ROR`:
>> 
>> 
>>   lsr dst1, src, cnt            // unsigned right shift
>>   lsl dst2, src, bitSize - cnt  // left shift
>>   orr dst, dst1, dst2            // logical or
>> 
>> where `bitSize` is the element type width (e.g. `32` for `int`). In above case, `cnt` is a zero constant, resulting in a left shift of 32 (`bitSize - 0`), which exceeds the instruction's valid shift count range and triggers the assertion. To fix this, we need to mask the shift count to ensure it stays within valid range when calculating shift counts for rotate operations: `shiftCnt = shiftCnt & (bitSize - 1)`.
>> 
>> Note that the mask is only necessary for constant shift counts. This not only fixes the assertion failure, but also allows `ROR/ROL src, 0` to be optimized to `src` directly.
>> 
>> For vector variables as shift counts, the masking can be safely omitted because:
>> 1. Vector shift instructions that take a vector register as the shift count may not automatically apply modulo arithmetic based on register size. When the shift count is `32` for int type, the result may be either `zeros` or `src`. However, this doesn't affect correctness for rotate since the final result is combined with `src` using a logical `OR` operation.
>> 2. It saves a vector logical `AND` for masking, which is friendly to the performance.
>
> Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Update the test case

I don't have access to Grace :-) but I have verified your fix over Graviton3.

LGTM.

src/hotspot/share/opto/vectornode.cpp line 1687:

> 1685:     shiftRCnt = phase->transform(new AndINode(cnt, phase->intcon(shift_mask)));
> 1686:     shiftLCnt = phase->transform(new SubINode(phase->intcon(shift_mask + 1), shiftRCnt));
> 1687:     shiftLCnt = phase->transform(new AndINode(shiftLCnt, phase->intcon(shift_mask)));

FTR,  Java-side implementation ensures rounding off the shift count
https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/LongVector.java#L969

This is the edge case which does not impact x86 as instruction semantics are clear about setting the register to zero if [shift count is greater than respective lane count](https://www.felixcloutier.com/x86/psllw:pslld:psllq#:~:text=If%20the%20value%20specified%20by%20the%20count%20operand%20is%20greater%20than%2015%20(for%20words)%2C%2031%20(for%20doublewords)%2C%20or%2063%20(for%20a%20quadword)%2C%20then%20the%20destination%20operand%20is%20set%20to%20all%200s.%20Figure%204%2D17%20gives%20an%20example%20of%20shifting%20words%20in%20a%2064%2Dbit%20operand.) 

Thus right shifted vector by 0 shift count when ORed with all zero left-shited vector by 32 shift count still gives the correct result.

-------------

Marked as reviewed by jbhateja (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/24051#pullrequestreview-2712495498
PR Review Comment: https://git.openjdk.org/jdk/pull/24051#discussion_r2011382909


More information about the hotspot-compiler-dev mailing list