RFR: 8338197: [ubsan] ad_x86.hpp:6417:11: runtime error: shift exponent 100 is too large for 32-bit type 'unsigned int' [v3]
    Andrew Dinn 
    adinn at openjdk.org
       
    Tue Sep 30 08:48:43 UTC 2025
On Mon, 29 Sep 2025 12:56:03 GMT, Boris Ulasevich <bulasevich at openjdk.org> wrote:
> That’s right. I tried setting more precise latencies for aarch.ad instructions per the architecture guide to see if we’d see a speedup. I did observe instruction reordering as a result of the change, but on a modern out-of-order CPU I saw no measurable performance change.
The risk is not to do with re-ordering of instructions. Latencies can be used to select alternative instruction+operand reduction paths, i.e. to translate a compiler node or node subgraph into alternative machine nodes. So the important risk is one of *semantic* differences, not performance differences.
I am not claiming that changing the latencies will lead to alternative reductions that change behaviour. However, I am also not sure it won't. Latencies have definitely been used to enforce rule ordering in aarch64 (I have done so) and in x86 (that's where I learned to do it). One example from aarch64 is the rules that select volatile and non-volatile load or store instruction encodings. So, I don't like the idea of capping latencies at 30 without investigating the effect on instruction+operand selection. To be safe, that would really need a thorough review of the instruction+operand rules for all arches, including how they might operate in combination. Testing may indicate no problem, but the validity of that result depends on whether the tests cover all potential cases.
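To illustrate the selection risk in the abstract, here is a toy sketch of cost-based rule selection. It is not the adlc-generated DFA (which is far more involved), and the names are invented:

    // Toy sketch, not adlc output: when two rules can reduce the same ideal
    // node, the cheaper one wins, so a changed latency-derived cost can
    // silently flip which machine encoding is emitted -- e.g. a volatile
    // vs non-volatile load or store form.
    struct Rule { const char* encoding; unsigned cost; };

    static const Rule& select(const Rule& a, const Rule& b) {
      return (a.cost <= b.cost) ? a : b;  // lower cost wins the reduction
    }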
> @adinn, do you think I can push this change? Is this worth deeper investigation, now or later?
I'd be wary of assuming it is ok, as explained above. I think we should simply ignore this warning for now and leave the code as is until we know what we want to do instead.
We could decouple the instruction selection issue from scheduling by cloning the latency field into an artificial, unconstrained 'weight' field that is used to influence rule matching. That would allow us to adjust the latencies independently and range-check them against the number of pipeline elements in the model for sanity. If we did that, we could then independently consider whether these 'scheduler latencies' are actually buying us anything.
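As a rough sketch of what I mean (hypothetical: no 'weight' field exists in adlc today, and the names are made up):

    // Hypothetical: split the single latency value so the scheduler input
    // can be range-checked against the pipeline depth while rule matching
    // keeps an unconstrained knob for ordering.
    struct MachNodeTiming {
      unsigned latency;  // scheduler input; sanity-check against stage count
      unsigned weight;   // instruction-selection bias only; no range limit
    };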
However, I don't think we should go down that path without looking further into how effective the pipeline model is. Looking at how x86 and aarch64 are modelled, I think the answer to that is 'not very'. It may be more so for other arches, but I'm sceptical.
Specifically, regarding the effect of latencies > 30 on instruction scheduling, I don't expect capping them will make any difference to scheduling. On the other hand, the fact that we have latencies over 30 is not actually causing any real problem (apart from this 'error' being flagged).
The pipeline model uses the latency to index a bit mask whose bits represent pipeline stages, looking for overlaps with other instructions in the same bundle. For example, if ins1 uses stage 1, then stage 2, then stage 4 over 3 cycles, and ins2 uses stage 1 and then stage 4 over 2 cycles, then scheduling (ins1;ins2) one cycle apart will clash on stage 4, whereas scheduling (ins2;ins1) will not. What this implies is that the valid bits in any pipeline mask are gated by the number of pipeline elements, and I don't believe any arch has a total pipeline element count >= 30. That means that for a large latency, whether the value is 30 or 100, the lookup will always constitute a miss. The upshot is that the scheduler will not find any reason (beyond input dependencies) to order instructions with large latencies before or after any other instructions. That's probably going to be irrelevant on most modern architectures anyway, since the CPU knows how to reorder instructions better than our pipeline model does.
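To make the bundle check concrete, here is a minimal, self-contained toy model (illustrative names and layout only, not adlc-generated code):

    #include <cstdint>
    #include <cstdio>

    // A pipeline stage occupied at a given cycle is one bit in a 32-bit
    // mask. That is why a latency used as a shift count must stay below 32:
    // shifting a 32-bit value by 100 is undefined behaviour in C++, which
    // is exactly what ubsan reports against the generated ad_x86.hpp.
    static constexpr uint32_t stage_bit(unsigned stage) {
      return stage < 32 ? (1u << stage) : 0u;  // out of range => always a miss
    }

    // Occupied stage per cycle: ins1 uses stages 1, 2, 4; ins2 uses 1, 4.
    static const uint32_t ins1[] = { stage_bit(1), stage_bit(2), stage_bit(4) };
    static const uint32_t ins2[] = { stage_bit(1), stage_bit(4) };

    // Does 'b', issued 'delay' cycles after 'a', collide with 'a' on any stage?
    static bool clashes(const uint32_t* a, int na,
                        const uint32_t* b, int nb, int delay) {
      for (int c = 0; c < nb; c++) {
        int ac = c + delay;
        if (ac < na && (a[ac] & b[c]) != 0) return true;
      }
      return false;
    }

    int main() {
      // (ins1;ins2) one cycle apart clash on stage 4; (ins2;ins1) do not.
      printf("ins1;ins2: %s\n", clashes(ins1, 3, ins2, 2, 1) ? "clash" : "no clash");
      printf("ins2;ins1: %s\n", clashes(ins2, 2, ins1, 3, 1) ? "clash" : "no clash");
      return 0;
    }

Note that with a latency of 100 the guarded stage_bit above simply yields 0, i.e. a guaranteed miss, which is why capping at 30 versus leaving 100 cannot change the schedule.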
-------------
PR Comment: https://git.openjdk.org/jdk/pull/26890#issuecomment-3350739975