RFR: 8338197: [ubsan] ad_x86.hpp:6417:11: runtime error: shift exponent 100 is too large for 32-bit type 'unsigned int' [v5]
Boris Ulasevich
bulasevich at openjdk.org
Tue Sep 30 16:35:22 UTC 2025
On Wed, 17 Sep 2025 18:32:10 GMT, Boris Ulasevich <bulasevich at openjdk.org> wrote:
>> This reworks the recent update https://github.com/openjdk/jdk/pull/24696 to fix a UBSan issue on aarch64. The problem now reproduces on x86_64 as well, which suggests the previous update was not optimal.
>>
>> The issue reproduces with the HeapByteBufferTest jtreg test on a UBSan-enabled build. The actual trigger is the `-XX:+OptoScheduling` option used by the test (by default, OptoScheduling is disabled on most x86 CPUs). With the option enabled, the failure can be reproduced with a simple `java -version` run.
>>
>> This fix is in ADLC-generated code. For simplicity, the examples below show the generated fragments.
>>
>> The problem is that the shift count `n` may be too large here:
>>
>> class Pipeline_Use_Cycle_Mask {
>> protected:
>>   uint _mask;
>>   ..
>>   Pipeline_Use_Cycle_Mask& operator<<=(int n) {
>>     _mask <<= n;
>>     return *this;
>>   }
>> };
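>>
>> For illustration only (a sketch, not the generated code): `_mask` is a 32-bit `uint`, so a shift count of 100 exceeds the type width, which is undefined behavior in C++ and is exactly what UBSan reports:
>>
>>   Pipeline_Use_Cycle_Mask m;
>>   m <<= 100; // UB: shift exponent 100 is too large for 32-bit 'unsigned int'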
>>
>> The recent change attempted to cap the shift amount at one call site:
>>
>> class Pipeline_Use_Element {
>> protected:
>>   ..
>>   // Mask of specific used cycles
>>   Pipeline_Use_Cycle_Mask _mask;
>>   ..
>>   void step(uint cycles) {
>>     _used = 0;
>>     uint max_shift = 8 * sizeof(_mask) - 1;
>>     _mask <<= (cycles < max_shift) ? cycles : max_shift;
>>   }
>> };
>>
>> However, there is another site where `Pipeline_Use_Cycle_Mask::operator<<=` can be called with a too-large shift count:
>>
>> // The following two routines assume that the root Pipeline_Use entity
>> // consists of exactly 1 element for each functional unit
>> // start is relative to the current cycle; used for latency-based info
>> uint Pipeline_Use::full_latency(uint delay, const Pipeline_Use &pred) const {
>>   for (uint i = 0; i < pred._count; i++) {
>>     const Pipeline_Use_Element *predUse = pred.element(i);
>>     if (predUse->_multiple) {
>>       uint min_delay = 7;
>>       // Multiple possible functional units, choose first unused one
>>       for (uint j = predUse->_lb; j <= predUse->_ub; j++) {
>>         const Pipeline_Use_Element *currUse = element(j);
>>         uint curr_delay = delay;
>>         if (predUse->_used & currUse->_used) {
>>           Pipeline_Use_Cycle_Mask x = predUse->_mask;
>>           Pipeline_Use_Cycle_Mask y = currUse->_mask;
>>
>>           for ( y <<= curr_delay; x.overlaps(y); curr_delay++ )
>>             y <<= 1;
>>         }
>>         if (min_delay > curr_delay)
>>           min_delay = curr_delay;
>>       }
>>       if (delay < min_delay)
>>         delay = min_delay;
>>     }
>>     else {
>>       for (uint j = predUse->_lb; j <= pre...
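>>
>> Here `y <<= curr_delay` receives the raw `delay` value, so with the fixed_latency(100) used for calls/traps the operator can be invoked with a shift count of 100, which is presumably the exponent seen in the UBSan report.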
>
> Boris Ulasevich has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits:
>
> - reduce fixed_latency(100) to fixed_latency(30) for calls/traps on ARM, PPC, RISC-V, X86
> - use uint32_t for _mask
> - remove redundant code
> - 8338197: ubsan: ad_x86.hpp:6417:11: runtime error: shift exponent 100 is too large for 32-bit type 'unsigned int'
Thanks for the detailed feedback, Andrew.
We cannot tolerate unpredictable C++ behavior. Independently of pipeline model accuracy or rule weighting, relying on implementation-defined/undefined behavior (shifts beyond the bit width) is not acceptable. UBSan flags it for a reason, and the fix here is to make the shift provably safe at the point where it happens, so that all call sites are covered: no surprises, no sanitizer noise, no dependence on compiler whims.
What this PR actually does is apply the cap `_mask <<= (cycles < max_shift) ? cycles : max_shift`. This capping already exists today; the change just moves it to the correct place (inside the operator) so it is always applied. In effect, it merely formalizes what we intend the hardware/scheduler to see: a saturated shift rather than an out-of-range one. By design this should be a no-functional-change for code generation and strictly a safety/robustness improvement.
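To make that concrete, here is a minimal sketch of the capped operator (illustrative only; the real code is ADLC-generated, and the cap mirrors the existing one in `step()`):

  Pipeline_Use_Cycle_Mask& operator<<=(int n) {
    // Saturate the shift count: shifting a 32-bit mask by >= 32 bits
    // is undefined behavior, so clamp to the widest legal shift (31).
    int max_shift = 8 * (int)sizeof(_mask) - 1;
    _mask <<= (n < max_shift) ? n : max_shift;
    return *this;
  }

With the cap inside the operator, both the `step()` and `full_latency()` call sites are covered without touching each caller.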
I agree that the broader question (how "latency" values are used both as rule weights and as scheduler costs) deserves its own discussion. On modern out-of-order AArch64 cores, global scheduling arguably brings little extra value, so in principle we could constrain or disable it. That said, when I tried disabling the global OptoScheduling option on AArch64, I found at least one benchmark that showed a regression.
About changing fixed_latency(100->30): I acknowledge this is debatable. My expectation is that staying within the mask's representable range does not change codegen, but if this part is contentious I am fine with dropping it from this PR and keeping only the safe-shift fix. Alternatively, I can validate the impact by mechanically diffing large disassembly sets; I expect to find zero differences.
Latencies do affect selection, and that is OK. Yes, I have observed selection changes when adjusting latencies too. However, I believe the AD rules/models are sane; if they are not, we will file an issue and correct it in a separate RFE, rather than keep undefined behavior around to preserve the status quo.
Proposal for this PR:
- land the shift-capping at the source (inside the operator) as a strict safety fix.
- optionally omit the 100->30 tweak.
Let’s keep this PR as a minimal sanitizer fix and move the model/weighting discussion to a follow-up RFE. I’m open to doing some research in this area, though I’m not entirely sure what the best starting point would be. That said, if you feel strongly against the latency change, I don’t mind dropping it and leaving the rest as is.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/26890#issuecomment-3352975096