RFR: 8347406: [REDO] C1/C2 don't handle allocation failure properly during initialization (RuntimeStub::new_runtime_stub fatal crash)

Thu Feb 20 21:38:54 UTC 2025

On Fri, 14 Feb 2025 11:04:20 GMT, Damon Fenacci <dfenacci at openjdk.org> wrote:

> # Issue
> The test `src/hotspot/share/opto/c2compiler.cpp` fails intermittently due to a crash that happens when trying to allocate code cache space for C1 and C2 in `RuntimeStub::new_runtime_stub` and `SingletonBlob::operator new`.
> 
> # Causes
> There are a few call paths during the initialization of C1 and C2 that can lead to the code cache allocations in `RuntimeStub::new_runtime_stub` (through `RuntimeStub::operator new`) and `SingletonBlob::operator new` triggering a fatal error if there is no more space. The paths in question are:
> 1. `Compiler::init_c1_runtime` -> `Runtime1::initialize` -> `Runtime1::generate_blob_for` -> `Runtime1::generate_blob` -> `RuntimeStub::new_runtime_stub`
> 1. `C2Compiler::initialize` -> `C2Compiler::init_c2_runtime` -> `OptoRuntime::generate` -> `OptoRuntime::generate_stub` -> `Compile::Compile` -> `Compile::Code_Gen` -> `PhaseOutput::install` -> `PhaseOutput::install_stub` -> `RuntimeStub::new_runtime_stub`
> 1. `C2Compiler::initialize` -> `C2Compiler::init_c2_runtime` -> `OptoRuntime::generate` -> `OptoRuntime::generate_uncommon_trap_blob` -> `UncommonTrapBlob::create` -> `new UncommonTrapBlob`
> 1. `C2Compiler::initialize` -> `C2Compiler::init_c2_runtime` -> `OptoRuntime::generate` -> `OptoRuntime::generate_exception_blob` -> `ExceptionBlob::create` -> `new ExceptionBlob`
> 
> # Solution
> Instead of fatally crashing the we can use the `alloc_fail_is_fatal` flag of `RuntimeStub::new_runtime_stub` to avoid crashing in cases 1 and 2 and add a similar flag to `SingletonBlob::operator new` for cases 3 and 4. In the latter case we need to adjust all calls accordingly.
> 
> Note: In [JDK-8326615](https://bugs.openjdk.org/browse/JDK-8326615) it was argued that increasing the minimum code cache size would solve the issue but that wasn't entirely accurate: doing so possibly decreases the chances of a failed allocation in these 4 places but doesn't totally avoid it.
> 
> # Testing
> The original failing regression test in `test/hotspot/jtreg/compiler/startup/StartupOutput.java` has been modified to run multiple times with randomized values (within the original failing range) to increase the chances of hitting the fatal assertion.
> 
> Tests: Tier 1-4 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; release and debug mode)

src/hotspot/share/gc/shenandoah/c1/shenandoahBarrierSetC1.cpp line 306:

> 304:                                 _load_reference_barrier_phantom_rt_code_blob != nullptr;
> 305:   }
> 306:   return _pre_barrier_c1_runtime_code_blob != nullptr && reference_barrier_success;

Wouldn't it be better to return false immediately after each failure, rather than continuing?

src/hotspot/share/gc/z/c1/zBarrierSetC1.cpp line 543:

> 541:   _store_barrier_on_oop_field_without_healing =
> 542:     generate_c1_store_runtime_stub(blob, false /* self_healing */, "store_barrier_on_oop_field_without_healing");
> 543:   return _load_barrier_on_oop_field_preloaded_runtime_stub != nullptr &&

Again, why not return false immediately on first failure?

src/hotspot/share/opto/output.cpp line 3487:

> 3485:         C->record_failure("CodeCache is full");
> 3486:       } else {
> 3487:         C->set_stub_entry_point(rs->entry_point());

Is the deleted rs->is_runtime_stub() assert still useful here?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23630#discussion_r1964370135
PR Review Comment: https://git.openjdk.org/jdk/pull/23630#discussion_r1964371192
PR Review Comment: https://git.openjdk.org/jdk/pull/23630#discussion_r1964372958