RFR: 8303612: runtime/StackGuardPages/TestStackGuardPagesNative.java fails with exit code 139
mazhen
duke at openjdk.org
Mon Aug 4 18:47:44 UTC 2025
This pull request addresses an issue in `runtime/StackGuardPages/TestStackGuardPagesNative` where the native test component (`exeinvoke.c`) exhibited platform-dependent behavior and did not fully align with the intended test objectives for verifying stack guard page removal on thread detachment.
**Summary of the Problem:**
The `test_native_overflow_initial` scenario within `TestStackGuardPagesNative` showed inconsistent results:
* On certain Linux distributions (e.g., CentOS 7), the test would hang and eventually time out during its second phase of stack allocation.
* On other distributions (e.g., Ubuntu 24), the test would pass, but this pass was found to be coincidental, relying on an unintended `SEGV_MAPERR` to terminate a loop that should have had a defined exit condition.
The core issue was that the native code's second stack overflow attempt, designed to check for guard page removal, used an unbounded loop. Its termination (and thus the test's outcome) depended on platform-specific OS behavior regarding extensive stack allocation after guard pages are supposedly modified.
**Test Objective Analysis:**
The primary goal of `TestStackGuardPagesNative`, particularly for the initial thread (`test_native_overflow_initial`), is to:
1. **Verify Guard Page Presence:** Confirm that when a native thread is attached to the JVM, a deliberate stack overflow triggers a `SIGSEGV` with `si_code = SEGV_ACCERR`, indicating an active stack guard page.
2. **Verify Guard Page Removal/Modification:** After the thread detaches from the JVM via `DetachCurrentThread()`, confirm that the previously active stack guard page is no longer enforcing the same strict protection. This is ideally demonstrated by successfully allocating stack space up to the depth that previously caused the `SEGV_ACCERR`, **without encountering any signal**.
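The two objectives above can be written down as pass criteria. The following is only a sketch: `phase_outcome`, `phase1_passed`, and `phase2_passed` are hypothetical names that do not appear in `exeinvoke.c`, and the real result-checking logic in `run_native_overflow` may be structured differently.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* ask for the POSIX si_code constants */
#endif
#include <signal.h>
#include <stdbool.h>

/* Hypothetical record of what one overflow phase observed. */
typedef struct {
    bool got_signal;  /* true if the SIGSEGV handler ran during the phase */
    int  si_code;     /* si_code reported by the handler, if it ran */
} phase_outcome;

/* Phase 1 (attached thread): the guard page must fault with SEGV_ACCERR. */
static bool phase1_passed(phase_outcome o) {
    return o.got_signal && o.si_code == SEGV_ACCERR;
}

/* Phase 2 (after DetachCurrentThread): no signal at all up to the old depth.
 * A SEGV_MAPERR is deliberately not counted as success here. */
static bool phase2_passed(phase_outcome o) {
    return !o.got_signal;
}
```

Under these assumed criteria, a second phase that ends in `SEGV_MAPERR` is reported as a failure rather than a pass.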
**How the Original Implementation Deviated from the Test Intent:**
The native `do_overflow` function, when invoked for the second phase (to check guard page removal), implemented an unconditional `for(;;)` loop.
* **Intended Logic vs. Actual Behavior:** The test intended for this second phase to demonstrate that allocations up to the prior failure depth are now "clean" (no `SEGV_ACCERR`). However, the unbounded loop meant:
    * On systems like CentOS 7, where deep stack allocation without an immediate `SEGV_MAPERR` was possible, this loop ran for an excessive duration, leading to a hang.
    * On systems like Ubuntu 24, the loop *was* terminated by a `SEGV_MAPERR`. The existing result-checking logic in `run_native_overflow` incorrectly treated this `SEGV_MAPERR` as a "pass" condition for guard page removal, which is misleading. The true indication of successful guard page removal is the *absence* of any signal during the controlled second allocation phase.
This reliance on incidental, platform-dependent signal behavior to terminate the loop (and the misinterpretation of `SEGV_MAPERR` as a success) meant the test was not robustly verifying its core objective.
**Proposed Solution:**
This PR modifies the `do_overflow` function in `exeinvoke.c` so that both phases of stack overflow checking behave deterministically. The function now uses a single `while` loop whose condition covers both the initial overflow check and the subsequent verification of guard page removal:
    void do_overflow(){
      volatile int *p = NULL;
      // The loop condition is true if:
      // 1. It's the first run (_kp_rec_count == 0), causing the loop to run until SIGSEGV.
      // 2. It's the second run (_kp_rec_count > 0) AND the current allocation depth (_rec_count)
      //    is less than the depth that previously caused SIGSEGV (_kp_rec_count).
      while (_kp_rec_count == 0 || _rec_count < _kp_rec_count) {
        _rec_count++;
        p = (int*)alloca(128);
        _peek_value = p[0]; // Ensure memory is touched to trigger guard page if active
      }
      // - On the first run, this point is never reached due to longjmp from the signal handler.
      // - On the second run, if no SIGSEGV occurs, the loop completes when _rec_count == _kp_rec_count.
    }
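To make the loop condition easier to reason about, here is a small model of it that runs without touching the real stack. This is a sketch under stated assumptions: `model_do_overflow`, `loop_result`, and the `fault_depth` parameter are hypothetical and stand in for the real `alloca` calls and the signal handler.

```c
#include <stdbool.h>

/* Hypothetical model, not part of exeinvoke.c. */
typedef struct {
    long iterations;  /* allocations attempted before the loop stopped */
    bool faulted;     /* true if the simulated signal ended the loop */
} loop_result;

/* kp_rec_count: depth recorded by the first run (0 means "first run").
 * fault_depth:  iteration at which a simulated SIGSEGV fires (0 means never).
 * Note: kp_rec_count == 0 with fault_depth == 0 would never terminate,
 * which mirrors the first run's reliance on the guard page fault. */
static loop_result model_do_overflow(long kp_rec_count, long fault_depth) {
    loop_result r = { 0, false };
    long rec_count = 0;
    while (kp_rec_count == 0 || rec_count < kp_rec_count) {
        rec_count++;
        if (fault_depth != 0 && rec_count == fault_depth) {
            r.faulted = true;  /* stands in for longjmp from the handler */
            break;
        }
    }
    r.iterations = rec_count;
    return r;
}
```

In this model, a second run with no fault stops after exactly `kp_rec_count` iterations, while a fault at a shallower depth ends the loop early.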
**Impact and Clarity of Test Outcome with This Change:**
* **Prevents Hangs:** The second phase of `do_overflow` can no longer loop indefinitely (or for an excessively long time). The loop is now bounded by `_kp_rec_count`, which directly resolves the hang observed on platforms like CentOS 7.
* **Deterministic Execution of `do_overflow`:** The `do_overflow` function will now consistently:
    * In the first run: Execute until a `SIGSEGV` occurs and `longjmp` is called.
    * In the second run: Execute at most `_kp_rec_count` allocations. If no `SIGSEGV` occurs, it will complete this bounded loop and return normally. If a `SIGSEGV` *does* occur within these `_kp_rec_count` allocations, `longjmp` will be called.
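The two exit paths described above (leaving via `longjmp` versus returning normally) can be illustrated with plain `setjmp`/`longjmp` and no real signal. This is a hedged sketch: `ctx`, `fake_overflow`, `run_phase`, and `fault_at` are hypothetical names, and the `longjmp` is called directly where the actual test would rely on its `SIGSEGV` handler.

```c
#include <setjmp.h>

static jmp_buf ctx;              /* hypothetical stand-in for the test's jump buffer */
static volatile long rec_count;  /* allocations attempted in the current phase */

/* Bounded loop in the shape of the patched do_overflow. fault_at simulates
 * the depth at which the signal handler would longjmp; 0 means no fault. */
static void fake_overflow(long kp_rec_count, long fault_at) {
    while (kp_rec_count == 0 || rec_count < kp_rec_count) {
        rec_count++;
        if (fault_at != 0 && rec_count == fault_at) {
            longjmp(ctx, 1);     /* what the signal handler would do */
        }
    }
}

/* Returns 1 if the phase ended via longjmp, 0 if the loop returned normally. */
static int run_phase(long kp_rec_count, long fault_at) {
    rec_count = 0;
    if (setjmp(ctx) == 0) {
        fake_overflow(kp_rec_count, fault_at);
        return 0;
    }
    return 1;
}
```

Calling `run_phase(0, 100)` models the first run and always leaves through `longjmp`; `run_phase(100, 0)` models a clean second run and returns normally after 100 iterations.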
This refinement ensures `TestStackGuardPagesNative` more accurately and reliably verifies the stack guard page lifecycle for native threads attached to the JVM.
-------------
Commit messages:
- update copyright year in StackGuardPages test
- 8303612: runtime/StackGuardPages/TestStackGuardPagesNative.java fails with exit code 139
Changes: https://git.openjdk.org/jdk/pull/25689/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=25689&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8303612
Stats: 8 lines in 2 files changed: 0 ins; 3 del; 5 mod
Patch: https://git.openjdk.org/jdk/pull/25689.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/25689/head:pull/25689
PR: https://git.openjdk.org/jdk/pull/25689