RFR: 8303612: runtime/StackGuardPages/TestStackGuardPagesNative.java fails with exit code 139

Fri Sep 5 10:07:13 UTC 2025

On Mon, 1 Sep 2025 04:32:05 GMT, David Holmes <dholmes at openjdk.org> wrote:

>> This pull request addresses an issue in `runtime/StackGuardPages/TestStackGuardPagesNative` where the native test component (`exeinvoke.c`) exhibited platform-dependent behavior and did not fully align with the intended test objectives for verifying stack guard page removal on thread detachment.
>> 
>> **Summary of the Problem:**
>> 
>> The `test_native_overflow_initial` scenario within `TestStackGuardPagesNative` showed inconsistent results:
>> *   On certain Linux distributions (e.g., CentOS 7), the test would hang and eventually time out during its second phase of stack allocation.
>> *   On other distributions (e.g., Ubuntu 24), the test would pass, but this pass was found to be coincidental, relying on an unintended `SEGV_MAPERR` to terminate a loop that should have had a defined exit condition.
>> 
>> The core issue was that the native code's second stack overflow attempt, designed to check for guard page removal, used an unbounded loop. Its termination (and thus the test's outcome) depended on platform-specific OS behavior regarding extensive stack allocation after guard pages are supposedly modified.
>> 
>> **Test Objective Analysis:**
>> 
>> The primary goal of `TestStackGuardPagesNative`, particularly for the initial thread (`test_native_overflow_initial`), is to:
>> 1.  **Verify Guard Page Presence:** Confirm that when a native thread is attached to the JVM, a deliberate stack overflow triggers a `SIGSEGV` with `si_code = SEGV_ACCERR`, indicating an active stack guard page.
>> 2.  **Verify Guard Page Removal/Modification:** After the thread detaches from the JVM via `DetachCurrentThread()`, confirm that the previously active stack guard page is no longer enforcing the same strict protection. This is ideally demonstrated by successfully allocating stack space up to the depth that previously caused the `SEGV_ACCERR`, **without encountering any signal**.
>> 
>> **How the Original Implementation Deviated from the Test Intent:**
>> 
>> The native `do_overflow` function, when invoked for the second phase (to check guard page removal), implemented an unconditional `for(;;)` loop.
>> *   **Intended Logic vs. Actual Behavior:** The test intended for this second phase to demonstrate that allocations up to the prior failure depth are now "clean" (no `SEGV_ACCERR`). However, the unbounded loop meant:
>>     *   On systems like CentOS 7, where deep stack allocation without an immediate `SEGV_MAPERR` was possible, this loop ran for an excessive duration, leading to a hang.
>>  ...
>
> I would suggest filing a new bug for the CentOS hang and creating a new PR for the change here, with that new bug ID.

Hi @dholmes-ora and @sendaoYan,

Thank you so much, @sendaoYan, for creating the JBS issue for me! I really appreciate your help in moving this forward.

As suggested by @dholmes-ora, I am now closing this PR because it was linked to the incorrect JBS issue.

I have followed the advice and taken the following steps:
1.  A new JBS issue has been created to correctly track the hang on CentOS: **JDK-8366787**.
2.  I have created a new Pull Request containing the same fix, now linked to this new, correct issue.

The new PR is available here: [#27114](https://github.com/openjdk/jdk/pull/27114)

Thank you again for all your support in getting this on the right track. Please continue all further discussion on the new PR.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/25689#issuecomment-3257797404