RFR: 8303612: runtime/StackGuardPages/TestStackGuardPagesNative.java fails with exit code 139
mazhen
duke at openjdk.org
Tue Aug 26 11:04:35 UTC 2025
On Tue, 5 Aug 2025 02:28:41 GMT, David Holmes <dholmes at openjdk.org> wrote:
>> Hi @jdksjolen ,
>>
>> Following up on this. As per the bot's message two weeks ago, I've been waiting for my OCA to be processed, but it seems to be stuck.
>> I understand that PRs are not reviewed until the OCA is cleared, but since the suggested two-week waiting period has passed, I was hoping someone could help to check or escalate the status of my OCA application internally.
>>
>> My Oracle Account Email: mz1999 at gmail.com
>>
>> Any help would be greatly appreciated. Thank you!
>
> @mz1999 the forever-loop was added as part of the "hardening" changes by JDK-8295344, so it seems you are saying that introduced a new bug? How does your proposed change compare to the code that we had prior to the "hardening"? I just want to try and see what the original problem with this code was. Thanks
Hi @dholmes-ora and @jdksjolen ,
Apologies for the delayed response.
Thank you for the comments. I wasn't aware of the historical context behind this test and the changes from JDK-8295344. My investigation started simply because I observed that the test behaved differently on two platforms. To help clarify what I'm seeing, let me walk you through my testing process with the original, unmodified `exeinvoke.c` code.
I reverted my local changes and ran the test on both Ubuntu 24 and CentOS 7.
### Test Environments
| Component | Ubuntu 24.04 LTS | CentOS 7.9.2009 |
| -------------- | ----------------------------------------- | ------------------------------------ |
| **OS Release** | Ubuntu 24.04.3 LTS | CentOS Linux release 7.9.2009 (Core) |
| **Kernel** | 6.8.0-78-generic | 3.10.0-1160.el7.x86_64 |
| **glibc** | ldd (Ubuntu GLIBC 2.39-0ubuntu8.5) 2.39 | ldd (GNU libc) 2.17 |
| **GCC** | gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 | gcc (GCC) 13.2.0 |
### Test on Ubuntu 24.04 LTS
I ran the test using the following command to retain all test artifacts:
make test TEST="test/hotspot/jtreg/runtime/StackGuardPages/TestStackGuardPagesNative.java" JTREG="RETAIN=all"
The test passed successfully. The `TestStackGuardPagesNative.jtr` file shows:
...
----------System.out:(8/626)----------
[2025-08-26T07:35:59.646208363Z] Gathering output for process 379416
[2025-08-26T07:35:59.656232016Z] Waiting for completion for process 379416
[2025-08-26T07:35:59.656384019Z] Waiting for completion finished for process 379416
Output and diagnostic info for process 379416 was saved into 'pid-379416-output.log'
[2025-08-26T07:35:59.658186828Z] Gathering output for process 379439
[2025-08-26T07:35:59.658413390Z] Waiting for completion for process 379439
[2025-08-26T07:35:59.677953883Z] Waiting for completion finished for process 379439
Output and diagnostic info for process 379439 was saved into 'pid-379439-output.log'
----------System.err:(3/35)----------
JavaTest Message: Test complete.
result: Passed. Execution successful
test result: Passed. Execution successful
To understand *how* it passed, I examined the native process log for the `test_native_overflow_initial` scenario (`pid-379439-output.log`):
--- ProcessLog ---
cmd: /home/mazhen/works/jdk/build/linux-x86_64-server-release/images/test/hotspot/jtreg/native/invoke test_native_overflow_initial
exitvalue: 0
stderr:
stdout: Test started with pid: 379439
Java thread is alive.
Testing NATIVE_OVERFLOW
Testing stack guard page behaviour for initial thread
run_native_overflow 379439
Got SIGSEGV(2) at address: 0x7fff45e5fff0
Test PASSED. Got access violation accessing guard page at 7104
Got SIGSEGV(1) at address: 0x7fff4575bf80
Test PASSED. No stack guard page is present. SIGSEGV(1) at 58191
The log shows that the second `do_overflow` call for the initial thread was terminated by a `SIGSEGV(1)` (`SEGV_MAPERR`). This observation prompted me to think about what the test's true intent might be.
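For reference, the number in parentheses in these log lines is the signal's `si_code`. On Linux, 2 is `SEGV_ACCERR` (the page is mapped but the access is not permitted, as with a guard page) and 1 is `SEGV_MAPERR` (nothing is mapped at the address). Below is a minimal sketch of the difference, assuming Linux/POSIX. It is not code from `exeinvoke.c`, and `on_segv`, `probe_si_code`, `fault_code_protected_page` and `fault_code_unmapped_page` are hypothetical names used only for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf g_env;
static volatile sig_atomic_t g_si_code;

/* Record why the fault happened, then jump back out of the faulting write. */
static void on_segv(int sig, siginfo_t *info, void *ctx) {
    (void)sig;
    (void)ctx;
    g_si_code = info->si_code;
    siglongjmp(g_env, 1);
}

/* Write one byte to addr; return the SIGSEGV si_code, or -1 if no signal. */
static int probe_si_code(volatile char *addr) {
    struct sigaction sa, old;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    int rc = sigaction(SIGSEGV, &sa, &old);
    assert(rc == 0);
    (void)rc;

    g_si_code = -1;
    if (sigsetjmp(g_env, 1) == 0) {
        *addr = 1; /* faults if the page is protected or unmapped */
    }
    sigaction(SIGSEGV, &old, NULL); /* restore the previous handler */
    return g_si_code;
}

/* Mapped with PROT_NONE, like a guard page: expected SEGV_ACCERR.
 * Also returns -1 if the mapping itself could not be created. */
static int fault_code_protected_page(void) {
    size_t pg = (size_t)sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, pg, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;
    int code = probe_si_code(p);
    munmap(p, pg);
    return code;
}

/* Unmapped right before the access: expected SEGV_MAPERR. */
static int fault_code_unmapped_page(void) {
    size_t pg = (size_t)sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, pg, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;
    munmap(p, pg);
    return probe_si_code(p);
}
```

In this sketch the protected page plays the role of the guard page seen in the first `do_overflow` call, while the unmapped page corresponds to running off the end of the mapped stack region.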
Of course, this is just my interpretation, and I may be mistaken. From my reading, the test is designed as a two-phase process:
1. **Probe:** The first `do_overflow` call finds the exact allocation depth (`_rec_count`) that triggers the `SEGV_ACCERR` from the guard page.
2. **Verify:** The second `do_overflow` call should then confirm that after detaching the thread, it's safe to allocate up to that same depth again, this time completing **without any signal**.
A true "pass" for this verification should therefore correspond to the `_last_si_code == -1` check in the code, which explicitly handles the "no signal" scenario.
} else if (_last_si_code == -1) {
    printf("Test PASSED. No stack guard page is present. Maximum recursion level reached at %d\n", _rec_count);
}
However, the current test passes on Ubuntu not for this reason, but because the second overflow attempt is coincidentally terminated by a different signal, a `SEGV_MAPERR`. This shows that the test is not passing as intended.
### Test on CentOS 7 (Hangs and Times Out)
Running the same test on CentOS 7 resulted in a timeout after 480 seconds. The `TestStackGuardPagesNative.jtr` file shows:
...
----------System.out:(6/436)----------
[2025-08-26T09:40:07.939476649Z] Gathering output for process 323
[2025-08-26T09:40:07.959566183Z] Waiting for completion for process 323
[2025-08-26T09:40:08.082615370Z] Waiting for completion finished for process 323
Output and diagnostic info for process 323 was saved into 'pid-323-output.log'
[2025-08-26T09:40:08.095675574Z] Gathering output for process 344
[2025-08-26T09:40:08.096106994Z] Waiting for completion for process 344
result: Error. "main" action timed out with a timeout of 480 seconds on agent 2
test result: Error. "main" action timed out with a timeout of 480 seconds on agent 2
I checked the test artifacts. The log for the `test_native_overflow` (non-initial thread) scenario, `pid-323-output.log`, was generated and showed a pass, as expected:
--- ProcessLog ---
cmd: /root/jdk/build/linux-x86_64-server-release/images/test/hotspot/jtreg/native/invoke test_native_overflow
exitvalue: 0
stderr:
stdout: Test started with pid: 323
Java thread is alive.
Testing NATIVE_OVERFLOW
Testing stack guard page behaviour for other thread
run_native_overflow 342
Got SIGSEGV(2) at address: 0x2b4fffc94fb0
Test PASSED. Got access violation accessing guard page at 7137
Test PASSED. Not initial thread
However, the log for the failing `test_native_overflow_initial` scenario was not generated at all. This strongly suggests the process hung indefinitely in the `for(;;)` loop and was terminated before it had a chance to flush its output.
### Moving Forward
Your comments have been incredibly helpful. They've made me realize that this issue is far more complex than I first understood. My initial PR was focused squarely on fixing the hang symptom I observed on CentOS 7.
I will need to do some further investigation to fully understand the original problem that the `for(;;)` loop was intended to solve.
Thank you again for your guidance and patience.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/25689#issuecomment-3223693969