RFR: 8327743: JVM crash in hotspot/share/runtime/javaThread.cpp - failed: held monitor count should be equal to jni: 0 != 1 [v5]

Richard Reingruber rrich at openjdk.org
Wed Apr 10 09:04:04 UTC 2024


On Tue, 9 Apr 2024 22:00:22 GMT, David Holmes <dholmes at openjdk.org> wrote:

>> The crux of the problem here is that the virtual thread code was not keeping the held-monitor-count and jni-monitor-count in sync under all conditions. So if a vthread acquired a monitor via JNI but failed to unlock it before terminating, the underlying platform thread's counts were out of sync and if it terminated we would trigger the assertion that checks for such things.
>> 
>> The actual fix is very simple: we zero the platform thread's jni-monitor-count in `continuation_enter_cleanup` the same way we zero the held-monitor-count. In addition we apply the same `CheckJNICalls` check for this unbalanced locking and issue a warning in the virtual thread case. That fact this happens in asm code complicates matters.
>> 
>> The existing `JNIMonitor.java` test is greatly expanded to test these scenarios and check the unified logging output.
>> 
>> Other minor changes involve expanding some of the other assertions relating to the two counts so we can detected a mismatch earlier without a need for the thread to terminate. And the test that original uncovered the problem (`GetOwnedMonitorInfoTest.java`) has some minor adjustments to enhance diagnostics.
>> 
>> I've provided the fix for all architectures that support continuations: x64, aarch64, riscv and ppc. The latter both build okay in GHA but I can't actually test them with the updated test. So some assistance from RISCV folk (@robehn ?) and PPC folk (??) would be appreciated (otherwise any issues will have to be handled as follow up fixes
>> 
>> The changes are structured so that there is no extra code executed in product builds unless `CheckJNICalls` is set. This means that product builds will not keep the JNI count in sync with the held count, unless `CheckJNICalls` is set. This could trip up a future logging entry or explicit check of the JNI count, but it is expected that these counts will be removed once ObjectMonitor usage will not force virtual thread pinning.
>> 
>> Testing:
>>  - regression test 10x on all x64 and aarch64 platforms
>>  - tiers 1-4
>>  - GHA
>> 
>> 
>> Thanks to @pchilano for help working out the best form of the fix and the initial asm for x64.
>> 
>> Thanks to @fbredber for the Aarch64 and RISCV asm code.
>> 
>> Thanks
>
> David Holmes has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Cleanup test leftovers

Testing didn't bring up any issue. I've only got a few minor comments.
Thanks, Richard.

src/hotspot/cpu/ppc/sharedRuntime_ppc.cpp line 1:

> 1: /*

Copyright header needs update

src/hotspot/cpu/ppc/sharedRuntime_ppc.cpp line 1673:

> 1671:     // Save return value potentially containing the exception oop
> 1672:     Register ex_oop = R15_esp;   // nonvolatile register
> 1673:     __ mr(ex_oop, R3_RET);

Please add `R15_esp` to the Kills section above in the header comment.

src/hotspot/cpu/ppc/sharedRuntime_ppc.cpp line 1675:

> 1673:     __ mr(ex_oop, R3_RET);
> 1674:     __ call_VM_leaf(CAST_FROM_FN_PTR(address, SharedRuntime::log_jni_monitor_still_held));
> 1675:     // Restore potentional return value

Nit
Suggestion:

    // Restore potential return value

src/hotspot/cpu/ppc/sharedRuntime_ppc.cpp line 1680:

> 1678:     // For vthreads we have to explicitly zero the JNI monitor count of the carrier
> 1679:     // on termination. The held count is implicitly zeroed below when we restore from
> 1680:     // the parent held count (which has to be zero).

This comment is not quite correct or a little imprecise, I found.

> the parent held count (which has to be zero)

I think technically the held count (_held_monitor_count) could be non-zero. Of course it would likely be bad if it was (holding a monitor while suspended is usually not good).

I thought it was like this:


The JNI monitor count of the carrier thread is required to be zero when
switching to the virtual thread. Here we are switching back to the carrier.
We have to restore its JNI monitor count of zero.

-------------

PR Review: https://git.openjdk.org/jdk/pull/18445#pullrequestreview-1991008592
PR Review Comment: https://git.openjdk.org/jdk/pull/18445#discussion_r1559081058
PR Review Comment: https://git.openjdk.org/jdk/pull/18445#discussion_r1559078219
PR Review Comment: https://git.openjdk.org/jdk/pull/18445#discussion_r1559002331
PR Review Comment: https://git.openjdk.org/jdk/pull/18445#discussion_r1559059557


More information about the hotspot-dev mailing list