[jdk18] RFR: 8273107: RunThese24H times out with "java.lang.management.ThreadInfo.getLockName()" is null

Wed Dec 15 18:36:08 UTC 2021

On Wed, 15 Dec 2021 08:34:28 GMT, Robbin Ehn <rehn at openjdk.org> wrote:

>> RunThese24H sometimes times out with a couple of error msgs:
>> - "java.lang.management.ThreadInfo.getLockName()" is null
>> - ForkJoin common pool thread stuck
>> 
>> The '"java.lang.management.ThreadInfo.getLockName()" is null' error msg was
>> due to RunThese's use of an older JCK test suite which has since been fixed.
>> 
>> The 'ForkJoin common pool thread stuck' failure mode is likely due to a thread
>> spending a lot of time in ObjectSynchronizer::monitors_iterate() due to a
>> VM_ThreadDump::doit() call. I say "likely" because I've never been able to
>> reproduce this failure mode in testing outside of Mach5. With the Mach5
>> sightings, all we have are thread dumps and core files and not a live process.
>> 
>> The VM_ThreadDump::doit() call is trying to gather owned monitor information
>> for all threads in the system. I've seen sightings of this failure mode with > 2000
>> threads. I've also seen passing runs with > 1.7 million monitors on the in-use list.
>> Imagine searching a larger in-use list for > 2000 threads. It just doesn't scale.
>
> src/hotspot/share/runtime/vmOperations.cpp line 283:
> 
>> 281: 
>> 282:   ObjectMonitorsHashtable table;
>> 283:   ObjectMonitorsHashtable* tablep = nullptr;
> 
> It looks like you can remove this pointer.
> ThreadStackTrace::dump_stack_at_safepoint also looks at "with_locked_monitors", if false it ignores the table.
> So there should not be any need to pass in null.

Right now `ThreadStackTrace::dump_stack_at_safepoint()` is flexible in
that it does not require `ObjectMonitorsHashtable* table` to be non-null
when `_with_locked_monitors` is true. 

>From this call site, when `_with_locked_monitors` is true we always pass
a non-nullptr `ObjectMonitorsHashtable* table` and we do that because
we're at a safepoint and likely have to deal with a lot of threads if not all
the threads in the system.

For other `ThreadStackTrace::dump_stack_at_safepoint()` call sites,
I want things to be more flexible. Mostly because I haven't done
detailed analysis of them to justify changing their behavior.

> src/hotspot/share/services/threadService.cpp line 692:
> 
>> 690:       ObjectMonitorsHashtable::PtrList* list = table->get_entry(_thread);
>> 691:       if (list != nullptr) {
>> 692:         ObjectSynchronizer::monitors_iterate(&imc, list, _thread);
> 
> The InflatedMonitorsClosure walks the stack until object is found with this method:
> ThreadStackTrace::is_owned_monitor_on_stack(oop object)
> for every object...
> 
> If you instead just collect all locks on stack with one pass you don't have to walk the stack over and over, which should be major speedup.

Sounds like an interesting RFE, but I'd prefer that be investigated
separately from this bug.

-------------

PR: https://git.openjdk.java.net/jdk18/pull/25