[jdk18] RFR: 8273107: RunThese24H times out with "java.lang.management.ThreadInfo.getLockName()" is null

Thu Dec 16 16:26:00 UTC 2021

On Thu, 16 Dec 2021 06:44:03 GMT, Erik Österlund <eosterlund at openjdk.org> wrote:

>> RunThese24H sometimes times out with a couple of error msgs:
>> - "java.lang.management.ThreadInfo.getLockName()" is null
>> - ForkJoin common pool thread stuck
>> 
>> The '"java.lang.management.ThreadInfo.getLockName()" is null' error msg was
>> due to RunThese's use of an older JCK test suite which has since been fixed.
>> 
>> The 'ForkJoin common pool thread stuck' failure mode is likely due to a thread
>> spending a lot of time in ObjectSynchronizer::monitors_iterate() due to a
>> VM_ThreadDump::doit() call. I say "likely" because I've never been able to
>> reproduce this failure mode in testing outside of Mach5. With the Mach5
>> sightings, all we have are thread dumps and core files and not a live process.
>> 
>> The VM_ThreadDump::doit() call is trying to gather owned monitor information
>> for all threads in the system. I've seen sightings of this failure mode with > 2000
>> threads. I've also seen passing runs with > 1.7 million monitors on the in-use list.
>> Imagine searching a larger in-use list for > 2000 threads. It just doesn't scale.
>
> Looks okay, but I found a potential issue.

@fisk - Thanks for the review!

> src/hotspot/share/runtime/synchronizer.cpp line 79:
> 
>> 77:   if (list != nullptr) {
>> 78:     // Add to existing list from the hash table:
>> 79:     list->add(om);
> 
> These two lines are executed in both the if and else clause. Consider putting it after instead.

I'll take a look at doing that. Thanks for the suggestion.

> src/hotspot/share/runtime/synchronizer.cpp line 1062:
> 
>> 1060:                                           JavaThread* thread) {
>> 1061:   typedef LinkedListIterator<ObjectMonitor*> ObjectMonitorIterator;
>> 1062:   ObjectMonitorIterator iter = ObjectMonitorIterator(list->head());
> 
> This seems to create an object and then copy it to iter. I think the more obvious syntax is ObjectMonitorIterator iter(list->head());

Nice catch. I'll fix that.

> src/hotspot/share/runtime/synchronizer.cpp line 1423:
> 
>> 1421:       // deflate_idle_monitors() and deflate_monitor_list() can be called more than
>> 1422:       // once, we have to make sure the entry has not already been added.
>> 1423:       JavaThread* jt = (JavaThread*)mid->owner();
> 
> Hmmmm. I'm not sure this is correct. The owned of an ObjectMonitor is a JavaThread most of the time, but not always. I think that if the owner started off with a stack lock, and another thread inflated the lock due to contention, then the ObjectMonitor will get the stack lock as the owner, as the random other thread is too lazy to look up which thread stack the stack lock belongs to.
> 
> Because of that, I think we might miss locks that really are owned by a thread, as the thread key now will not map to such locks. Some kind of ownership lookup probably needs to happen here.

Nice catch on an oddity of ObjectMonitor ownership!

The baseline code has the same "issue":

`void ObjectSynchronizer::monitors_iterate(MonitorClosure* closure, JavaThread* thread) {`

`<snip>`

    ObjectMonitor* mid = iter.next();
    if (mid->owner() != thread) {
      continue;
    }

Please notice that I said "issue" and not "bug". This isn't a bug and you really
won't like the reason that it's not a bug. I know that I don't like it...

Because of this ownership check `monitors_iterate()` won't apply the closure
(InflatedMonitorsClosure in this case) to an ObjectMonitor that is marked as
owned by a stack lock address because it will never match the target JavaThread*.

Here's the closure:

// Iterate through monitor cache to find JNI locked monitors
class InflatedMonitorsClosure: public MonitorClosure {
private:
  ThreadStackTrace* _stack_trace;
public:
  InflatedMonitorsClosure(ThreadStackTrace* st) {
    _stack_trace = st;
  }
  void do_monitor(ObjectMonitor* mid) {
    oop object = mid->object();
    if (!_stack_trace->is_owned_monitor_on_stack(object)) {
      _stack_trace->add_jni_locked_monitor(object);
    }
  }
};

This particular monitors_iterate() call uses InflatedMonitorsClosure to
gather just the JNI locked monitors owned by the target JavaThread*.
A JNI locked monitor will never have a stack lock address as its owner
marking. It will always have a JavaThread*. So the baseline code isn't
"broken" because it will find the JNI locked monitors just fine. Similarly
the new code isn't "broken" because it will also find the JNI locked
monitors just fine.

I definitely need to clarify the baseline comments and the new code
comments to make this subtlety much more clear.

-------------

PR: https://git.openjdk.java.net/jdk18/pull/25