RFR(S) 8130448: thread dump improvements, comment, additions, new diagnostics inspired by 8077392

Thu Jul 9 05:13:48 UTC 2015

On 8/07/2015 3:28 AM, Daniel D. Daugherty wrote:
> On 7/7/15 1:11 AM, David Holmes wrote:
>> Hi Dan,
>>
>> Top posting as I'm lazy and running late :)
>
> No problem with the top posting. It makes your remaining
> concerns earlier to see and address.
>
>
>> Okay ... I'm still uneasy about adding two additional flags for the
>> "release monitors on exit" situation. I hear what you are saying about
>> the cost of the monitor check - I had thought we kept easy access to
>> the set of monitors owned by a thread but it seems not, so trawling
>> the global monitor list is indeed an expensive operation. :( So okay
>> but I don't like it :)
>
> Since JavaThreadExitReleasesMonitors and GuaranteeOnMonitorMismatch
> are targeted at finding/working around bugs in the Java Monitor
> implementation, I'm going to take a look at moving the flag support
> down into Java Monitor subsystem's implementation specific options.
> That will get the options out of the global space and they won't
> have to be counted against the "cap and trade" limit :-)
>
>
>> I also take your point about viewing the new flag as a superset of the
>> JNI-detach case. That's fair enough, but of course that aspect needs
>> to be documented.
>
> I'm starting to wonder why we have the 'JNIDetachReleasesMonitors'
> option at all. Is there a good reason to be able to turn off this
> piece of code?

My recollection is that hotspot did not implement that part of the spec. 
Simply switching the behaviour with no way to restore the original 
behaviour could break existing apps. Hence a flag, on by default.

Of course by now apps should be spec compliant themselves so this seems 
a flag that should itself be deprecated in 9 and targetted for removal 
in 10. (That will help with the 'conservation of flags' problem :) ).

>
>
>> I was left unclear about your position on the VerboseStackTrace. I
>> agree the new output was likely put under a flag to avoid disturbing
>> existing output formats. But I don't see that it warrants its own flag
>> - to me using Verbose okay (though the fact it is a develop flag is
>> limiting).
>
> Thanks for reminding me that '-XX:+Verbose' is a develop flag.
> That's another reason for not using that flag. Combined with
> the fact that '-XX:+Verbose' is considered a modifier to other
> flags and thread dumps aren't enabled/triggered by a flag means
> (to me anyway) that '-XX:+Verbose' is still not a good match.
>
> However, I have a vague memory that the Java Monitor subsystem
> has its own "verbose" flag. I'll take a look at switching to
> that...

Okay

> Will be working on the next round...

Thanks,
David

> Dan
>
>
>
>>
>> Thanks,
>> David
>>
>> On 7/07/2015 6:47 AM, Daniel D. Daugherty wrote:
>>> On 7/6/15 12:43 AM, David Holmes wrote:
>>>> Hi Dan,
>>>>
>>>> Metacomment: Could I suggest that in a RFR your subject line has the
>>>> form <bugid>: <synopsis>, rather than <synopsis>(bugid) - especially
>>>> when the synopsis actually starts with a bug id :) Thanks.
>>>
>>> Yup that bug synopsis is/was a mess:
>>>
>>> Changed the synopsis line to :
>>>
>>>      thread dump improvements, comment additions, new diagnostics
>>> inspired by 8077392
>>>
>>> Also updated the e-mail thread subject line to your suggested format.
>>> Will try to remember to switch to that style in the future.
>>>
>>>
>>>> Mostly okay but some major concerns around the can of worms that
>>>> JavaThreadExitReleasesMonitors opens up.
>>>
>>> This is Java monitors so "can of worms" is familiar! :-)
>>>
>>>
>>>>
>>>> On 4/07/2015 10:14 AM, Daniel D. Daugherty wrote:
>>>>> Greetings,
>>>>>
>>>>> The hunt for the following bug:
>>>>>
>>>>>      JDK-8077392 Stream fork/join tasks occasionally fail to complete
>>>>>
>>>>> and David C's work on the following bug:
>>>>>
>>>>>      JDK-8069412 Locks need better debug-printing support
>>>>>
>>>>> have inspired additional thread dump improvements, comment additions
>>>>> to some Java monitor code and some new diagnostic options.
>>>>
>>>> src/cpu/x86/vm/macroAssembler_x86.cpp
>>>>
>>>> 1821        // update _owner from from BasicLock to thread
>>>>
>>>> Typo: from from
>>>
>>> Will fix.
>>>
>>>
>>>> 2091        // the placeholder value. If that didn't work, then another
>>>> 2092        // grabbed the lock so we're done (and exit was a success).
>>>>
>>>> another -> another thread ?
>>>
>>> Yes, "another thread" is better. Will fix.
>>>
>>>
>>>> 2201       // If that didn't work, then another grabbed the lock
>>>> 2202       // so we're done (and exit was a success).
>>>>
>>>> another -> another thread ?
>>>
>>> Yes, "another thread" is better. Will fix.
>>>
>>>
>>>> ---
>>>>
>>>> src/share/vm/oops/markOop.cpp
>>>>
>>>>   40       st->print("NULL");  // this should not happen
>>>>
>>>> Shouldn't that be an assert or guarantee in that case? Or at a minimum
>>>> print something like: "NULL (but you should never see this!)"
>>>
>>> This is a relocated line from David C's change:
>>>
>>> old 49:         st->print("monitor=NULL");
>>>
>>> I simply added a comment to it. I don't think I like the idea of the
>>> thread dump code assert()'ing or guarantee()'ing because there might
>>> be other useful info that would have been printed after the assert()
>>> or guarantee() failure.
>>>
>>> I do like the idea of changing the message to:
>>>
>>> 40:       st->print("NULL (this should never be seen!)");
>>>
>>>
>>>> 42       BasicLock * bl = (BasicLock *) mon->owner();
>>>>
>>>> What information has told us this is a BasicLock* not a Thread* ? But
>>>> as all you do is print bl why not just p2i(mon->owner()) ?
>>>
>>> Again, this is a relocated line from David C's change:
>>>
>>> old 51:         BasicLock * bl = (BasicLock *) mon->owner();
>>>
>>> I'm going to guess that David C had some other code in there before
>>> that needed/assumed that the field is a "BasicLock *", but that code
>>> didn't make it into his final version.
>>>
>>> I will drop the 'bl' local and update the p2i() call to use
>>> mon->owner() directly.
>>>
>>>
>>>> ---
>>>>
>>>> src/share/vm/runtime/globals.hpp
>>>>
>>>> JavaThreadExitReleasesMonitors has me perplexed. The JNI DetachThread
>>>> case seems to be a concession to badly written code that fails to
>>>> release locks on error paths.
>>>
>>> I don't think I would call the JNIDetachReleasesMonitors case a
>>> concession
>>> to badly written JNI code. 6282335 is very clear that this code was
>>> added
>>> to meet the JNI spec. The JNI spec words might be considered a
>>> concession...
>>>
>>> http://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/invocation.html#DetachCurrentThread
>>>
>>>
>>>
>>>  > DetachCurrentThread
>>>  >
>>>  > jint DetachCurrentThread(JavaVM *vm);
>>>  >
>>>  > Detaches the current thread from a Java VM. All Java monitors
>>>  > held by this thread are released. All Java threads waiting for
>>>  > this thread to die are notified.
>>>  >
>>>  > The main thread can be detached from the VM.
>>>
>>>
>>>
>>>> The equivalent for a Java thread would again involve lock acquisition
>>>> via JNI. But I don't recall ever seeing a bug report or RFE related to
>>>> this. What seems to have motivated this new flag is the potential for
>>>> a broken VM to actually leave the monitor locked (e.g. because
>>>> compiled code failed to do the unlock on all paths).
>>>
>>> 8077392 and bugs like it are exactly the reason for adding these
>>> two options.
>>>
>>> JavaThreadExitReleasesMonitors is added to provide a work around
>>> until the underlying bug or bugs are fixed. This option is targeted
>>> at folks that are chasing their own bugs and don't care about this
>>> one.
>>>
>>> GuaranteeOnMonitorMismatch is added to help hunt 8077392 specifically
>>> and to sanity check for similar issues generally. I can see using the
>>> two options in combination with existing test suites as a stress test
>>> for the Java monitor subsystem.
>>>
>>>
>>>> To address that I'd be inclined to simply have a guarantee in the Java
>>>> thread exit path, rather than a flag to do the clean up and another
>>>> flag to control a guarantee.
>>>
>>> The check for an unexited Java monitor is not cheap.
>>>
>>> The existing check under the JNIDetachReleasesMonitors flag is
>>> positioned to only affect threads using JNI DetachCurrentThread()
>>> and that cost is justified as part of meeting the JNI spec.
>>>
>>> Adding the unexited Java monitor check to all exiting Java
>>> threads should not be done lightly due to the cost. If the
>>> VM is functioning correctly, then why impose this cost on
>>> all exiting Java threads?
>>>
>>>
>>>> I don't think two new command-line options (has this gone through CCC
>>>> yet?) are justified.
>>>
>>> No, this has not gone through CCC yet. My preference is to get
>>> code review feedback (like this) before submitting a CCC.
>>>
>>> I suppose I could have simply added more sub-options to the
>>> existing Java monitor subsystem knobs and switches, but it
>>> seemed to make sense to model JavaThreadExitReleasesMonitors
>>> on the existing JNIDetachReleasesMonitors.
>>>
>>>
>>>> Also note that allowing for JVM induced missing unlocks is something
>>>> that can affect JNI attached threads as well as regular Java threads.
>>>> I'll take this up in discussing thread.cpp below.
>>>>
>>>> That aside, the comment, which was modelled on the JNI DetachThread
>>>> variant, does not really work in this case:
>>>>
>>>> 2801           "JavaThread exit() releases monitors owned by thread")
>>>>          \
>>>>
>>>> JavaThread is not a programming concept but an internal JVM
>>>> abstraction. I would suggest something like:
>>>>
>>>> "Java thread termination releases monitors unexpectedly still owned by
>>>> thread"
>>>
>>> I'll switch to your description text if we end up keeping the
>>> new options.
>>>
>>>
>>>> ---
>>>>
>>>> src/share/vm/runtime/objectMonitor.cpp
>>>> src/share/vm/runtime/objectMonitor.hpp
>>>>
>>>> No comments
>>>>
>>>> ---
>>>>
>>>> src/share/vm/runtime/synchronizer.cpp
>>>>
>>>> If we find the unexpected locked monitor it would be useful to see
>>>> more Java level information about the Thread (in the non-detaching
>>>> case) and the object that is locked.
>>>
>>> I can probably steal some code from the thread dump side to
>>> give us more info about the Object associated with the Java
>>> monitor...
>>>
>>>
>>>> ---
>>>>
>>>>  src/share/vm/runtime/thread.cpp
>>>>
>>>> As noted previously JNI attached threads can also be affected by JVM
>>>> induced locking bugs. So the comment block:
>>>>
>>>> 1804   // 6282335 JNI DetachCurrentThread spec states that all Java
>>>> monitors
>>>> 1805   // held by this thread must be released.  A detach operation
>>>> must only
>>>> 1806   // get here if there are no Java frames on the stack.
>>>> Therefore, any
>>>> 1807   // owned monitors at this point MUST be JNI-acquired monitors
>>>> which are
>>>> 1808   // pre-inflated and in the monitor cache.
>>>>
>>>> is not strictly correct as it is presuming a correctly operating JVM.
>>>
>>> I actually disagree with the last two sentences in that comment block,
>>> but I didn't have a good rewrite in mind. I'll think about it more.
>>>
>>>
>>>
>>>> Further the logic now applied:
>>>>
>>>> 1816   if ((exit_type == jni_detach && JNIDetachReleasesMonitors) ||
>>>> 1817       JavaThreadExitReleasesMonitors) {
>>>>
>>>> is also not "correct" because if we have
>>>> JavaThreadExitReleasesMonitors true while JNIDetachReleasesMonitors is
>>>> false, we will in fact release any JNI acquired monitors because we
>>>> can't make the distinction.
>>>
>>> I don't think JNIDetachReleasesMonitors only has to deal with
>>> JNI-acquired
>>> monitors and I don't think that JavaThreadExitReleasesMonitors only
>>> has to
>>> deal with Java code acquired monitors. I see these options as
>>> applying to
>>> "how you got to the Java thread exit path" and not applying to "what
>>> needs
>>> to be cleaned up".
>>>
>>> I see JNIDetachReleasesMonitors as a way of cleaning up unexited Java
>>> monitors when we're executing the JNI DetachCurrentThread() code path.
>>> I don't really care why the Java monitors are unexited... and I think
>>> the spec wording is clear that all unexited Java monitors are cleaned
>>> up... not just JNI-acquired monitors.
>>>
>>> I see JavaThreadExitReleasesMonitors as a way of cleaning up unexited
>>> Java monitors in any code path where we are exiting the Java thread.
>>> I see
>>> JavaThreadExitReleasesMonitors as a superset of
>>> JNIDetachReleasesMonitors.
>>>
>>>
>>>> Given we can't make the distinction I don't see any way for these
>>>> separate flags to actually work well together.
>>>
>>> I think they work well together when JavaThreadExitReleasesMonitors
>>> is seen as a superset of JNIDetachReleasesMonitors. Of course, that
>>> hinges on my opinion that JNIDetachReleasesMonitors applies to any
>>> unexited Java monitor and not just JNI-acquired Java monitors.
>>>
>>>
>>>> As I said above I'm more inclined to just add a guarantee to the
>>>> non-detach case, than add two new flags. (Even though this means there
>>>> won't be detection for such bugs when JNI attached threads encounter
>>>> them - including the main thread :( ).
>>>
>>> I really don't want to add this check to all Java thread exit
>>> paths since it is expensive.
>>>
>>>
>>>> As I said this has me perplexed. :(
>>>
>>> Hopefully, I've explained it better now.
>>>
>>>
>>>> ---
>>>>
>>>> src/share/vm/runtime/vframe.cpp
>>>>
>>>> 170       // It would be nice to distinguish between "waiting on" and
>>>> 171       // "waited on". Currently, "waiting on" here with a
>>>> 172       // java.lang.Thread.State == "WAITING (on object monitor)"
>>>> 173       // earlier in the output means that the monitor has not yet
>>>> been
>>>> 174       // notified and java.lang.Thread.State == "BLOCKED (on object
>>>> 175       // monitor)" earlier in the output means that the monitor has
>>>> 176       // been notified and the thread is blocked on reentry.
>>>>
>>>> That's a very long sentence that can be quite difficult to parse and
>>>> could probably be reworded to describe how the output should be
>>>> interpreted eg:
>>>>
>>>> // If earlier in the output we reported java.lang.Thread.State ==
>>>> // "WAITING (on object monitor)" and now we report "waited on", then
>>>> // we are still waiting for notification [or timeout?]. Otherwise if
>>>> // we earlier reported java.lang.Thread.State == "BLOCKED (on object
>>>> // monitor)", then we are actually waiting to reacquire the monitor
>>>> // lock. At this level we can't distinguish the two cases to report
>>>> // "waited on" rather than "waiting on" for the second case.
>>>
>>> I'll take a look at rewriting the comment and may just take your
>>> version entirely.
>>>
>>>
>>>> I wonder if we can prod deeper into VM state to try and make the
>>>> distinction?
>>>
>>> I played around with doing just that, but it seems like the M&M
>>> code that fetches similar state probes the state stored in the
>>> java.lang.Thread object. I didn't want to go there.
>>>
>>>
>>>>
>>>> 184       } else {
>>>> 185         st->print_cr("\t- %s <no locals available>", wait_state);
>>>>
>>>> The concept of "locals" is confusing here. Isn't the "local" in this
>>>> case just the object that we have waited upon? In which case we should
>>>> say "no object reference available".
>>>
>>> I got the phrase "locals" from the compiler code that was trying to
>>> decode the object reference. I can change to "no object reference
>>> available".
>>> I definitely want to say _something_ here. The existing code is just
>>> silent.
>>>
>>>
>>>> Though I would have thought/hoped there must be a way to find the
>>>> object reference even in a compiled frame?
>>>
>>> Yes, if you deopt... I didn't want to go there.
>>>
>>>
>>>> 195   // Print out all monitors that we have locked, are trying to lock
>>>> 196   // or are trying to relock after a wait().
>>>>
>>>> that makes it sound like there are three cases when really a "relock"
>>>> is just a specific case of lock. I would suggest:
>>>>
>>>>    // Print out all monitors that we have locked, or are trying to
>>>>    // lock, including re-locking when returning from a wait().
>>>
>>> Making it more clear that the "re-locking" is coming from "wait()"
>>> is an improvement. I'll fix this.
>>>
>>>
>>>> 252             lock_state = "waiting to relock";
>>>>
>>>> I prefer "waiting to re-lock in wait()", if that isn't too long.
>>>> Otherwise "relock" needs to be a concept people are familiar with.
>>>
>>> I like "waiting to re-lock in wait()".
>>>
>>>
>>>> 260         if (VerboseStackTrace && mark != NULL) {
>>>> 261           st->print("\t- lockbits=");
>>>> 262           mark->print_on(st);
>>>>
>>>> This really doesn't seem to warrant a new diagnostic flag, regardless
>>>> of whether we think applying -XX:+Verbose is a good fit or not.
>>>
>>> I suspect that David C didn't want to change the default output format
>>> and neither do I. We have some tests that try to parse out specific
>>> things from thread dumps to verify certain bug fixes and we don't want
>>> to break those. I'm thinking of the "duplicate owner" bug fix that we
>>> fixed in thread dumps, but I can't pull up the bug ID (yet)...
>>>
>>> Thanks for the thorough review. I have to say that I wasn't expecting
>>> much comment on this review at all. :-)
>>>
>>>
>>> Dan
>>>
>>>
>>>>
>>>> Thanks,
>>>> David
>>>> -------
>>>>
>>>>> This work is being tracked by the following bug ID:
>>>>>
>>>>>      JDK-8130448 8077392 inspired thread dump improvements, comment
>>>>>                  additions, new diagnostics
>>>>>      https://bugs.openjdk.java.net/browse/JDK-8130448
>>>>>
>>>>> Here is the webrev URL:
>>>>>
>>>>> http://cr.openjdk.java.net/~dcubed/8130448-webrev/0-jdk9-hs-rt/
>>>>>
>>>>> Testing:
>>>>>
>>>>> - RBT vm.quick batches (in process)
>>>>> - JPRT test jobs
>>>>>
>>>>> Thanks, in advance, for any comments, questions or suggestions.
>>>>>
>>>>> Gory details about the changes are below...
>>>>>
>>>>> Dan
>>>>>
>>>>> 8130448 summary of changes:
>>>>>
>>>>> src/cpu/x86/vm/macroAssembler_x86.cpp
>>>>>    - comment additions for the assembly code
>>>>>
>>>>> src/share/vm/oops/markOop.cpp
>>>>>    - has_monitor() has to be checked before is_locked() since
>>>>>      the has_monitor() bits are a superset of the is_locked() bits
>>>>>    - code style fixes
>>>>>
>>>>> src/share/vm/runtime/globals.hpp
>>>>>    - add VerboseStackTrace diagnostic option
>>>>>    - add GuaranteeOnMonitorMismatch diagnostic option
>>>>>    - add JavaThreadExitReleasesMonitors product option;
>>>>>      this is the Java equivalent of JNIDetachReleasesMonitors
>>>>>
>>>>> src/share/vm/runtime/objectMonitor.cpp
>>>>>    - delete unused ObjectMonitor::try_enter()
>>>>>    - fix assert wording
>>>>>
>>>>> src/share/vm/runtime/objectMonitor.hpp
>>>>>    - delete unused ObjectMonitor::try_enter()
>>>>>
>>>>> src/share/vm/runtime/synchronizer.cpp
>>>>>    - add GuaranteeOnMonitorMismatch support with some
>>>>>      diagnostic output
>>>>>
>>>>> src/share/vm/runtime/thread.cpp
>>>>>    - add JavaThreadExitReleasesMonitors support
>>>>>
>>>>> src/share/vm/runtime/vframe.cpp
>>>>>    - clarify existing comments
>>>>>    - add comments to clarify what "waiting on" means
>>>>>    - add line to show when we are "waiting on", but
>>>>>      there are no locals available (previously we were
>>>>>      "waiting on" silently)
>>>>>    - add support for "waiting to relock" which can happen
>>>>>      for any of the monitors and not just the top monitor
>>>>>    - switch mark->print_on() verbose output to new
>>>>>      VerboseStackTrace switch; thread dumps are not enabled
>>>>>      with a specific switch so the '-XX:+Verbose' model of
>>>>>      being a modifier for another option does not fit
>>>
>