RFR: 8027429: Add diagnostic command VM.info to get hs_err print-out

Tue Nov 3 15:25:15 UTC 2015

Hi,

I've done a bit of refactoring with this change and have tested it with 
and without intentional crashes to the VM.info reporting.  If it crashes 
within VM.info, there's a stack trace that clearly shows where it comes 
from in the hs_err file.  And the hs_err_pid file is correctly 
produced.  If we can make the jcmd come with a warning that it *could* 
potentially crash the vm, I think this feature is a good addition for 
sustaining.

Without crashing, VM.info gives a lot of information that we normally 
give with the hs_err file.  The reason for the careful STEP'ping in 
hs_err production is that the JVM has already crashed so we can't count 
on anything.  The VM.info command is designed to be run on a moderately 
stable JVM so is less likely to crash on the things that we're printing out.

I think we should fix the Threads walking code that Fred pointed out to 
take the Threads_lock, but I don't think we should *copy* code for 
VM.info or copy print_on_error functions.

Thanks,
Coleen

On 11/3/15 10:18 AM, Bertrand Delsart wrote:
> On 03/11/2015 15:06, Bertrand Delsart wrote:
>> Thanks Thomas,
>>
>> I fully agree it can work. The problem is that it may also fail if we
>> are not careful.
>>
>> Fred noticed for instance that "Threads::print_on_error" is not
>> protected by _vm_info_cmd (step 290). This is an obvious source of
>> potential crashes if you take into account the fact that the
>> print_on_error methods were designed not to use locks.
>
> Fred just pointed out that this may be OK since _thread is NULL.
>
> I'm not sure of why we do not print the threads when there is notion 
> of current thread (and I would wonder whether we should print them for 
> VMInfo) but luckily that may have been sufficient to prevent the crash.
>
> However, I still do not like it from a design point of view for the 
> other reasons stated in my e-mail.
>
> Bertrand.
>
>>
>> It seems that the focus has been on removing what was clearly useless,
>> not what might be dangerous due to how the method (and what it calls)
>> has been designed.
>>
>> In addition, adding _vm_info_cmd is IMHO not sufficient unless you pass
>> that argument to all layers. This is the biggest design issue. For
>> instance all calls to print_on_error could have this additional argument
>> so that the different components can adapt their behavior (and are aware
>> there are now two use cases). I would start by Threads::print_on_error,
>> grabbing the threads lock in the info case, but the same if true for all
>> methods called by VMError::report. I would go as far as adding a comment
>> in all print_on_error methods stating why the method is safe for both
>> use cases (e.g. why no locks are needed) or adding
>> "assert(!vm_info_cmd)" for methods VMError:report is supposed to skip.
>> The current design, where the various components may not be aware that
>> they are called for two different purposes, is not robust enough.
>>
>> Note also that being 'robust enough for the "crash" case' is not a
>> strong enough guarantee for a general purpose info DCmd... particularly
>> if, as you point out, an error that occurs in one of the info STEP
>> crashes the VM instead of just bringing us to the next error reporting
>> STEP. Now, as stated before, I could agree if we use this DCmd only when
>> a crash is acceptable.
>>
>> Regards,
>>
>> Bertrand
>>
>> On 02/11/2015 14:32, Thomas Stüfe wrote:
>>> Hi all,
>>>
>>> we have implemented in the SAP JVM the same mechanism. Our
>>> implementation is in use since a number of years and is quite 
>>> useful. So
>>> I would agree with Coleen.
>>>
>>> However, there are some things to keep in mind when doing this:
>>>
>>> Error handling is different between the "info-only" and "crash" 
>>> case. In
>>> "info-only" case, VMError::report() is called without enclosing
>>> VMError::report_and_die() and we run with the normal signal handler
>>> intact, not with the secondary signal handler. So, normal signal
>>> handling works, which is good. But it also means that any error inside
>>> one of the error reporting STEPs will bring down the VM, instead of 
>>> just
>>> bringing us to the next error reporting STEP ("Error occurred during
>>> error reporting").
>>>
>>> The problem is that a number of error reporting steps are a bit risky
>>> and written with the assurance that the worst which can happen is that
>>> the STEP will be interrupted. These risky steps should not be executed
>>> in the "only-info" case.
>>>
>>> But the proposed change already does this, it introduces "_vm_info_cmd"
>>> and conditionally excludes dangerous or pointless error reporting STEPs
>>> (e.g. callstack is pointless).
>>>
>>> As far as I can see, the other concerns concern only the "crash" case
>>> and simply do not apply to the "info-only" case. If an error reporting
>>> step is robust enough for the "crash" case (low stack usage, not
>>> allocating memory to avoid deadlocks etc), it should be safe enough for
>>> the "info" case.
>>>
>>> Kind Regards, Thomas
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Mon, Nov 2, 2015 at 12:31 PM, Mattis Castegren
>>> <mattis.castegren at oracle.com <mailto:mattis.castegren at oracle.com>> 
>>> wrote:
>>>
>>>     Hi
>>>
>>>     Of course, when I wrote "Review and push this fix" I meant doing a
>>>     thorough technical review. We do not want VM.info to be unsafe, and
>>>     I will let David reply to the technical questions that has been
>>>     raised. I know he did consider some of these issues, but last week
>>>     he was on Java one, and this week he is out. I will let him comment
>>>     once he gets back.
>>>
>>>     Kind Regards
>>>     /Mattis
>>>
>>>     -----Original Message-----
>>>     From: Bertrand Delsart
>>>     Sent: den 2 november 2015 10:13
>>>     To: David Holmes; Mattis Castegren; Coleen Phillimore;
>>>     hotspot-runtime-dev at openjdk.java.net
>>>     <mailto:hotspot-runtime-dev at openjdk.java.net>
>>>     Subject: Re: RFR: 8027429: Add diagnostic command VM.info to get
>>>     hs_err print-out
>>>
>>>     On 02/11/2015 02:45, David Holmes wrote:
>>>      > On 30/10/2015 8:38 PM, Mattis Castegren wrote:
>>>      >> Hi
>>>      >>
>>>      >> I agree that there may be things we could add to VM.info 
>>> that we
>>>      >> couldn't add in an hs_err file due to the fact that we are
>>> crashing.
>>>      >> However, I really do believe that we should use the same code.
>>> The
>>>      >> goal of the hs_err file is exactly the same as the goal of
>>> VM.info,
>>>      >> to print as much information about the system as possible so 
>>> that
>>>      >> when we look at an hs_err file or the output from VM.info, we
>>> get as
>>>      >> much information about the system as possible. That way we can
>>>     reduce
>>>      >> the need to go back and forth with the customer to ask for more
>>>     information.
>>>      >>
>>>      >> If we enumerate what we need in VM.info, the list would look 
>>> very
>>>      >> much like what we already do in VMError, and therefore I 
>>> think we
>>>      >> should use the same code. If we find something that we would
>>> want,
>>>      >> that is not currently printed in hs_err files, I think we have
>>>     two options:
>>>      >>
>>>      >> 1) Add it to hs_err files as well. If it is interesting when
>>>      >> gathering customer information before a crash, it probably is
>>>      >> interesting after a crash
>>>      >> 2) If it for some reason cannot be captured when crashing, add
>>> it to
>>>      >> the code with the condition that we don't print it during crash
>>>     time.
>>>      >>
>>>      >> I would therefore suggest the following plan of action
>>>      >>
>>>      >> 1) Review and push this fix to see if there are any technical
>>> issues
>>>      >> with regards to thread safety, etc. I know David has had 
>>> this in
>>>     mind
>>>      >> when doing the change. After that, we have a command that
>>>     support can
>>>      >> use.
>>>      >> 2) Think about if there is any information missing, both from
>>> hs_err
>>>      >> files and from VM.info. Here, we may want to ask support 
>>> about if
>>>      >> there is any additional information they need to ask for when
>>> they
>>>      >> get this data. If possible, this information should be added
>>> to both
>>>      >> hs_err files and VM.info, we don't want support to have to ask
>>> for
>>>      >> VM.info if we already have an hs_err file. Of course, if 
>>> there is
>>>      >> anything Dev would want to see when analyzing crashes in
>>> testing, we
>>>      >> should add that too.
>>>      >
>>>      > I have no issue with VMError and VM.Info using the same code
>>> (where
>>>      > appropriate), but I think it wrong to try and implement VM.Info
>>>      > directly using the actual VMError mechanism. It is crude and
>>> makes it
>>>      > difficult to make independent changes to either facility.
>>>      >
>>>      > Sorry.
>>>      >
>>>      > David
>>>
>>>     Furthermore, we cannot just "Review and push this fix to see if
>>>     there are any technical issues". Two reviewers have already stated
>>>     that there are issues and these issues must be addressed before
>>> pushing.
>>>
>>>     The answer cannot just be that we need this feature, it must 
>>> address
>>>     the technical issues that were reported.
>>>
>>>     One possible answer is that we all agree that VMInfo would be 
>>> unsafe.
>>>     The method would have to be renamed UnsafeVMInfo and it must be
>>>     clear that its 'impact' would be very risky, potentially crashing
>>>     the JVM.
>>>
>>>     If this is sufficient for your needs then we could agree with 
>>> it. My
>>>     only additional requirement would be to clearly mark that the 
>>> method
>>>     has been called so that we do not waste time on non-issues if
>>>     calling VMInfo leads to a crash (or, even worse but less likely, to
>>>     a silent data corruption that may lead to problems later). The bare
>>>     minimum would be to double check that a crash during an 
>>> execution of
>>>     VM.info is clearly identifiable (this is worth double checking
>>>     because of the use of static buffers for the error reporting, see
>>>     below).
>>>
>>>     If an 'unsafe' VMInfo DCmd is not sufficient, you should wonder
>>>     whether a safe version of it is still a "small enhancement". A JEP
>>>     may be required.
>>>
>>>     VMError::report and the methods it calls (print_on_error) have been
>>>     written with assumptions clearly stated in the comments above
>>>     VMError::report (see below the two most important ones). HotSpot
>>>     error reporting strategy differed from JRockit. This explains why
>>>     the JRockit approach cannot be applied as is.
>>>
>>>     I agree with David that you can try leveraging some of the logic
>>>     from VMError::report. However, to be sure there is no
>>>     misunderstanding, let's be clearer about David's "(where
>>>     appropriate)": we would for instance have to review all the called
>>>     methods, check which ones purposefully ignore locking (potentially
>>>     crashing) and provide alternate code for the unsafe ones (which may
>>>     not be trivial in some cases). As a simple example, a method called
>>>     "print_on_error" should not be called when we are not in an
>>>     erroneous condition. However, in some components of the JVM, the 
>>> low
>>>     level implementations of 'print_on_error' and 'print_info_safely'
>>>     could share the same code.
>>>
>>>     Regards,
>>>
>>>     Bertrand.
>>>
>>>     Here are the important comments about VMError::report assumptions:
>>>
>>>     // In general, a hang in error handler is much worse than a 
>>> crash or
>>>     internal
>>>     // error, as it's harder to recover from a hang. Deadlock can 
>>> happen
>>>     if we
>>>     // try to grab a lock that is already owned by current thread, or
>>> if the
>>>     // owner is blocked forever (e.g. in os::infinite_sleep()). If
>>>     possible, the
>>>     // error handler and all the functions it called should avoid
>>>     grabbing any
>>>     // lock. An important thing to notice is that memory allocation
>>> needs a
>>>     lock.
>>>     //
>>>     // We should avoid using large stack allocated buffers. Many errors
>>>     happen
>>>     // when stack space is already low. Making things even worse is 
>>> that
>>>     there
>>>     // could be nested report_and_die() calls on stack (see above).
>>> Only one
>>>     // thread can report error, so large buffers are statically
>>> allocated in
>>>     data
>>>     // segment.
>>>
>>>
>>>
>>>      >
>>>      >
>>>      >> Kind Regards
>>>      >> /Mattis
>>>      >>
>>>      >> -----Original Message-----
>>>      >> From: David Holmes
>>>      >> Sent: den 29 oktober 2015 23:20
>>>      >> To: Mattis Castegren; Coleen Phillimore;
>>>      >> hotspot-runtime-dev at openjdk.java.net
>>>     <mailto:hotspot-runtime-dev at openjdk.java.net>
>>>      >> Subject: Re: RFR: 8027429: Add diagnostic command VM.info to 
>>> get
>>>      >> hs_err print-out
>>>      >>
>>>      >> On 29/10/2015 11:36 PM, Mattis Castegren wrote:
>>>      >>> Some background: We have this command in JRockit. The
>>>     information you
>>>      >>> gather when you crash to give a good summary of what system
>>> you run
>>>      >>> on is pretty much exactly the information you need to get a 
>>> good
>>>      >>> summary on a system that has not crashed. The JRockit 
>>> command is
>>>      >>> extremely useful for support, and saves a lot of work going
>>>     back and
>>>      >>> forth asking about system information.
>>>      >>>
>>>      >>> Also, if we write something new in the hs_err file, like if
>>>     there has
>>>      >>> been any out of memory errors, we often would want the same
>>>      >>> information in the VM.info output. From my experience in
>>>      >>> Sustaining/Support, I can't think of any information you
>>> would want
>>>      >>> in VM.info that you wouldn't also want in the hs_err file and
>>> the
>>>      >>> other way around, apart from details about the crash
>>> (obviously).
>>>      >>>
>>>      >>> I don't see a reason to exactly enumerate what information
>>> VM.info
>>>      >>> should provide. From a sustaining/support perspective, we 
>>> want a
>>>      >>> one-stop command to gather as much useful information as
>>> possible,
>>>      >>> which is the same idea we have for the hs_err file.
>>>      >>
>>>      >> The reason to enumerate what is required is to see where that
>>>      >> information already exists and can be collected from. The 
>>> VMError
>>>      >> report has to be very careful about what it does and how it
>>> does it
>>>      >> because of the fact we may have crashed and the general process
>>>     state
>>>      >> is indeterminate. A Dcmd can simply gather whatever
>>> information is
>>>      >> required from the available sources in whatever order or format
>>>     desired.
>>>      >>
>>>      >> I have no problem with this command, just how it has been
>>>     proposed to
>>>      >> implement it.
>>>      >>
>>>      >> David
>>>      >>
>>>      >>> Kind Regards
>>>      >>> /Mattis
>>>      >>>
>>>      >>> -----Original Message-----
>>>      >>> From: David Holmes
>>>      >>> Sent: den 29 oktober 2015 13:08
>>>      >>> To: Coleen Phillimore; hotspot-runtime-dev at openjdk.java.net
>>>     <mailto:hotspot-runtime-dev at openjdk.java.net>
>>>      >>> Subject: Re: RFR: 8027429: Add diagnostic command VM.info 
>>> to get
>>>      >>> hs_err print-out
>>>      >>>
>>>      >>> On 29/10/2015 10:02 PM, Coleen Phillimore wrote:
>>>      >>>>
>>>      >>>>
>>>      >>>> On 10/29/15 7:58 AM, David Holmes wrote:
>>>      >>>>> On 29/10/2015 9:26 PM, Coleen Phillimore wrote:
>>>      >>>>>>
>>>      >>>>>> So I actually disagree.  I don't think there should be an
>>>      >>>>>> additional separate mechanism to get the same information
>>>     that we
>>>      >>>>>> get with hs_err reporting.  I've wanted to have a feature
>>> like
>>>      >>>>>> this for a long time.
>>>      >>>>>>
>>>      >>>>>> I pre-reviewed this change and I thought it looked good in
>>>     general.
>>>      >>>>>> I didn't see the thread iteration problem that Fred 
>>> refers to
>>>      >>>>>> below, but I think the individual problems can be fixed.
>>>      >>>>>>
>>>      >>>>>> The last thing I want is this code to be copied somewhere
>>> else.
>>>      >>>>>
>>>      >>>>> Factored as needed not copied. VMError is not an "info"
>>> reporting
>>>      >>>>> mechanism.
>>>      >>>>
>>>      >>>> If you look at the things that are reported in each "STEP",
>>>     there's a
>>>      >>>> small amount of code and the order is important.
>>>      >>>
>>>      >>> The order is less important in an "info" request I would 
>>> think.
>>>      >>>
>>>      >>>> I'd like the vm info to use the same order and report what it
>>>     can do
>>>      >>>> safely.   Refactoring 5 lines of code into functions doesn't
>>> make
>>>      >>>> sense.
>>>      >>>
>>>      >>> I need to consider exactly what it is the "info" needs to
>>> report in
>>>      >>> more detail. There are existing facilities (system properties,
>>>      >>> management
>>>      >>> APIs) for various bits of runtime information, which VMError
>>> can't
>>>      >>> utilize but a Dcmd can.
>>>      >>>
>>>      >>> David
>>>      >>>
>>>      >>>> Coleen
>>>      >>>>
>>>      >>>>>
>>>      >>>>> David
>>>      >>>>>
>>>      >>>>>> Coleen
>>>      >>>>>>
>>>      >>>>>> On 10/28/15 8:48 PM, David Holmes wrote:
>>>      >>>>>>> I agree with Fred, this kind of info reporting should 
>>> not be
>>>      >>>>>>> piggy-backed onto VMError handling for the reasons stated
>>>     (and the
>>>      >>>>>>> error handling logic is complicated enough as it is!). For
>>>     things
>>>      >>>>>>> like thread lists there are already safe management
>>>     functions that
>>>      >>>>>>> can/should be used.
>>>      >>>>>>>
>>>      >>>>>>> Thanks,
>>>      >>>>>>> David H.
>>>      >>>>>>>
>>>      >>>>>>> On 29/10/2015 3:29 AM, Frederic Parain wrote:
>>>      >>>>>>>> Hi David,
>>>      >>>>>>>>
>>>      >>>>>>>> I haven't review all the code yet, but I'm concerned
>>> with the
>>>      >>>>>>>> fact that the diagnostic command is calling
>>> VMError::report().
>>>      >>>>>>>> This method has been implemented to be executed in the
>>>     particular
>>>      >>>>>>>> context of fatal errors, and its usage while the VM is
>>> running
>>>      >>>>>>>> normally seems dangerous.
>>>      >>>>>>>>
>>>      >>>>>>>> For instance, VMError::report() consciously avoids
>>>     grabbing locks
>>>      >>>>>>>> because of the risk of deadlock during the error 
>>> reporting.
>>>      >>>>>>>> The consequence is that some data structures are browsed
>>> in an
>>>      >>>>>>>> unsafe way. One example: VMError::report() calls
>>>      >>>>>>>> Threads::print_on_error() which iterates over the thread
>>> list
>>>      >>>>>>>> *without owning the Threads_lock*.
>>>      >>>>>>>>
>>>      >>>>>>>> The implementation of the diagnostic command seems 
>>> also to
>>>      >>>>>>>> exclude a lot of reporting from the initial
>>> VMError::report()
>>>      >>>>>>>> method. Have you considered implementing a new MT-safe
>>>     reporting
>>>      >>>>>>>> method rather than trying to modify the special
>>>     VMError::report()
>>>      >>>>>>>> methods? (Note that some code factorization between
>>>      >>>>>>>> VMError::report() and this new method should be 
>>> possible).
>>>      >>>>>>>>
>>>      >>>>>>>> Thanks,
>>>      >>>>>>>>
>>>      >>>>>>>> Fred
>>>      >>>>>>>>
>>>      >>>>>>>> On 28/10/2015 17:18, david buck wrote:
>>>      >>>>>>>>> Hi!
>>>      >>>>>>>>>
>>>      >>>>>>>>> Please review my change for this small enhancement.
>>>      >>>>>>>>>
>>>      >>>>>>>>> bug: https://bugs.openjdk.java.net/browse/JDK-8027429
>>>      >>>>>>>>> webrev: 
>>> http://cr.openjdk.java.net/~dbuck/8027429_jdk9_01/
>>>      >>>>>>>>>
>>>      >>>>>>>>> Cheers,
>>>      >>>>>>>>> -Buck
>>>      >>>>>>>>
>>>      >>>>>>
>>>      >>>>
>>>
>>>
>>>     --
>>>     Bertrand Delsart,                     Grenoble Engineering Center
>>>     Oracle,         180 av. de l'Europe,          ZIRST de Montbonnot
>>>     38330 Montbonnot Saint Martin, FRANCE
>>>     bertrand.delsart at oracle.com <mailto:bertrand.delsart at oracle.com>
>>>               Phone : +33 4 76 18 81 23
>>>
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>     NOTICE: This email message is for the sole use of the intended
>>>     recipient(s) and may contain confidential and privileged
>>>     information. Any unauthorized review, use, disclosure or
>>>     distribution is prohibited. If you are not the intended recipient,
>>>     please contact the sender by reply email and destroy all copies of
>>>     the original message.
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>
>>>
>>
>>
>
>