RFR: 8027429: Add diagnostic command VM.info to get hs_err print-out

Tue Nov 3 15:51:37 UTC 2015

Hi Betrand,

I see your point and cannot really argue with them. We did spent some time
to make our version of VmError::report() safe for the "info-only" use and
live with the remaining risk.

Kind Regards, Thomas

On Tue, Nov 3, 2015 at 3:06 PM, Bertrand Delsart <
bertrand.delsart at oracle.com> wrote:

> Thanks Thomas,
>
> I fully agree it can work. The problem is that it may also fail if we are
> not careful.
>
> Fred noticed for instance that "Threads::print_on_error" is not protected
> by _vm_info_cmd (step 290). This is an obvious source of potential crashes
> if you take into account the fact that the print_on_error methods were
> designed not to use locks.
>
> It seems that the focus has been on removing what was clearly useless, not
> what might be dangerous due to how the method (and what it calls) has been
> designed.
>
> In addition, adding _vm_info_cmd is IMHO not sufficient unless you pass
> that argument to all layers. This is the biggest design issue. For instance
> all calls to print_on_error could have this additional argument so that the
> different components can adapt their behavior (and are aware there are now
> two use cases). I would start by Threads::print_on_error, grabbing the
> threads lock in the info case, but the same if true for all methods called
> by VMError::report. I would go as far as adding a comment in all
> print_on_error methods stating why the method is safe for both use cases
> (e.g. why no locks are needed) or adding "assert(!vm_info_cmd)" for methods
> VMError:report is supposed to skip. The current design, where the various
> components may not be aware that they are called for two different
> purposes, is not robust enough.
>
> Note also that being 'robust enough for the "crash" case' is not a strong
> enough guarantee for a general purpose info DCmd... particularly if, as you
> point out, an error that occurs in one of the info STEP crashes the VM
> instead of just bringing us to the next error reporting STEP. Now, as
> stated before, I could agree if we use this DCmd only when a crash is
> acceptable.
>
> Regards,
>
> Bertrand
>
>
> On 02/11/2015 14:32, Thomas Stüfe wrote:
>
>> Hi all,
>>
>> we have implemented in the SAP JVM the same mechanism. Our
>> implementation is in use since a number of years and is quite useful. So
>> I would agree with Coleen.
>>
>> However, there are some things to keep in mind when doing this:
>>
>> Error handling is different between the "info-only" and "crash" case. In
>> "info-only" case, VMError::report() is called without enclosing
>> VMError::report_and_die() and we run with the normal signal handler
>> intact, not with the secondary signal handler. So, normal signal
>> handling works, which is good. But it also means that any error inside
>> one of the error reporting STEPs will bring down the VM, instead of just
>> bringing us to the next error reporting STEP ("Error occurred during
>> error reporting").
>>
>> The problem is that a number of error reporting steps are a bit risky
>> and written with the assurance that the worst which can happen is that
>> the STEP will be interrupted. These risky steps should not be executed
>> in the "only-info" case.
>>
>> But the proposed change already does this, it introduces "_vm_info_cmd"
>> and conditionally excludes dangerous or pointless error reporting STEPs
>> (e.g. callstack is pointless).
>>
>> As far as I can see, the other concerns concern only the "crash" case
>> and simply do not apply to the "info-only" case. If an error reporting
>> step is robust enough for the "crash" case (low stack usage, not
>> allocating memory to avoid deadlocks etc), it should be safe enough for
>> the "info" case.
>>
>> Kind Regards, Thomas
>>
>>
>>
>>
>>
>>
>>
>>
>> On Mon, Nov 2, 2015 at 12:31 PM, Mattis Castegren
>> <mattis.castegren at oracle.com <mailto:mattis.castegren at oracle.com>> wrote:
>>
>>     Hi
>>
>>     Of course, when I wrote "Review and push this fix" I meant doing a
>>     thorough technical review. We do not want VM.info to be unsafe, and
>>     I will let David reply to the technical questions that has been
>>     raised. I know he did consider some of these issues, but last week
>>     he was on Java one, and this week he is out. I will let him comment
>>     once he gets back.
>>
>>     Kind Regards
>>     /Mattis
>>
>>     -----Original Message-----
>>     From: Bertrand Delsart
>>     Sent: den 2 november 2015 10:13
>>     To: David Holmes; Mattis Castegren; Coleen Phillimore;
>>     hotspot-runtime-dev at openjdk.java.net
>>     <mailto:hotspot-runtime-dev at openjdk.java.net>
>>     Subject: Re: RFR: 8027429: Add diagnostic command VM.info to get
>>     hs_err print-out
>>
>>     On 02/11/2015 02:45, David Holmes wrote:
>>      > On 30/10/2015 8:38 PM, Mattis Castegren wrote:
>>      >> Hi
>>      >>
>>      >> I agree that there may be things we could add to VM.info that we
>>      >> couldn't add in an hs_err file due to the fact that we are
>> crashing.
>>      >> However, I really do believe that we should use the same code. The
>>      >> goal of the hs_err file is exactly the same as the goal of
>> VM.info,
>>      >> to print as much information about the system as possible so that
>>      >> when we look at an hs_err file or the output from VM.info, we get
>> as
>>      >> much information about the system as possible. That way we can
>>     reduce
>>      >> the need to go back and forth with the customer to ask for more
>>     information.
>>      >>
>>      >> If we enumerate what we need in VM.info, the list would look very
>>      >> much like what we already do in VMError, and therefore I think we
>>      >> should use the same code. If we find something that we would want,
>>      >> that is not currently printed in hs_err files, I think we have
>>     two options:
>>      >>
>>      >> 1) Add it to hs_err files as well. If it is interesting when
>>      >> gathering customer information before a crash, it probably is
>>      >> interesting after a crash
>>      >> 2) If it for some reason cannot be captured when crashing, add it
>> to
>>      >> the code with the condition that we don't print it during crash
>>     time.
>>      >>
>>      >> I would therefore suggest the following plan of action
>>      >>
>>      >> 1) Review and push this fix to see if there are any technical
>> issues
>>      >> with regards to thread safety, etc. I know David has had this in
>>     mind
>>      >> when doing the change. After that, we have a command that
>>     support can
>>      >> use.
>>      >> 2) Think about if there is any information missing, both from
>> hs_err
>>      >> files and from VM.info. Here, we may want to ask support about if
>>      >> there is any additional information they need to ask for when they
>>      >> get this data. If possible, this information should be added to
>> both
>>      >> hs_err files and VM.info, we don't want support to have to ask for
>>      >> VM.info if we already have an hs_err file. Of course, if there is
>>      >> anything Dev would want to see when analyzing crashes in testing,
>> we
>>      >> should add that too.
>>      >
>>      > I have no issue with VMError and VM.Info using the same code (where
>>      > appropriate), but I think it wrong to try and implement VM.Info
>>      > directly using the actual VMError mechanism. It is crude and makes
>> it
>>      > difficult to make independent changes to either facility.
>>      >
>>      > Sorry.
>>      >
>>      > David
>>
>>     Furthermore, we cannot just "Review and push this fix to see if
>>     there are any technical issues". Two reviewers have already stated
>>     that there are issues and these issues must be addressed before
>> pushing.
>>
>>     The answer cannot just be that we need this feature, it must address
>>     the technical issues that were reported.
>>
>>     One possible answer is that we all agree that VMInfo would be unsafe.
>>     The method would have to be renamed UnsafeVMInfo and it must be
>>     clear that its 'impact' would be very risky, potentially crashing
>>     the JVM.
>>
>>     If this is sufficient for your needs then we could agree with it. My
>>     only additional requirement would be to clearly mark that the method
>>     has been called so that we do not waste time on non-issues if
>>     calling VMInfo leads to a crash (or, even worse but less likely, to
>>     a silent data corruption that may lead to problems later). The bare
>>     minimum would be to double check that a crash during an execution of
>>     VM.info is clearly identifiable (this is worth double checking
>>     because of the use of static buffers for the error reporting, see
>>     below).
>>
>>     If an 'unsafe' VMInfo DCmd is not sufficient, you should wonder
>>     whether a safe version of it is still a "small enhancement". A JEP
>>     may be required.
>>
>>     VMError::report and the methods it calls (print_on_error) have been
>>     written with assumptions clearly stated in the comments above
>>     VMError::report (see below the two most important ones). HotSpot
>>     error reporting strategy differed from JRockit. This explains why
>>     the JRockit approach cannot be applied as is.
>>
>>     I agree with David that you can try leveraging some of the logic
>>     from VMError::report. However, to be sure there is no
>>     misunderstanding, let's be clearer about David's "(where
>>     appropriate)": we would for instance have to review all the called
>>     methods, check which ones purposefully ignore locking (potentially
>>     crashing) and provide alternate code for the unsafe ones (which may
>>     not be trivial in some cases). As a simple example, a method called
>>     "print_on_error" should not be called when we are not in an
>>     erroneous condition. However, in some components of the JVM, the low
>>     level implementations of 'print_on_error' and 'print_info_safely'
>>     could share the same code.
>>
>>     Regards,
>>
>>     Bertrand.
>>
>>     Here are the important comments about VMError::report assumptions:
>>
>>     // In general, a hang in error handler is much worse than a crash or
>>     internal
>>     // error, as it's harder to recover from a hang. Deadlock can happen
>>     if we
>>     // try to grab a lock that is already owned by current thread, or if
>> the
>>     // owner is blocked forever (e.g. in os::infinite_sleep()). If
>>     possible, the
>>     // error handler and all the functions it called should avoid
>>     grabbing any
>>     // lock. An important thing to notice is that memory allocation needs
>> a
>>     lock.
>>     //
>>     // We should avoid using large stack allocated buffers. Many errors
>>     happen
>>     // when stack space is already low. Making things even worse is that
>>     there
>>     // could be nested report_and_die() calls on stack (see above). Only
>> one
>>     // thread can report error, so large buffers are statically allocated
>> in
>>     data
>>     // segment.
>>
>>
>>
>>      >
>>      >
>>      >> Kind Regards
>>      >> /Mattis
>>      >>
>>      >> -----Original Message-----
>>      >> From: David Holmes
>>      >> Sent: den 29 oktober 2015 23:20
>>      >> To: Mattis Castegren; Coleen Phillimore;
>>      >> hotspot-runtime-dev at openjdk.java.net
>>     <mailto:hotspot-runtime-dev at openjdk.java.net>
>>      >> Subject: Re: RFR: 8027429: Add diagnostic command VM.info to get
>>      >> hs_err print-out
>>      >>
>>      >> On 29/10/2015 11:36 PM, Mattis Castegren wrote:
>>      >>> Some background: We have this command in JRockit. The
>>     information you
>>      >>> gather when you crash to give a good summary of what system you
>> run
>>      >>> on is pretty much exactly the information you need to get a good
>>      >>> summary on a system that has not crashed. The JRockit command is
>>      >>> extremely useful for support, and saves a lot of work going
>>     back and
>>      >>> forth asking about system information.
>>      >>>
>>      >>> Also, if we write something new in the hs_err file, like if
>>     there has
>>      >>> been any out of memory errors, we often would want the same
>>      >>> information in the VM.info output. From my experience in
>>      >>> Sustaining/Support, I can't think of any information you would
>> want
>>      >>> in VM.info that you wouldn't also want in the hs_err file and the
>>      >>> other way around, apart from details about the crash (obviously).
>>      >>>
>>      >>> I don't see a reason to exactly enumerate what information
>> VM.info
>>      >>> should provide. From a sustaining/support perspective, we want a
>>      >>> one-stop command to gather as much useful information as
>> possible,
>>      >>> which is the same idea we have for the hs_err file.
>>      >>
>>      >> The reason to enumerate what is required is to see where that
>>      >> information already exists and can be collected from. The VMError
>>      >> report has to be very careful about what it does and how it does
>> it
>>      >> because of the fact we may have crashed and the general process
>>     state
>>      >> is indeterminate. A Dcmd can simply gather whatever information is
>>      >> required from the available sources in whatever order or format
>>     desired.
>>      >>
>>      >> I have no problem with this command, just how it has been
>>     proposed to
>>      >> implement it.
>>      >>
>>      >> David
>>      >>
>>      >>> Kind Regards
>>      >>> /Mattis
>>      >>>
>>      >>> -----Original Message-----
>>      >>> From: David Holmes
>>      >>> Sent: den 29 oktober 2015 13:08
>>      >>> To: Coleen Phillimore; hotspot-runtime-dev at openjdk.java.net
>>     <mailto:hotspot-runtime-dev at openjdk.java.net>
>>      >>> Subject: Re: RFR: 8027429: Add diagnostic command VM.info to get
>>      >>> hs_err print-out
>>      >>>
>>      >>> On 29/10/2015 10:02 PM, Coleen Phillimore wrote:
>>      >>>>
>>      >>>>
>>      >>>> On 10/29/15 7:58 AM, David Holmes wrote:
>>      >>>>> On 29/10/2015 9:26 PM, Coleen Phillimore wrote:
>>      >>>>>>
>>      >>>>>> So I actually disagree.  I don't think there should be an
>>      >>>>>> additional separate mechanism to get the same information
>>     that we
>>      >>>>>> get with hs_err reporting.  I've wanted to have a feature like
>>      >>>>>> this for a long time.
>>      >>>>>>
>>      >>>>>> I pre-reviewed this change and I thought it looked good in
>>     general.
>>      >>>>>> I didn't see the thread iteration problem that Fred refers to
>>      >>>>>> below, but I think the individual problems can be fixed.
>>      >>>>>>
>>      >>>>>> The last thing I want is this code to be copied somewhere
>> else.
>>      >>>>>
>>      >>>>> Factored as needed not copied. VMError is not an "info"
>> reporting
>>      >>>>> mechanism.
>>      >>>>
>>      >>>> If you look at the things that are reported in each "STEP",
>>     there's a
>>      >>>> small amount of code and the order is important.
>>      >>>
>>      >>> The order is less important in an "info" request I would think.
>>      >>>
>>      >>>> I'd like the vm info to use the same order and report what it
>>     can do
>>      >>>> safely.   Refactoring 5 lines of code into functions doesn't
>> make
>>      >>>> sense.
>>      >>>
>>      >>> I need to consider exactly what it is the "info" needs to report
>> in
>>      >>> more detail. There are existing facilities (system properties,
>>      >>> management
>>      >>> APIs) for various bits of runtime information, which VMError
>> can't
>>      >>> utilize but a Dcmd can.
>>      >>>
>>      >>> David
>>      >>>
>>      >>>> Coleen
>>      >>>>
>>      >>>>>
>>      >>>>> David
>>      >>>>>
>>      >>>>>> Coleen
>>      >>>>>>
>>      >>>>>> On 10/28/15 8:48 PM, David Holmes wrote:
>>      >>>>>>> I agree with Fred, this kind of info reporting should not be
>>      >>>>>>> piggy-backed onto VMError handling for the reasons stated
>>     (and the
>>      >>>>>>> error handling logic is complicated enough as it is!). For
>>     things
>>      >>>>>>> like thread lists there are already safe management
>>     functions that
>>      >>>>>>> can/should be used.
>>      >>>>>>>
>>      >>>>>>> Thanks,
>>      >>>>>>> David H.
>>      >>>>>>>
>>      >>>>>>> On 29/10/2015 3:29 AM, Frederic Parain wrote:
>>      >>>>>>>> Hi David,
>>      >>>>>>>>
>>      >>>>>>>> I haven't review all the code yet, but I'm concerned with
>> the
>>      >>>>>>>> fact that the diagnostic command is calling
>> VMError::report().
>>      >>>>>>>> This method has been implemented to be executed in the
>>     particular
>>      >>>>>>>> context of fatal errors, and its usage while the VM is
>> running
>>      >>>>>>>> normally seems dangerous.
>>      >>>>>>>>
>>      >>>>>>>> For instance, VMError::report() consciously avoids
>>     grabbing locks
>>      >>>>>>>> because of the risk of deadlock during the error reporting.
>>      >>>>>>>> The consequence is that some data structures are browsed in
>> an
>>      >>>>>>>> unsafe way. One example: VMError::report() calls
>>      >>>>>>>> Threads::print_on_error() which iterates over the thread
>> list
>>      >>>>>>>> *without owning the Threads_lock*.
>>      >>>>>>>>
>>      >>>>>>>> The implementation of the diagnostic command seems also to
>>      >>>>>>>> exclude a lot of reporting from the initial
>> VMError::report()
>>      >>>>>>>> method. Have you considered implementing a new MT-safe
>>     reporting
>>      >>>>>>>> method rather than trying to modify the special
>>     VMError::report()
>>      >>>>>>>> methods? (Note that some code factorization between
>>      >>>>>>>> VMError::report() and this new method should be possible).
>>      >>>>>>>>
>>      >>>>>>>> Thanks,
>>      >>>>>>>>
>>      >>>>>>>> Fred
>>      >>>>>>>>
>>      >>>>>>>> On 28/10/2015 17:18, david buck wrote:
>>      >>>>>>>>> Hi!
>>      >>>>>>>>>
>>      >>>>>>>>> Please review my change for this small enhancement.
>>      >>>>>>>>>
>>      >>>>>>>>> bug: https://bugs.openjdk.java.net/browse/JDK-8027429
>>      >>>>>>>>> webrev: http://cr.openjdk.java.net/~dbuck/8027429_jdk9_01/
>>      >>>>>>>>>
>>      >>>>>>>>> Cheers,
>>      >>>>>>>>> -Buck
>>      >>>>>>>>
>>      >>>>>>
>>      >>>>
>>
>>
>>     --
>>     Bertrand Delsart,                     Grenoble Engineering Center
>>     Oracle,         180 av. de l'Europe,          ZIRST de Montbonnot
>>     38330 Montbonnot Saint Martin,                             FRANCE
>>     bertrand.delsart at oracle.com <mailto:bertrand.delsart at oracle.com>
>>               Phone : +33 4 76 18 81 23
>>
>>     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>     NOTICE: This email message is for the sole use of the intended
>>     recipient(s) and may contain confidential and privileged
>>     information. Any unauthorized review, use, disclosure or
>>     distribution is prohibited. If you are not the intended recipient,
>>     please contact the sender by reply email and destroy all copies of
>>     the original message.
>>     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>>
>>
>
> --
> Bertrand Delsart,                     Grenoble Engineering Center
> Oracle,         180 av. de l'Europe,          ZIRST de Montbonnot
> 38330 Montbonnot Saint Martin,                             FRANCE
> bertrand.delsart at oracle.com             Phone : +33 4 76 18 81 23
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> NOTICE: This email message is for the sole use of the intended
> recipient(s) and may contain confidential and privileged
> information. Any unauthorized review, use, disclosure or
> distribution is prohibited. If you are not the intended recipient,
> please contact the sender by reply email and destroy all copies of
> the original message.
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>