RFR: 8027429: Add diagnostic command VM.info to get hs_err print-out
Thomas Stüfe
thomas.stuefe at gmail.com
Tue Nov 3 15:51:37 UTC 2015
Hi Betrand,
I see your point and cannot really argue with them. We did spent some time
to make our version of VmError::report() safe for the "info-only" use and
live with the remaining risk.
Kind Regards, Thomas
On Tue, Nov 3, 2015 at 3:06 PM, Bertrand Delsart <
bertrand.delsart at oracle.com> wrote:
> Thanks Thomas,
>
> I fully agree it can work. The problem is that it may also fail if we are
> not careful.
>
> Fred noticed for instance that "Threads::print_on_error" is not protected
> by _vm_info_cmd (step 290). This is an obvious source of potential crashes
> if you take into account the fact that the print_on_error methods were
> designed not to use locks.
>
> It seems that the focus has been on removing what was clearly useless, not
> what might be dangerous due to how the method (and what it calls) has been
> designed.
>
> In addition, adding _vm_info_cmd is IMHO not sufficient unless you pass
> that argument to all layers. This is the biggest design issue. For instance
> all calls to print_on_error could have this additional argument so that the
> different components can adapt their behavior (and are aware there are now
> two use cases). I would start by Threads::print_on_error, grabbing the
> threads lock in the info case, but the same if true for all methods called
> by VMError::report. I would go as far as adding a comment in all
> print_on_error methods stating why the method is safe for both use cases
> (e.g. why no locks are needed) or adding "assert(!vm_info_cmd)" for methods
> VMError:report is supposed to skip. The current design, where the various
> components may not be aware that they are called for two different
> purposes, is not robust enough.
>
> Note also that being 'robust enough for the "crash" case' is not a strong
> enough guarantee for a general purpose info DCmd... particularly if, as you
> point out, an error that occurs in one of the info STEP crashes the VM
> instead of just bringing us to the next error reporting STEP. Now, as
> stated before, I could agree if we use this DCmd only when a crash is
> acceptable.
>
> Regards,
>
> Bertrand
>
>
> On 02/11/2015 14:32, Thomas Stüfe wrote:
>
>> Hi all,
>>
>> we have implemented in the SAP JVM the same mechanism. Our
>> implementation is in use since a number of years and is quite useful. So
>> I would agree with Coleen.
>>
>> However, there are some things to keep in mind when doing this:
>>
>> Error handling is different between the "info-only" and "crash" case. In
>> "info-only" case, VMError::report() is called without enclosing
>> VMError::report_and_die() and we run with the normal signal handler
>> intact, not with the secondary signal handler. So, normal signal
>> handling works, which is good. But it also means that any error inside
>> one of the error reporting STEPs will bring down the VM, instead of just
>> bringing us to the next error reporting STEP ("Error occurred during
>> error reporting").
>>
>> The problem is that a number of error reporting steps are a bit risky
>> and written with the assurance that the worst which can happen is that
>> the STEP will be interrupted. These risky steps should not be executed
>> in the "only-info" case.
>>
>> But the proposed change already does this, it introduces "_vm_info_cmd"
>> and conditionally excludes dangerous or pointless error reporting STEPs
>> (e.g. callstack is pointless).
>>
>> As far as I can see, the other concerns concern only the "crash" case
>> and simply do not apply to the "info-only" case. If an error reporting
>> step is robust enough for the "crash" case (low stack usage, not
>> allocating memory to avoid deadlocks etc), it should be safe enough for
>> the "info" case.
>>
>> Kind Regards, Thomas
>>
>>
>>
>>
>>
>>
>>
>>
>> On Mon, Nov 2, 2015 at 12:31 PM, Mattis Castegren
>> <mattis.castegren at oracle.com <mailto:mattis.castegren at oracle.com>> wrote:
>>
>> Hi
>>
>> Of course, when I wrote "Review and push this fix" I meant doing a
>> thorough technical review. We do not want VM.info to be unsafe, and
>> I will let David reply to the technical questions that has been
>> raised. I know he did consider some of these issues, but last week
>> he was on Java one, and this week he is out. I will let him comment
>> once he gets back.
>>
>> Kind Regards
>> /Mattis
>>
>> -----Original Message-----
>> From: Bertrand Delsart
>> Sent: den 2 november 2015 10:13
>> To: David Holmes; Mattis Castegren; Coleen Phillimore;
>> hotspot-runtime-dev at openjdk.java.net
>> <mailto:hotspot-runtime-dev at openjdk.java.net>
>> Subject: Re: RFR: 8027429: Add diagnostic command VM.info to get
>> hs_err print-out
>>
>> On 02/11/2015 02:45, David Holmes wrote:
>> > On 30/10/2015 8:38 PM, Mattis Castegren wrote:
>> >> Hi
>> >>
>> >> I agree that there may be things we could add to VM.info that we
>> >> couldn't add in an hs_err file due to the fact that we are
>> crashing.
>> >> However, I really do believe that we should use the same code. The
>> >> goal of the hs_err file is exactly the same as the goal of
>> VM.info,
>> >> to print as much information about the system as possible so that
>> >> when we look at an hs_err file or the output from VM.info, we get
>> as
>> >> much information about the system as possible. That way we can
>> reduce
>> >> the need to go back and forth with the customer to ask for more
>> information.
>> >>
>> >> If we enumerate what we need in VM.info, the list would look very
>> >> much like what we already do in VMError, and therefore I think we
>> >> should use the same code. If we find something that we would want,
>> >> that is not currently printed in hs_err files, I think we have
>> two options:
>> >>
>> >> 1) Add it to hs_err files as well. If it is interesting when
>> >> gathering customer information before a crash, it probably is
>> >> interesting after a crash
>> >> 2) If it for some reason cannot be captured when crashing, add it
>> to
>> >> the code with the condition that we don't print it during crash
>> time.
>> >>
>> >> I would therefore suggest the following plan of action
>> >>
>> >> 1) Review and push this fix to see if there are any technical
>> issues
>> >> with regards to thread safety, etc. I know David has had this in
>> mind
>> >> when doing the change. After that, we have a command that
>> support can
>> >> use.
>> >> 2) Think about if there is any information missing, both from
>> hs_err
>> >> files and from VM.info. Here, we may want to ask support about if
>> >> there is any additional information they need to ask for when they
>> >> get this data. If possible, this information should be added to
>> both
>> >> hs_err files and VM.info, we don't want support to have to ask for
>> >> VM.info if we already have an hs_err file. Of course, if there is
>> >> anything Dev would want to see when analyzing crashes in testing,
>> we
>> >> should add that too.
>> >
>> > I have no issue with VMError and VM.Info using the same code (where
>> > appropriate), but I think it wrong to try and implement VM.Info
>> > directly using the actual VMError mechanism. It is crude and makes
>> it
>> > difficult to make independent changes to either facility.
>> >
>> > Sorry.
>> >
>> > David
>>
>> Furthermore, we cannot just "Review and push this fix to see if
>> there are any technical issues". Two reviewers have already stated
>> that there are issues and these issues must be addressed before
>> pushing.
>>
>> The answer cannot just be that we need this feature, it must address
>> the technical issues that were reported.
>>
>> One possible answer is that we all agree that VMInfo would be unsafe.
>> The method would have to be renamed UnsafeVMInfo and it must be
>> clear that its 'impact' would be very risky, potentially crashing
>> the JVM.
>>
>> If this is sufficient for your needs then we could agree with it. My
>> only additional requirement would be to clearly mark that the method
>> has been called so that we do not waste time on non-issues if
>> calling VMInfo leads to a crash (or, even worse but less likely, to
>> a silent data corruption that may lead to problems later). The bare
>> minimum would be to double check that a crash during an execution of
>> VM.info is clearly identifiable (this is worth double checking
>> because of the use of static buffers for the error reporting, see
>> below).
>>
>> If an 'unsafe' VMInfo DCmd is not sufficient, you should wonder
>> whether a safe version of it is still a "small enhancement". A JEP
>> may be required.
>>
>> VMError::report and the methods it calls (print_on_error) have been
>> written with assumptions clearly stated in the comments above
>> VMError::report (see below the two most important ones). HotSpot
>> error reporting strategy differed from JRockit. This explains why
>> the JRockit approach cannot be applied as is.
>>
>> I agree with David that you can try leveraging some of the logic
>> from VMError::report. However, to be sure there is no
>> misunderstanding, let's be clearer about David's "(where
>> appropriate)": we would for instance have to review all the called
>> methods, check which ones purposefully ignore locking (potentially
>> crashing) and provide alternate code for the unsafe ones (which may
>> not be trivial in some cases). As a simple example, a method called
>> "print_on_error" should not be called when we are not in an
>> erroneous condition. However, in some components of the JVM, the low
>> level implementations of 'print_on_error' and 'print_info_safely'
>> could share the same code.
>>
>> Regards,
>>
>> Bertrand.
>>
>> Here are the important comments about VMError::report assumptions:
>>
>> // In general, a hang in error handler is much worse than a crash or
>> internal
>> // error, as it's harder to recover from a hang. Deadlock can happen
>> if we
>> // try to grab a lock that is already owned by current thread, or if
>> the
>> // owner is blocked forever (e.g. in os::infinite_sleep()). If
>> possible, the
>> // error handler and all the functions it called should avoid
>> grabbing any
>> // lock. An important thing to notice is that memory allocation needs
>> a
>> lock.
>> //
>> // We should avoid using large stack allocated buffers. Many errors
>> happen
>> // when stack space is already low. Making things even worse is that
>> there
>> // could be nested report_and_die() calls on stack (see above). Only
>> one
>> // thread can report error, so large buffers are statically allocated
>> in
>> data
>> // segment.
>>
>>
>>
>> >
>> >
>> >> Kind Regards
>> >> /Mattis
>> >>
>> >> -----Original Message-----
>> >> From: David Holmes
>> >> Sent: den 29 oktober 2015 23:20
>> >> To: Mattis Castegren; Coleen Phillimore;
>> >> hotspot-runtime-dev at openjdk.java.net
>> <mailto:hotspot-runtime-dev at openjdk.java.net>
>> >> Subject: Re: RFR: 8027429: Add diagnostic command VM.info to get
>> >> hs_err print-out
>> >>
>> >> On 29/10/2015 11:36 PM, Mattis Castegren wrote:
>> >>> Some background: We have this command in JRockit. The
>> information you
>> >>> gather when you crash to give a good summary of what system you
>> run
>> >>> on is pretty much exactly the information you need to get a good
>> >>> summary on a system that has not crashed. The JRockit command is
>> >>> extremely useful for support, and saves a lot of work going
>> back and
>> >>> forth asking about system information.
>> >>>
>> >>> Also, if we write something new in the hs_err file, like if
>> there has
>> >>> been any out of memory errors, we often would want the same
>> >>> information in the VM.info output. From my experience in
>> >>> Sustaining/Support, I can't think of any information you would
>> want
>> >>> in VM.info that you wouldn't also want in the hs_err file and the
>> >>> other way around, apart from details about the crash (obviously).
>> >>>
>> >>> I don't see a reason to exactly enumerate what information
>> VM.info
>> >>> should provide. From a sustaining/support perspective, we want a
>> >>> one-stop command to gather as much useful information as
>> possible,
>> >>> which is the same idea we have for the hs_err file.
>> >>
>> >> The reason to enumerate what is required is to see where that
>> >> information already exists and can be collected from. The VMError
>> >> report has to be very careful about what it does and how it does
>> it
>> >> because of the fact we may have crashed and the general process
>> state
>> >> is indeterminate. A Dcmd can simply gather whatever information is
>> >> required from the available sources in whatever order or format
>> desired.
>> >>
>> >> I have no problem with this command, just how it has been
>> proposed to
>> >> implement it.
>> >>
>> >> David
>> >>
>> >>> Kind Regards
>> >>> /Mattis
>> >>>
>> >>> -----Original Message-----
>> >>> From: David Holmes
>> >>> Sent: den 29 oktober 2015 13:08
>> >>> To: Coleen Phillimore; hotspot-runtime-dev at openjdk.java.net
>> <mailto:hotspot-runtime-dev at openjdk.java.net>
>> >>> Subject: Re: RFR: 8027429: Add diagnostic command VM.info to get
>> >>> hs_err print-out
>> >>>
>> >>> On 29/10/2015 10:02 PM, Coleen Phillimore wrote:
>> >>>>
>> >>>>
>> >>>> On 10/29/15 7:58 AM, David Holmes wrote:
>> >>>>> On 29/10/2015 9:26 PM, Coleen Phillimore wrote:
>> >>>>>>
>> >>>>>> So I actually disagree. I don't think there should be an
>> >>>>>> additional separate mechanism to get the same information
>> that we
>> >>>>>> get with hs_err reporting. I've wanted to have a feature like
>> >>>>>> this for a long time.
>> >>>>>>
>> >>>>>> I pre-reviewed this change and I thought it looked good in
>> general.
>> >>>>>> I didn't see the thread iteration problem that Fred refers to
>> >>>>>> below, but I think the individual problems can be fixed.
>> >>>>>>
>> >>>>>> The last thing I want is this code to be copied somewhere
>> else.
>> >>>>>
>> >>>>> Factored as needed not copied. VMError is not an "info"
>> reporting
>> >>>>> mechanism.
>> >>>>
>> >>>> If you look at the things that are reported in each "STEP",
>> there's a
>> >>>> small amount of code and the order is important.
>> >>>
>> >>> The order is less important in an "info" request I would think.
>> >>>
>> >>>> I'd like the vm info to use the same order and report what it
>> can do
>> >>>> safely. Refactoring 5 lines of code into functions doesn't
>> make
>> >>>> sense.
>> >>>
>> >>> I need to consider exactly what it is the "info" needs to report
>> in
>> >>> more detail. There are existing facilities (system properties,
>> >>> management
>> >>> APIs) for various bits of runtime information, which VMError
>> can't
>> >>> utilize but a Dcmd can.
>> >>>
>> >>> David
>> >>>
>> >>>> Coleen
>> >>>>
>> >>>>>
>> >>>>> David
>> >>>>>
>> >>>>>> Coleen
>> >>>>>>
>> >>>>>> On 10/28/15 8:48 PM, David Holmes wrote:
>> >>>>>>> I agree with Fred, this kind of info reporting should not be
>> >>>>>>> piggy-backed onto VMError handling for the reasons stated
>> (and the
>> >>>>>>> error handling logic is complicated enough as it is!). For
>> things
>> >>>>>>> like thread lists there are already safe management
>> functions that
>> >>>>>>> can/should be used.
>> >>>>>>>
>> >>>>>>> Thanks,
>> >>>>>>> David H.
>> >>>>>>>
>> >>>>>>> On 29/10/2015 3:29 AM, Frederic Parain wrote:
>> >>>>>>>> Hi David,
>> >>>>>>>>
>> >>>>>>>> I haven't review all the code yet, but I'm concerned with
>> the
>> >>>>>>>> fact that the diagnostic command is calling
>> VMError::report().
>> >>>>>>>> This method has been implemented to be executed in the
>> particular
>> >>>>>>>> context of fatal errors, and its usage while the VM is
>> running
>> >>>>>>>> normally seems dangerous.
>> >>>>>>>>
>> >>>>>>>> For instance, VMError::report() consciously avoids
>> grabbing locks
>> >>>>>>>> because of the risk of deadlock during the error reporting.
>> >>>>>>>> The consequence is that some data structures are browsed in
>> an
>> >>>>>>>> unsafe way. One example: VMError::report() calls
>> >>>>>>>> Threads::print_on_error() which iterates over the thread
>> list
>> >>>>>>>> *without owning the Threads_lock*.
>> >>>>>>>>
>> >>>>>>>> The implementation of the diagnostic command seems also to
>> >>>>>>>> exclude a lot of reporting from the initial
>> VMError::report()
>> >>>>>>>> method. Have you considered implementing a new MT-safe
>> reporting
>> >>>>>>>> method rather than trying to modify the special
>> VMError::report()
>> >>>>>>>> methods? (Note that some code factorization between
>> >>>>>>>> VMError::report() and this new method should be possible).
>> >>>>>>>>
>> >>>>>>>> Thanks,
>> >>>>>>>>
>> >>>>>>>> Fred
>> >>>>>>>>
>> >>>>>>>> On 28/10/2015 17:18, david buck wrote:
>> >>>>>>>>> Hi!
>> >>>>>>>>>
>> >>>>>>>>> Please review my change for this small enhancement.
>> >>>>>>>>>
>> >>>>>>>>> bug: https://bugs.openjdk.java.net/browse/JDK-8027429
>> >>>>>>>>> webrev: http://cr.openjdk.java.net/~dbuck/8027429_jdk9_01/
>> >>>>>>>>>
>> >>>>>>>>> Cheers,
>> >>>>>>>>> -Buck
>> >>>>>>>>
>> >>>>>>
>> >>>>
>>
>>
>> --
>> Bertrand Delsart, Grenoble Engineering Center
>> Oracle, 180 av. de l'Europe, ZIRST de Montbonnot
>> 38330 Montbonnot Saint Martin, FRANCE
>> bertrand.delsart at oracle.com <mailto:bertrand.delsart at oracle.com>
>> Phone : +33 4 76 18 81 23
>>
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> NOTICE: This email message is for the sole use of the intended
>> recipient(s) and may contain confidential and privileged
>> information. Any unauthorized review, use, disclosure or
>> distribution is prohibited. If you are not the intended recipient,
>> please contact the sender by reply email and destroy all copies of
>> the original message.
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>>
>>
>
> --
> Bertrand Delsart, Grenoble Engineering Center
> Oracle, 180 av. de l'Europe, ZIRST de Montbonnot
> 38330 Montbonnot Saint Martin, FRANCE
> bertrand.delsart at oracle.com Phone : +33 4 76 18 81 23
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> NOTICE: This email message is for the sole use of the intended
> recipient(s) and may contain confidential and privileged
> information. Any unauthorized review, use, disclosure or
> distribution is prohibited. If you are not the intended recipient,
> please contact the sender by reply email and destroy all copies of
> the original message.
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
More information about the hotspot-runtime-dev
mailing list