RFR: 8027429: Add diagnostic command VM.info to get hs_err print-out
Coleen Phillimore
coleen.phillimore at oracle.com
Tue Nov 3 15:25:15 UTC 2015
Hi,
I've done a bit of refactoring with this change and have tested it with
and without intentional crashes to the VM.info reporting. If it crashes
within VM.info, there's a stack trace that clearly shows where it comes
from in the hs_err file. And the hs_err_pid file is correctly
produced. If we can make the jcmd come with a warning that it *could*
potentially crash the vm, I think this feature is a good addition for
sustaining.
Without crashing, VM.info gives a lot of information that we normally
give with the hs_err file. The reason for the careful STEP'ping in
hs_err production is that the JVM has already crashed so we can't count
on anything. The VM.info command is designed to be run on a moderately
stable JVM so is less likely to crash on the things that we're printing out.
I think we should fix the Threads walking code that Fred pointed out to
take the Threads_lock, but I don't think we should *copy* code for
VM.info or copy print_on_error functions.
Thanks,
Coleen
On 11/3/15 10:18 AM, Bertrand Delsart wrote:
> On 03/11/2015 15:06, Bertrand Delsart wrote:
>> Thanks Thomas,
>>
>> I fully agree it can work. The problem is that it may also fail if we
>> are not careful.
>>
>> Fred noticed for instance that "Threads::print_on_error" is not
>> protected by _vm_info_cmd (step 290). This is an obvious source of
>> potential crashes if you take into account the fact that the
>> print_on_error methods were designed not to use locks.
>
> Fred just pointed out that this may be OK since _thread is NULL.
>
> I'm not sure of why we do not print the threads when there is notion
> of current thread (and I would wonder whether we should print them for
> VMInfo) but luckily that may have been sufficient to prevent the crash.
>
> However, I still do not like it from a design point of view for the
> other reasons stated in my e-mail.
>
> Bertrand.
>
>>
>> It seems that the focus has been on removing what was clearly useless,
>> not what might be dangerous due to how the method (and what it calls)
>> has been designed.
>>
>> In addition, adding _vm_info_cmd is IMHO not sufficient unless you pass
>> that argument to all layers. This is the biggest design issue. For
>> instance all calls to print_on_error could have this additional argument
>> so that the different components can adapt their behavior (and are aware
>> there are now two use cases). I would start by Threads::print_on_error,
>> grabbing the threads lock in the info case, but the same if true for all
>> methods called by VMError::report. I would go as far as adding a comment
>> in all print_on_error methods stating why the method is safe for both
>> use cases (e.g. why no locks are needed) or adding
>> "assert(!vm_info_cmd)" for methods VMError:report is supposed to skip.
>> The current design, where the various components may not be aware that
>> they are called for two different purposes, is not robust enough.
>>
>> Note also that being 'robust enough for the "crash" case' is not a
>> strong enough guarantee for a general purpose info DCmd... particularly
>> if, as you point out, an error that occurs in one of the info STEP
>> crashes the VM instead of just bringing us to the next error reporting
>> STEP. Now, as stated before, I could agree if we use this DCmd only when
>> a crash is acceptable.
>>
>> Regards,
>>
>> Bertrand
>>
>> On 02/11/2015 14:32, Thomas Stüfe wrote:
>>> Hi all,
>>>
>>> we have implemented in the SAP JVM the same mechanism. Our
>>> implementation is in use since a number of years and is quite
>>> useful. So
>>> I would agree with Coleen.
>>>
>>> However, there are some things to keep in mind when doing this:
>>>
>>> Error handling is different between the "info-only" and "crash"
>>> case. In
>>> "info-only" case, VMError::report() is called without enclosing
>>> VMError::report_and_die() and we run with the normal signal handler
>>> intact, not with the secondary signal handler. So, normal signal
>>> handling works, which is good. But it also means that any error inside
>>> one of the error reporting STEPs will bring down the VM, instead of
>>> just
>>> bringing us to the next error reporting STEP ("Error occurred during
>>> error reporting").
>>>
>>> The problem is that a number of error reporting steps are a bit risky
>>> and written with the assurance that the worst which can happen is that
>>> the STEP will be interrupted. These risky steps should not be executed
>>> in the "only-info" case.
>>>
>>> But the proposed change already does this, it introduces "_vm_info_cmd"
>>> and conditionally excludes dangerous or pointless error reporting STEPs
>>> (e.g. callstack is pointless).
>>>
>>> As far as I can see, the other concerns concern only the "crash" case
>>> and simply do not apply to the "info-only" case. If an error reporting
>>> step is robust enough for the "crash" case (low stack usage, not
>>> allocating memory to avoid deadlocks etc), it should be safe enough for
>>> the "info" case.
>>>
>>> Kind Regards, Thomas
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Mon, Nov 2, 2015 at 12:31 PM, Mattis Castegren
>>> <mattis.castegren at oracle.com <mailto:mattis.castegren at oracle.com>>
>>> wrote:
>>>
>>> Hi
>>>
>>> Of course, when I wrote "Review and push this fix" I meant doing a
>>> thorough technical review. We do not want VM.info to be unsafe, and
>>> I will let David reply to the technical questions that has been
>>> raised. I know he did consider some of these issues, but last week
>>> he was on Java one, and this week he is out. I will let him comment
>>> once he gets back.
>>>
>>> Kind Regards
>>> /Mattis
>>>
>>> -----Original Message-----
>>> From: Bertrand Delsart
>>> Sent: den 2 november 2015 10:13
>>> To: David Holmes; Mattis Castegren; Coleen Phillimore;
>>> hotspot-runtime-dev at openjdk.java.net
>>> <mailto:hotspot-runtime-dev at openjdk.java.net>
>>> Subject: Re: RFR: 8027429: Add diagnostic command VM.info to get
>>> hs_err print-out
>>>
>>> On 02/11/2015 02:45, David Holmes wrote:
>>> > On 30/10/2015 8:38 PM, Mattis Castegren wrote:
>>> >> Hi
>>> >>
>>> >> I agree that there may be things we could add to VM.info
>>> that we
>>> >> couldn't add in an hs_err file due to the fact that we are
>>> crashing.
>>> >> However, I really do believe that we should use the same code.
>>> The
>>> >> goal of the hs_err file is exactly the same as the goal of
>>> VM.info,
>>> >> to print as much information about the system as possible so
>>> that
>>> >> when we look at an hs_err file or the output from VM.info, we
>>> get as
>>> >> much information about the system as possible. That way we can
>>> reduce
>>> >> the need to go back and forth with the customer to ask for more
>>> information.
>>> >>
>>> >> If we enumerate what we need in VM.info, the list would look
>>> very
>>> >> much like what we already do in VMError, and therefore I
>>> think we
>>> >> should use the same code. If we find something that we would
>>> want,
>>> >> that is not currently printed in hs_err files, I think we have
>>> two options:
>>> >>
>>> >> 1) Add it to hs_err files as well. If it is interesting when
>>> >> gathering customer information before a crash, it probably is
>>> >> interesting after a crash
>>> >> 2) If it for some reason cannot be captured when crashing, add
>>> it to
>>> >> the code with the condition that we don't print it during crash
>>> time.
>>> >>
>>> >> I would therefore suggest the following plan of action
>>> >>
>>> >> 1) Review and push this fix to see if there are any technical
>>> issues
>>> >> with regards to thread safety, etc. I know David has had
>>> this in
>>> mind
>>> >> when doing the change. After that, we have a command that
>>> support can
>>> >> use.
>>> >> 2) Think about if there is any information missing, both from
>>> hs_err
>>> >> files and from VM.info. Here, we may want to ask support
>>> about if
>>> >> there is any additional information they need to ask for when
>>> they
>>> >> get this data. If possible, this information should be added
>>> to both
>>> >> hs_err files and VM.info, we don't want support to have to ask
>>> for
>>> >> VM.info if we already have an hs_err file. Of course, if
>>> there is
>>> >> anything Dev would want to see when analyzing crashes in
>>> testing, we
>>> >> should add that too.
>>> >
>>> > I have no issue with VMError and VM.Info using the same code
>>> (where
>>> > appropriate), but I think it wrong to try and implement VM.Info
>>> > directly using the actual VMError mechanism. It is crude and
>>> makes it
>>> > difficult to make independent changes to either facility.
>>> >
>>> > Sorry.
>>> >
>>> > David
>>>
>>> Furthermore, we cannot just "Review and push this fix to see if
>>> there are any technical issues". Two reviewers have already stated
>>> that there are issues and these issues must be addressed before
>>> pushing.
>>>
>>> The answer cannot just be that we need this feature, it must
>>> address
>>> the technical issues that were reported.
>>>
>>> One possible answer is that we all agree that VMInfo would be
>>> unsafe.
>>> The method would have to be renamed UnsafeVMInfo and it must be
>>> clear that its 'impact' would be very risky, potentially crashing
>>> the JVM.
>>>
>>> If this is sufficient for your needs then we could agree with
>>> it. My
>>> only additional requirement would be to clearly mark that the
>>> method
>>> has been called so that we do not waste time on non-issues if
>>> calling VMInfo leads to a crash (or, even worse but less likely, to
>>> a silent data corruption that may lead to problems later). The bare
>>> minimum would be to double check that a crash during an
>>> execution of
>>> VM.info is clearly identifiable (this is worth double checking
>>> because of the use of static buffers for the error reporting, see
>>> below).
>>>
>>> If an 'unsafe' VMInfo DCmd is not sufficient, you should wonder
>>> whether a safe version of it is still a "small enhancement". A JEP
>>> may be required.
>>>
>>> VMError::report and the methods it calls (print_on_error) have been
>>> written with assumptions clearly stated in the comments above
>>> VMError::report (see below the two most important ones). HotSpot
>>> error reporting strategy differed from JRockit. This explains why
>>> the JRockit approach cannot be applied as is.
>>>
>>> I agree with David that you can try leveraging some of the logic
>>> from VMError::report. However, to be sure there is no
>>> misunderstanding, let's be clearer about David's "(where
>>> appropriate)": we would for instance have to review all the called
>>> methods, check which ones purposefully ignore locking (potentially
>>> crashing) and provide alternate code for the unsafe ones (which may
>>> not be trivial in some cases). As a simple example, a method called
>>> "print_on_error" should not be called when we are not in an
>>> erroneous condition. However, in some components of the JVM, the
>>> low
>>> level implementations of 'print_on_error' and 'print_info_safely'
>>> could share the same code.
>>>
>>> Regards,
>>>
>>> Bertrand.
>>>
>>> Here are the important comments about VMError::report assumptions:
>>>
>>> // In general, a hang in error handler is much worse than a
>>> crash or
>>> internal
>>> // error, as it's harder to recover from a hang. Deadlock can
>>> happen
>>> if we
>>> // try to grab a lock that is already owned by current thread, or
>>> if the
>>> // owner is blocked forever (e.g. in os::infinite_sleep()). If
>>> possible, the
>>> // error handler and all the functions it called should avoid
>>> grabbing any
>>> // lock. An important thing to notice is that memory allocation
>>> needs a
>>> lock.
>>> //
>>> // We should avoid using large stack allocated buffers. Many errors
>>> happen
>>> // when stack space is already low. Making things even worse is
>>> that
>>> there
>>> // could be nested report_and_die() calls on stack (see above).
>>> Only one
>>> // thread can report error, so large buffers are statically
>>> allocated in
>>> data
>>> // segment.
>>>
>>>
>>>
>>> >
>>> >
>>> >> Kind Regards
>>> >> /Mattis
>>> >>
>>> >> -----Original Message-----
>>> >> From: David Holmes
>>> >> Sent: den 29 oktober 2015 23:20
>>> >> To: Mattis Castegren; Coleen Phillimore;
>>> >> hotspot-runtime-dev at openjdk.java.net
>>> <mailto:hotspot-runtime-dev at openjdk.java.net>
>>> >> Subject: Re: RFR: 8027429: Add diagnostic command VM.info to
>>> get
>>> >> hs_err print-out
>>> >>
>>> >> On 29/10/2015 11:36 PM, Mattis Castegren wrote:
>>> >>> Some background: We have this command in JRockit. The
>>> information you
>>> >>> gather when you crash to give a good summary of what system
>>> you run
>>> >>> on is pretty much exactly the information you need to get a
>>> good
>>> >>> summary on a system that has not crashed. The JRockit
>>> command is
>>> >>> extremely useful for support, and saves a lot of work going
>>> back and
>>> >>> forth asking about system information.
>>> >>>
>>> >>> Also, if we write something new in the hs_err file, like if
>>> there has
>>> >>> been any out of memory errors, we often would want the same
>>> >>> information in the VM.info output. From my experience in
>>> >>> Sustaining/Support, I can't think of any information you
>>> would want
>>> >>> in VM.info that you wouldn't also want in the hs_err file and
>>> the
>>> >>> other way around, apart from details about the crash
>>> (obviously).
>>> >>>
>>> >>> I don't see a reason to exactly enumerate what information
>>> VM.info
>>> >>> should provide. From a sustaining/support perspective, we
>>> want a
>>> >>> one-stop command to gather as much useful information as
>>> possible,
>>> >>> which is the same idea we have for the hs_err file.
>>> >>
>>> >> The reason to enumerate what is required is to see where that
>>> >> information already exists and can be collected from. The
>>> VMError
>>> >> report has to be very careful about what it does and how it
>>> does it
>>> >> because of the fact we may have crashed and the general process
>>> state
>>> >> is indeterminate. A Dcmd can simply gather whatever
>>> information is
>>> >> required from the available sources in whatever order or format
>>> desired.
>>> >>
>>> >> I have no problem with this command, just how it has been
>>> proposed to
>>> >> implement it.
>>> >>
>>> >> David
>>> >>
>>> >>> Kind Regards
>>> >>> /Mattis
>>> >>>
>>> >>> -----Original Message-----
>>> >>> From: David Holmes
>>> >>> Sent: den 29 oktober 2015 13:08
>>> >>> To: Coleen Phillimore; hotspot-runtime-dev at openjdk.java.net
>>> <mailto:hotspot-runtime-dev at openjdk.java.net>
>>> >>> Subject: Re: RFR: 8027429: Add diagnostic command VM.info
>>> to get
>>> >>> hs_err print-out
>>> >>>
>>> >>> On 29/10/2015 10:02 PM, Coleen Phillimore wrote:
>>> >>>>
>>> >>>>
>>> >>>> On 10/29/15 7:58 AM, David Holmes wrote:
>>> >>>>> On 29/10/2015 9:26 PM, Coleen Phillimore wrote:
>>> >>>>>>
>>> >>>>>> So I actually disagree. I don't think there should be an
>>> >>>>>> additional separate mechanism to get the same information
>>> that we
>>> >>>>>> get with hs_err reporting. I've wanted to have a feature
>>> like
>>> >>>>>> this for a long time.
>>> >>>>>>
>>> >>>>>> I pre-reviewed this change and I thought it looked good in
>>> general.
>>> >>>>>> I didn't see the thread iteration problem that Fred
>>> refers to
>>> >>>>>> below, but I think the individual problems can be fixed.
>>> >>>>>>
>>> >>>>>> The last thing I want is this code to be copied somewhere
>>> else.
>>> >>>>>
>>> >>>>> Factored as needed not copied. VMError is not an "info"
>>> reporting
>>> >>>>> mechanism.
>>> >>>>
>>> >>>> If you look at the things that are reported in each "STEP",
>>> there's a
>>> >>>> small amount of code and the order is important.
>>> >>>
>>> >>> The order is less important in an "info" request I would
>>> think.
>>> >>>
>>> >>>> I'd like the vm info to use the same order and report what it
>>> can do
>>> >>>> safely. Refactoring 5 lines of code into functions doesn't
>>> make
>>> >>>> sense.
>>> >>>
>>> >>> I need to consider exactly what it is the "info" needs to
>>> report in
>>> >>> more detail. There are existing facilities (system properties,
>>> >>> management
>>> >>> APIs) for various bits of runtime information, which VMError
>>> can't
>>> >>> utilize but a Dcmd can.
>>> >>>
>>> >>> David
>>> >>>
>>> >>>> Coleen
>>> >>>>
>>> >>>>>
>>> >>>>> David
>>> >>>>>
>>> >>>>>> Coleen
>>> >>>>>>
>>> >>>>>> On 10/28/15 8:48 PM, David Holmes wrote:
>>> >>>>>>> I agree with Fred, this kind of info reporting should
>>> not be
>>> >>>>>>> piggy-backed onto VMError handling for the reasons stated
>>> (and the
>>> >>>>>>> error handling logic is complicated enough as it is!). For
>>> things
>>> >>>>>>> like thread lists there are already safe management
>>> functions that
>>> >>>>>>> can/should be used.
>>> >>>>>>>
>>> >>>>>>> Thanks,
>>> >>>>>>> David H.
>>> >>>>>>>
>>> >>>>>>> On 29/10/2015 3:29 AM, Frederic Parain wrote:
>>> >>>>>>>> Hi David,
>>> >>>>>>>>
>>> >>>>>>>> I haven't review all the code yet, but I'm concerned
>>> with the
>>> >>>>>>>> fact that the diagnostic command is calling
>>> VMError::report().
>>> >>>>>>>> This method has been implemented to be executed in the
>>> particular
>>> >>>>>>>> context of fatal errors, and its usage while the VM is
>>> running
>>> >>>>>>>> normally seems dangerous.
>>> >>>>>>>>
>>> >>>>>>>> For instance, VMError::report() consciously avoids
>>> grabbing locks
>>> >>>>>>>> because of the risk of deadlock during the error
>>> reporting.
>>> >>>>>>>> The consequence is that some data structures are browsed
>>> in an
>>> >>>>>>>> unsafe way. One example: VMError::report() calls
>>> >>>>>>>> Threads::print_on_error() which iterates over the thread
>>> list
>>> >>>>>>>> *without owning the Threads_lock*.
>>> >>>>>>>>
>>> >>>>>>>> The implementation of the diagnostic command seems
>>> also to
>>> >>>>>>>> exclude a lot of reporting from the initial
>>> VMError::report()
>>> >>>>>>>> method. Have you considered implementing a new MT-safe
>>> reporting
>>> >>>>>>>> method rather than trying to modify the special
>>> VMError::report()
>>> >>>>>>>> methods? (Note that some code factorization between
>>> >>>>>>>> VMError::report() and this new method should be
>>> possible).
>>> >>>>>>>>
>>> >>>>>>>> Thanks,
>>> >>>>>>>>
>>> >>>>>>>> Fred
>>> >>>>>>>>
>>> >>>>>>>> On 28/10/2015 17:18, david buck wrote:
>>> >>>>>>>>> Hi!
>>> >>>>>>>>>
>>> >>>>>>>>> Please review my change for this small enhancement.
>>> >>>>>>>>>
>>> >>>>>>>>> bug: https://bugs.openjdk.java.net/browse/JDK-8027429
>>> >>>>>>>>> webrev:
>>> http://cr.openjdk.java.net/~dbuck/8027429_jdk9_01/
>>> >>>>>>>>>
>>> >>>>>>>>> Cheers,
>>> >>>>>>>>> -Buck
>>> >>>>>>>>
>>> >>>>>>
>>> >>>>
>>>
>>>
>>> --
>>> Bertrand Delsart, Grenoble Engineering Center
>>> Oracle, 180 av. de l'Europe, ZIRST de Montbonnot
>>> 38330 Montbonnot Saint Martin, FRANCE
>>> bertrand.delsart at oracle.com <mailto:bertrand.delsart at oracle.com>
>>> Phone : +33 4 76 18 81 23
>>>
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> NOTICE: This email message is for the sole use of the intended
>>> recipient(s) and may contain confidential and privileged
>>> information. Any unauthorized review, use, disclosure or
>>> distribution is prohibited. If you are not the intended recipient,
>>> please contact the sender by reply email and destroy all copies of
>>> the original message.
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>
>>>
>>
>>
>
>
More information about the hotspot-runtime-dev
mailing list