RFR: 8027429: Add diagnostic command VM.info to get hs_err print-out

Tue Nov 3 15:44:17 UTC 2015

Bertrand,

My few cents.

> One possible answer is that we all agree that VMInfo would be unsafe.
> The method would have to be renamed UnsafeVMInfo and it must be clear
> that its 'impact' would be very risky, potentially crashing the JVM.

There is no usecases for "unsafe" method. If we are allowed to crash the
app, we can just kill it and get hs_err but if we are not allowed to
crash the app we can't run anything that leave it in unpredictable
state. What the worse - "unsafe" method might become a root cause for a
crash that happens couple of days (or years) later.

I'm also against of direct use of error handler - it means that all
changes we do for VM.info should be safe on crash and it limits the
information we can work with so if we need VM.info dcmd urgently, we
should copy this code and work with a copy.

But, I would prefer to:

a) Have a JEP for this work

b) Refactor existing error handle the way that all safe code could be
re-used

c) Change crash handler to use shared code when it possible

d) Create a VM.Info dcmd

-Dmitry

On 2015-11-02 12:13, Bertrand Delsart wrote:
> On 02/11/2015 02:45, David Holmes wrote:
>> On 30/10/2015 8:38 PM, Mattis Castegren wrote:
>>> Hi
>>>
>>> I agree that there may be things we could add to VM.info that we
>>> couldn't add in an hs_err file due to the fact that we are crashing.
>>> However, I really do believe that we should use the same code. The
>>> goal of the hs_err file is exactly the same as the goal of VM.info, to
>>> print as much information about the system as possible so that when we
>>> look at an hs_err file or the output from VM.info, we get as much
>>> information about the system as possible. That way we can reduce the
>>> need to go back and forth with the customer to ask for more information.
>>>
>>> If we enumerate what we need in VM.info, the list would look very much
>>> like what we already do in VMError, and therefore I think we should
>>> use the same code. If we find something that we would want, that is
>>> not currently printed in hs_err files, I think we have two options:
>>>
>>> 1) Add it to hs_err files as well. If it is interesting when gathering
>>> customer information before a crash, it probably is interesting after
>>> a crash
>>> 2) If it for some reason cannot be captured when crashing, add it to
>>> the code with the condition that we don't print it during crash time.
>>>
>>> I would therefore suggest the following plan of action
>>>
>>> 1) Review and push this fix to see if there are any technical issues
>>> with regards to thread safety, etc. I know David has had this in mind
>>> when doing the change. After that, we have a command that support can
>>> use.
>>> 2) Think about if there is any information missing, both from hs_err
>>> files and from VM.info. Here, we may want to ask support about if
>>> there is any additional information they need to ask for when they get
>>> this data. If possible, this information should be added to both
>>> hs_err files and VM.info, we don't want support to have to ask for
>>> VM.info if we already have an hs_err file. Of course, if there is
>>> anything Dev would want to see when analyzing crashes in testing, we
>>> should add that too.
>>
>> I have no issue with VMError and VM.Info using the same code (where
>> appropriate), but I think it wrong to try and implement VM.Info directly
>> using the actual VMError mechanism. It is crude and makes it difficult
>> to make independent changes to either facility.
>>
>> Sorry.
>>
>> David
> 
> Furthermore, we cannot just "Review and push this fix to see if there
> are any technical issues". Two reviewers have already stated that there
> are issues and these issues must be addressed before pushing.
> 
> The answer cannot just be that we need this feature, it must address the
> technical issues that were reported.
> 
> One possible answer is that we all agree that VMInfo would be unsafe.
> The method would have to be renamed UnsafeVMInfo and it must be clear
> that its 'impact' would be very risky, potentially crashing the JVM.
> 
> If this is sufficient for your needs then we could agree with it. My
> only additional requirement would be to clearly mark that the method has
> been called so that we do not waste time on non-issues if calling VMInfo
> leads to a crash (or, even worse but less likely, to a silent data
> corruption that may lead to problems later). The bare minimum would be
> to double check that a crash during an execution of VM.info is clearly
> identifiable (this is worth double checking because of the use of static
> buffers for the error reporting, see below).
> 
> If an 'unsafe' VMInfo DCmd is not sufficient, you should wonder whether
> a safe version of it is still a "small enhancement". A JEP may be required.
> 
> VMError::report and the methods it calls (print_on_error) have been
> written with assumptions clearly stated in the comments above
> VMError::report (see below the two most important ones). HotSpot error
> reporting strategy differed from JRockit. This explains why the JRockit
> approach cannot be applied as is.
> 
> I agree with David that you can try leveraging some of the logic from
> VMError::report. However, to be sure there is no misunderstanding, let's
> be clearer about David's "(where appropriate)": we would for instance
> have to review all the called methods, check which ones purposefully
> ignore locking (potentially crashing) and provide alternate code for the
> unsafe ones (which may not be trivial in some cases). As a simple
> example, a method called "print_on_error" should not be called when we
> are not in an erroneous condition. However, in some components of the
> JVM, the low level implementations of 'print_on_error' and
> 'print_info_safely' could share the same code.
> 
> Regards,
> 
> Bertrand.
> 
> Here are the important comments about VMError::report assumptions:
> 
> // In general, a hang in error handler is much worse than a crash or
> internal
> // error, as it's harder to recover from a hang. Deadlock can happen if we
> // try to grab a lock that is already owned by current thread, or if the
> // owner is blocked forever (e.g. in os::infinite_sleep()). If possible,
> the
> // error handler and all the functions it called should avoid grabbing any
> // lock. An important thing to notice is that memory allocation needs a
> lock.
> //
> // We should avoid using large stack allocated buffers. Many errors happen
> // when stack space is already low. Making things even worse is that there
> // could be nested report_and_die() calls on stack (see above). Only one
> // thread can report error, so large buffers are statically allocated in
> data
> // segment.
> 
> 
> 
>>
>>
>>> Kind Regards
>>> /Mattis
>>>
>>> -----Original Message-----
>>> From: David Holmes
>>> Sent: den 29 oktober 2015 23:20
>>> To: Mattis Castegren; Coleen Phillimore;
>>> hotspot-runtime-dev at openjdk.java.net
>>> Subject: Re: RFR: 8027429: Add diagnostic command VM.info to get
>>> hs_err print-out
>>>
>>> On 29/10/2015 11:36 PM, Mattis Castegren wrote:
>>>> Some background: We have this command in JRockit. The information you
>>>> gather when you crash to give a good summary of what system you run
>>>> on is pretty much exactly the information you need to get a good
>>>> summary on a system that has not crashed. The JRockit command is
>>>> extremely useful for support, and saves a lot of work going back and
>>>> forth asking about system information.
>>>>
>>>> Also, if we write something new in the hs_err file, like if there has
>>>> been any out of memory errors, we often would want the same
>>>> information in the VM.info output. From my experience in
>>>> Sustaining/Support, I can't think of any information you would want
>>>> in VM.info that you wouldn't also want in the hs_err file and the
>>>> other way around, apart from details about the crash (obviously).
>>>>
>>>> I don't see a reason to exactly enumerate what information VM.info
>>>> should provide. From a sustaining/support perspective, we want a
>>>> one-stop command to gather as much useful information as possible,
>>>> which is the same idea we have for the hs_err file.
>>>
>>> The reason to enumerate what is required is to see where that
>>> information already exists and can be collected from. The VMError
>>> report has to be very careful about what it does and how it does it
>>> because of the fact we may have crashed and the general process state
>>> is indeterminate. A Dcmd can simply gather whatever information is
>>> required from the available sources in whatever order or format desired.
>>>
>>> I have no problem with this command, just how it has been proposed to
>>> implement it.
>>>
>>> David
>>>
>>>> Kind Regards
>>>> /Mattis
>>>>
>>>> -----Original Message-----
>>>> From: David Holmes
>>>> Sent: den 29 oktober 2015 13:08
>>>> To: Coleen Phillimore; hotspot-runtime-dev at openjdk.java.net
>>>> Subject: Re: RFR: 8027429: Add diagnostic command VM.info to get
>>>> hs_err print-out
>>>>
>>>> On 29/10/2015 10:02 PM, Coleen Phillimore wrote:
>>>>>
>>>>>
>>>>> On 10/29/15 7:58 AM, David Holmes wrote:
>>>>>> On 29/10/2015 9:26 PM, Coleen Phillimore wrote:
>>>>>>>
>>>>>>> So I actually disagree.  I don't think there should be an
>>>>>>> additional separate mechanism to get the same information that we
>>>>>>> get with hs_err reporting.  I've wanted to have a feature like
>>>>>>> this for a long time.
>>>>>>>
>>>>>>> I pre-reviewed this change and I thought it looked good in general.
>>>>>>> I didn't see the thread iteration problem that Fred refers to
>>>>>>> below, but I think the individual problems can be fixed.
>>>>>>>
>>>>>>> The last thing I want is this code to be copied somewhere else.
>>>>>>
>>>>>> Factored as needed not copied. VMError is not an "info" reporting
>>>>>> mechanism.
>>>>>
>>>>> If you look at the things that are reported in each "STEP", there's a
>>>>> small amount of code and the order is important.
>>>>
>>>> The order is less important in an "info" request I would think.
>>>>
>>>>> I'd like the vm info to use the same order and report what it can do
>>>>> safely.   Refactoring 5 lines of code into functions doesn't make
>>>>> sense.
>>>>
>>>> I need to consider exactly what it is the "info" needs to report in
>>>> more detail. There are existing facilities (system properties,
>>>> management
>>>> APIs) for various bits of runtime information, which VMError can't
>>>> utilize but a Dcmd can.
>>>>
>>>> David
>>>>
>>>>> Coleen
>>>>>
>>>>>>
>>>>>> David
>>>>>>
>>>>>>> Coleen
>>>>>>>
>>>>>>> On 10/28/15 8:48 PM, David Holmes wrote:
>>>>>>>> I agree with Fred, this kind of info reporting should not be
>>>>>>>> piggy-backed onto VMError handling for the reasons stated (and the
>>>>>>>> error handling logic is complicated enough as it is!). For things
>>>>>>>> like thread lists there are already safe management functions that
>>>>>>>> can/should be used.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> David H.
>>>>>>>>
>>>>>>>> On 29/10/2015 3:29 AM, Frederic Parain wrote:
>>>>>>>>> Hi David,
>>>>>>>>>
>>>>>>>>> I haven't review all the code yet, but I'm concerned with the
>>>>>>>>> fact that the diagnostic command is calling VMError::report().
>>>>>>>>> This method has been implemented to be executed in the particular
>>>>>>>>> context of fatal errors, and its usage while the VM is running
>>>>>>>>> normally seems dangerous.
>>>>>>>>>
>>>>>>>>> For instance, VMError::report() consciously avoids grabbing locks
>>>>>>>>> because of the risk of deadlock during the error reporting.
>>>>>>>>> The consequence is that some data structures are browsed in an
>>>>>>>>> unsafe way. One example: VMError::report() calls
>>>>>>>>> Threads::print_on_error() which iterates over the thread list
>>>>>>>>> *without owning the Threads_lock*.
>>>>>>>>>
>>>>>>>>> The implementation of the diagnostic command seems also to
>>>>>>>>> exclude a lot of reporting from the initial VMError::report()
>>>>>>>>> method. Have you considered implementing a new MT-safe reporting
>>>>>>>>> method rather than trying to modify the special VMError::report()
>>>>>>>>> methods? (Note that some code factorization between
>>>>>>>>> VMError::report() and this new method should be possible).
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Fred
>>>>>>>>>
>>>>>>>>> On 28/10/2015 17:18, david buck wrote:
>>>>>>>>>> Hi!
>>>>>>>>>>
>>>>>>>>>> Please review my change for this small enhancement.
>>>>>>>>>>
>>>>>>>>>> bug: https://bugs.openjdk.java.net/browse/JDK-8027429
>>>>>>>>>> webrev: http://cr.openjdk.java.net/~dbuck/8027429_jdk9_01/
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> -Buck
>>>>>>>>>
>>>>>>>
>>>>>
> 
> 

-- 
Dmitry Samersoff
Oracle Java development team, Saint Petersburg, Russia
* I would love to change the world, but they won't give me the sources.