Project to improve hs_err files

Mattis Castegren mattis.castegren at oracle.com
Tue Sep 10 06:32:08 PDT 2013


Hi

I don't think these problems will be critical:

"1. As David already mentioned, we are in the context of a crash
handler, so by increasing the number of calls we make, we increase the
probability of ultimately getting nothing."

Right now I am gathering feedback on what information would really help different teams. I am not investigating how feasible the changes are, that will be the next step. Some things, like printing the number of OOM Errors would be a very easy implementation that should not be a problem in a crash handler. Other things, like printing assembler around the PC may be too risky. Some of these suggestions may even have to be solved with a script you can run on the file afterwards, though that will make it a lot harder to do quick triaging, and will also make it a lot harder for end users to use. As David pointed out, there may be ways to improve stability, for example by having a crash dumper thread on standby.

"2. With too much information, the hs_err file becomes unreadable and
unsuitable for a quick, pre-sustaining check on the support side."

Yes, that is why I ask for feedback on what information would be really useful, not only for developers but for support and sustaining. Today, hs_err files start by dumping registers and stacks as raw hex data, while information that support really needs, like system information and VM arguments, is hidden in the middle of the file. We are currently not at a place where the hs_err files are easily readable. That is why my main focus is to ask support what they want, so that we can make the hs_err files easier for quick, pre-sustaining (and even pre-support) checks. This project actually aims to solve this problem, not add to it.

"3. A crash is always a bad situation from a security standpoint, so
with too much information the customer might become reluctant to share
it, and we need additional effort to protect the data."

Considering that we will need core files for most debugging anyway, I don't think this will be an issue. None of the current suggestions for improvements contain customer-sensitive information. Once we do the implementation, this will probably be split into several enhancement requests. Securing sensitive data can be one thing to look at when we work on these enhancements, but I don't think it makes sense to remove important information just because the customer MAY not want to share paths, JVM flags, etc. Since this is in clear text, they can make a pass themselves and remove things if they really want.

Kind Regards
/Mattis


-----Original Message-----
From: Dmitry Samersoff 
Sent: den 9 september 2013 22:16
To: Mattis Castegren
Cc: serviceability-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net
Subject: Re: Project to improve hs_err files

Mattis,

I see three problems here:

1. As David already mentioned, we are in the context of a crash
handler, so by increasing the number of calls we make, we increase the
probability of ultimately getting nothing.

2. With too much information, the hs_err file becomes unreadable and
unsuitable for a quick, pre-sustaining check on the support side.

3. A crash is always a bad situation from a security standpoint, so
with too much information the customer might become reluctant to share
it, and we need additional effort to protect the data.

***

So the first thing I would like to see is some kind of staging, i.e. I
would prefer to write a minimal hs_err.log, similar to the one we have
in 1.5, and then attempt to write as much information as possible to a
separate file, with an option to turn it off.

Also, in my opinion it would be good to have some industry-standard
monitoring options (SNMP, syslog, etc.) available at the HotSpot level,
to be able to report a JVM crash in a sysadmin-friendly manner.

-Dmitry

On 2013-09-06 15:32, Mattis Castegren wrote:
> Hi (re-sending mail after joining the mailing lists, sorry if you get
> this mail twice)
> 
>  
> 
> My name is Mattis and I work with the JVM sustaining engineering team at
> Oracle. I am starting up a project to improve the data we get in the
> hs_err files when the JVM crashes. I have filed a JEP, but it has not
> yet been approved. See my attachment for the initial draft including
> motivation and scope. The main goal is not to completely solve new bugs
> by just using an hs_err file, but to help users, support and development
> to debug their problems, find duplicates of fixed bugs or application
> errors. It is also to provide more information that can be helpful when
> doing core file debugging on new issues.
> 
>  
> 
> The first step in this project is to gather suggestions of data that
> could help us when we see crashes. I am talking to the rest of the
> sustaining engineering team and also to the Java support team, but I
> also wanted to ask if anyone on these aliases have any thoughts on what
> data would help when we get an hs_err file. I'm looking for both big and
> small suggestions. Deciding if the suggestions are feasible or not can
> be discussed later.
> 
> Suggestions so far:
> 
>  
> 
> * Bigger changes
> 
> - Re-structure the hs_err file to put more important data first, maybe
> include a summary header. End users can't be expected to read through
> the entire hs_err file. If we can put important hints about what went
> wrong at the top, that could save a lot of time. Also, many web tools
> truncate hs_err files, so we may never see the end of the files. This
> would also help us triage incoming incidents faster.
> 
> - Look at context-sensitive data. If we crash when compiling a method,
> what additional data could we provide? Can we provide anything when
> crashing in GC, or when running interpreted code?
> 
> - Could we verify data structures? If we could report that some GC
> table had been corrupted, that could hint at the problem as well as
> help with finding duplicates and known issues.
> 
> - Suggest workarounds/first debug steps. Sometimes we immediately know
> what the first debug step is. If we crash when running a compiled
> method, try to disable compilation of that method. If we crash after
> several OOMs, try increasing the Java heap or lowering heap usage. If
> we could print these first steps, this could lead to bug filers
> providing more data when they file a bug. We could also highlight
> "dangerous" options, like -Xverify:none.
> 
>  
> 
> * Additional Data
> 
> - The GC Strategy used
> 
> - The classpath variable
> 
> - Have we seen any OutOfMemoryErrors, StackOverflowErrors or C heap
> allocation failures?
> 
> - Include hypervisor information. If we run on Xen or VMware, that is
> very interesting data.
> 
> - Heap usage and permgen usage (JDK 7) after the last full GC. If the
> heap is 99% full, that is a good hint that we might have run into a
> corner case in OOM handling, for example.
> 
> - Write the names of the last exceptions thrown. We currently list the
> addresses of the last exceptions; giving the type instead would be
> very useful. Did we crash after 10 StackOverflowErrors? That's a good
> hint about what went wrong.
> 
> - Make the GC Heap History easier to read. It's a great feature, but
> it is somewhat hard to tell whether an event is a full GC, a young
> collection, etc.
> 
> - Assembler instructions in dump files. Right now, we print the code
> around the IP in hex format, which isn't very helpful. If we could get
> the crashing instructions as assembler instead, that would help a lot.
> 
> - Growing and shrinking of the heap. We have seen a few issues when
> growing or shrinking the Java heap. Saving the last few increases and
> decreases with a timestamp would help determine whether this could be
> an issue.
> 
> - Highlight if third party JNI libs have been used
> 
>  
> 
> Please let me know if you have ideas of what information would make
> hs_err files more useful, and I will add them to my list.
> 
>  
> 
> Kind Regards
> 
> /Mattis
> 


-- 
Dmitry Samersoff
Oracle Java development team, Saint Petersburg, Russia
* I would love to change the world, but they won't give me the source code.
