Project to improve hs_err files

Fri Sep 6 05:44:00 PDT 2013

Hi Mattis,

just some quick comments:

On Fri, Sep 6, 2013 at 1:32 PM, Mattis Castegren
<mattis.castegren at oracle.com> wrote:
>
> Hi (re-sending mail after joining the mailing lists, sorry if you get this mail twice)
>
>
>
> My name is Mattis and I work with the JVM sustaining engineering team at Oracle. I am starting up a project to improve the data we get in the hs_err files when the JVM crashes. I have filed a JEP, but it has not yet been approved. See my attachment for the initial draft including motivation and scope.

There is already a similar JEP: JEP 146: Improve Fatal Error Logs
(http://openjdk.java.net/jeps/146)
Are they somehow related? Maybe the efforts should be combined?

>
> The main goal is not to completely solve new bugs by just using an hs_err file, but to help users, support and development to debug their problems, find duplicates of fixed bugs or application errors. It is also to provide more information that can be helpful when doing core file debugging on new issues.
>
>
>
> The first step in this project is to gather suggestions of data that could help us when we see crashes. I am talking to the rest of the sustaining engineering team and also to the Java support team, but I also wanted to ask if anyone on these aliases has any thoughts on what data would help when we get an hs_err file. I’m looking for both big and small suggestions. Deciding if the suggestions are feasible or not can be discussed later.
>
> Suggestions so far:
>
>
>
> * Bigger changes
>
> - Re-structure hs_err file to put more important data first, maybe include a summary header. End users can’t be expected to read through the entire hs_err file. If we can put important hints of what went wrong at the top, that could save a lot of time. Also, many web tools truncate hs_err files, so we may never see the end of the files. This would also help us triage incoming incidents faster.
>
> - Look at context sensitive data. If we crash when compiling a method, what additional data could we provide. Can we provide anything when crashing in GC, or when running interpreted code?
>
> - Could we verify data structures? If we could list that some GC table had been corrupted, that could give a hint at the problem as well as help with finding duplicates and known issues
>
> - Suggest workarounds/first debug steps. Sometimes we immediately know what the first debug step is. If we crash when running a compiled method, try to disable compilation of that method. If we crash after several OOMs, try increasing the Java heap or lower heap usage. If we could print these first steps, this could lead to bug filers providing more data when they file a bug. We could also highlight "dangerous" options, like -Xverify:none
>
>

- Catch crashes in the compiler and recompile the same method with
full debug output turned on (i.e. dump the graphs of every
optimization step until the crash)

>
> * Additional Data
>
> - The GC Strategy used
>
> - The classpath variable
>
> - Have we seen any OutOfMemoryErrors, StackOverflowErrors or C Heap allocation failures?
>
> - Include Hypervisor information. If we run on Xen or VMware, that is very interesting data
>
> - Heap usage and permgen usage (JDK7) after the last full GC. If the heap is 99% full, that is a good hint that we might have run into a corner case in OOM handling for example
>
> - Write the names of the last exceptions thrown. We currently list the addresses of the last exceptions. Giving the type instead would be very good. Did we crash after 10 StackOverflowErrors? That’s a good hint of what went wrong
>
> - Make the GC Heap History easier to read. It’s a great feature, but it is somewhat hard to tell whether an event is a full GC, a YC, etc.
>
> - Assembler instructions in dump files. Right now, we print the code around the IP in hex format. This isn’t very helpful. If we could get the crashing instructions as assembler instead, that would help a lot
>

This can easily be done with the hsdis library. Unfortunately, the
hsdis library cannot be bundled with a commercial JDK because it is
based on GNU binutils, which are GPL-only.

But we could do two things:
 - provide hsdis as a separate download (but this wouldn't help after a crash)
 - provide a simple tool based on hsdis which can post-process the
hs_err-file and translate the hex-dump into readable assembler.
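
As a rough sketch of the first half of such a post-processing tool, the
snippet below pulls the raw code bytes out of the "Instructions:" section
of an hs_err file so they could be handed to a disassembler. It assumes
the section consists of lines of the form "0x<address>:   <hex bytes>";
the exact layout may differ between JDK versions and platforms. The class
name HsErrInstructions and its methods are made up for illustration, and
the actual disassembly step (hsdis or any other disassembler) is left out.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any JDK tool.
final class HsErrInstructions {
    // Assumed line shape: "0x00007f0000001000:   48 8b 05 ff"
    private static final Pattern DUMP_LINE =
        Pattern.compile("(0x[0-9a-fA-F]+):\\s+((?:[0-9a-fA-F]{2}\\s*)+)");

    // Returns the bytes listed in the first "Instructions:" section,
    // or an empty array if no such section is found.
    static byte[] extract(List<String> lines) {
        List<Byte> out = new ArrayList<>();
        boolean inSection = false;
        for (String line : lines) {
            if (line.startsWith("Instructions:")) {
                inSection = true;
                continue;
            }
            if (!inSection) {
                continue;
            }
            Matcher m = DUMP_LINE.matcher(line.trim());
            if (!m.matches()) {
                // Tolerate blank lines before the dump, stop at the first
                // non-dump line after it.
                if (out.isEmpty() && line.trim().isEmpty()) {
                    continue;
                }
                break;
            }
            for (String tok : m.group(2).trim().split("\\s+")) {
                out.add((byte) Integer.parseInt(tok, 16));
            }
        }
        byte[] bytes = new byte[out.size()];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = out.get(i);
        }
        return bytes;
    }

    // Prints the extracted bytes as one hex string.
    public static void main(String[] args) throws Exception {
        List<String> lines =
            Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : extract(lines)) {
            sb.append(String.format("%02x", b & 0xff));
        }
        System.out.println(sb);
    }
}
```

Run it with the path of an hs_err file as the only argument; the resulting
hex string would still need to be fed to a disassembler together with the
start address from the first dump line.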

> - Growing and shrinking of heap. We have seen a few issues when growing or shrinking the java heap. Saving the last few increases and decreases with a timestamp would help finding out if this could be an issue
>
> - Highlight if third party JNI libs have been used
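
Some of this data (the collectors in use, pool usage after the last GC) is
already visible from Java code through the standard java.lang.management
API. The sketch below only illustrates what kind of information is meant;
it is not how the VM would write it into an hs_err file, since the error
handler runs inside the VM and cannot call Java code after a crash. The
class name CrashContextSketch and its methods are made up for this example.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only.
final class CrashContextSketch {
    // Names of the active collectors, which hint at the GC strategy.
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc
                : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    // Usage of each memory pool as measured after its most recent collection.
    static List<String> usageAfterLastGc() {
        List<String> out = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            // getCollectionUsage() returns null if the pool does not support it.
            MemoryUsage u = pool.getCollectionUsage();
            if (u == null) {
                continue;
            }
            long max = u.getMax(); // -1 if the maximum is undefined
            String pct = max > 0
                ? String.format("%.1f%%", 100.0 * u.getUsed() / max)
                : "n/a";
            out.add(pool.getName() + ": used=" + u.getUsed()
                + " max=" + max + " (" + pct + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println("Collectors: " + collectorNames());
        for (String line : usageAfterLastGc()) {
            System.out.println(line);
        }
    }
}
```
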
>
>
>
> Please let me know if you have ideas of what information would make hs_err files more useful, and I will add them to my list.
>
>
>
> Kind Regards
>
> /Mattis