RFR(M): 8210754: print_location is not reliable enough (printing register info)

Doerr, Martin martin.doerr at sap.com
Tue Sep 18 07:28:16 UTC 2018


Hi David,

I guess you wanted to have a more complete description of what my change does. Here it is:
- The main part of it is to replace the very poor is_oop check by much more reliable checks to determine if we can dump a Java object:
  - Check if the required fields point into the expected memory regions.
  - Check if these regions are still alive.
- Add support for compressed oops and class pointers which need to get decoded first. (Raw decoding without assertions.)
- Move the lengthy code blob dumping code out of os.cpp (unchanged).
- Add some comments and line breaks to improve readability.

Best regards,
Martin


-----Original Message-----
From: Doerr, Martin 
Sent: Montag, 17. September 2018 09:57
To: 'David Holmes' <david.holmes at oracle.com>; hotspot-runtime-dev at openjdk.java.net
Subject: RE: RFR(M): 8210754: print_location is not reliable enough (printing register info)

Hi David,

the kind of errors which I'd like to address are basically during the process of dumping Java objects.

The current "is_oop" check assumes almost every Java heap address to be a valid oop. This check is not sufficient for dumping it.
Please note that the dumping code also accesses the metadata which may get modified concurrently. So my changes don't really make that worse.

I'm assuming that Java heap modifications happen far more often than modifications of the small piece of the metaspace which needs to get accessed in order to check a few Java objects. 
A crash in the iteration due to concurrent structural modification sounds like a very unlikely situation in which we might still get "error occurred during error reporting". IMHO nothing which sounds worse than what we currently have.

Best regards,
Martin


-----Original Message-----
From: David Holmes <david.holmes at oracle.com> 
Sent: Montag, 17. September 2018 08:54
To: Doerr, Martin <martin.doerr at sap.com>; hotspot-runtime-dev at openjdk.java.net
Subject: Re: RFR(M): 8210754: print_location is not reliable enough (printing register info)

Hi Martin,

On 17/09/2018 3:41 PM, Doerr, Martin wrote:
> Hi David,
> 
> thanks for looking at my proposal.
> 
> I'm aware of that the new code accesses memory which may be mutated concurrently.
> But I'm convinced that this is far better than what we currently have. Analyzing the state of a crashed VM can never be 100% safe.

Can you summarise what the causes of the secondary errors were and how 
this additional set of checks tries to deal with that please. This looks 
like its trying to do more than just improve reliability - and some 
parts seem potentially just as unreliable (not that it may not be useful 
when it does work - though how could you tell if you walk a bad pointer 
when examining the CLDGraph?).

> I could use try_lock to improve this situation. When I get the lock, fine.
> But what should we do when the lock is held by the code which has crashed?
> I think we shouldn't wait for any lock. It's better to risk errors due to concurrent mutation which seems to be not so likely.

Definitely do not want to take locks. :)

My continual concern with the ever expanding error reporting code is 
that every change, whilst improving one scenario, potentially degrades 
others.

Cheers,
David

> Best regards,
> Martin
> 
> 
> -----Original Message-----
> From: David Holmes <david.holmes at oracle.com>
> Sent: Montag, 17. September 2018 07:07
> To: Doerr, Martin <martin.doerr at sap.com>; hotspot-runtime-dev at openjdk.java.net
> Subject: Re: RFR(M): 8210754: print_location is not reliable enough (printing register info)
> 
> Hi Martin,
> 
> On 15/09/2018 12:03 AM, Doerr, Martin wrote:
>> Hi,
>>
>> I'd like to make os::print_location more reliable which is used in error reporting step "printing register info". Oops and Klasses should get inspected more carefully.
> 
> But some of what you are doing is accessing shared state that could be
> mutated concurrently with the error reporting thread that is trying to
> read it e.g. walking the ClassLoaderDataGraph!
> 
> David
> -----
> 
>> I have seen errors like "[error occurred during error reporting (printing register info), id 0xe0000000, Internal Error (/usr/work/d056149/openjdk/jdk/src/hotspot/share/oops/klass.inline.hpp:63)]" in many hs_err files.
>>
>> Bug:
>> https://bugs.openjdk.java.net/browse/JDK-8210754
>>
>> Webrev:
>> http://cr.openjdk.java.net/~mdoerr/8210754_print_location/webrev.00/
>>
>> Sometimes, I get such errors when using -XX:+CrashGCForDumpingJavaThread, sometimes when injecting crashing code into compiled methods which I did by the following code:
>> http://cr.openjdk.java.net/~mdoerr/crash_C2_method/webrev.00/
>> I can also contribute this if it's desired. Automatic tests would certainly be nice to have. Maybe I can find some time for that.
>>
>> Please review.
>>
>> Best regards,
>> Martin
>>


More information about the hotspot-runtime-dev mailing list