RFR(M): 8210754: print_location is not reliable enough (printing register info)

Mon Sep 17 06:54:00 UTC 2018

Hi Martin,

On 17/09/2018 3:41 PM, Doerr, Martin wrote:
> Hi David,
> 
> thanks for looking at my proposal.
> 
> I'm aware that the new code accesses memory which may be mutated concurrently.
> But I'm convinced that this is far better than what we currently have. Analyzing the state of a crashed VM can never be 100% safe.

Can you please summarise what caused the secondary errors and how 
this additional set of checks tries to deal with them? This looks 
like it's trying to do more than just improve reliability, and some 
parts seem potentially just as unreliable (not that it wouldn't be 
useful when it does work, though how could you tell whether you have 
walked a bad pointer when examining the CLDGraph?).

> I could use try_lock to improve this situation. When I get the lock, fine.
> But what should we do when the lock is held by the code which has crashed?
> I think we shouldn't wait for any lock. It's better to risk errors due to concurrent mutation, which seems unlikely.

Definitely do not want to take locks. :)
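
For reference, the try_lock option raised above can be sketched roughly as follows. This is only an illustration of the trade-off, not HotSpot code: std::mutex stands in for a VM-internal lock, and graph_lock and describe_location are hypothetical names made up for this sketch. The point is that the reporting path never blocks, and falls back to unlocked, best-effort inspection when the lock is unavailable (for example because the crashed thread holds it).

```cpp
#include <cassert>
#include <mutex>
#include <string>

// Hypothetical stand-in for a VM-internal lock; not a HotSpot type.
static std::mutex graph_lock;

// Hypothetical error-reporting helper. It never waits for the lock:
// if the crashed thread holds it, the lock will never be released,
// so a blocking acquire would hang the error reporter.
std::string describe_location(bool* got_lock) {
  assert(got_lock != nullptr);
  std::unique_lock<std::mutex> guard(graph_lock, std::try_to_lock);
  *got_lock = guard.owns_lock();
  if (guard.owns_lock()) {
    // Lock acquired: shared state cannot be mutated by lock holders.
    return "inspected under lock";
  }
  // Lock unavailable: proceed anyway and accept the risk of
  // reading concurrently mutated state.
  return "inspected without lock (best effort)";
}
```

Even in this form the reporter still touches a lock, which is exactly what the reply above argues against; dropping the try_lock and always taking the best-effort path corresponds to the approach in the webrev.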

My continual concern with the ever-expanding error reporting code is 
that every change, whilst improving one scenario, potentially degrades 
others.

Cheers,
David

> Best regards,
> Martin
> 
> 
> -----Original Message-----
> From: David Holmes <david.holmes at oracle.com>
> Sent: Monday, 17 September 2018 07:07
> To: Doerr, Martin <martin.doerr at sap.com>; hotspot-runtime-dev at openjdk.java.net
> Subject: Re: RFR(M): 8210754: print_location is not reliable enough (printing register info)
> 
> Hi Martin,
> 
> On 15/09/2018 12:03 AM, Doerr, Martin wrote:
>> Hi,
>>
>> I'd like to make os::print_location, which is used in the error reporting step "printing register info", more reliable. Oops and Klasses should be inspected more carefully.
> 
> But some of what you are doing is accessing shared state that could be
> mutated concurrently with the error reporting thread that is trying to
> read it, e.g. by walking the ClassLoaderDataGraph!
> 
> David
> -----
> 
>> I have seen errors like "[error occurred during error reporting (printing register info), id 0xe0000000, Internal Error (/usr/work/d056149/openjdk/jdk/src/hotspot/share/oops/klass.inline.hpp:63)]" in many hs_err files.
>>
>> Bug:
>> https://bugs.openjdk.java.net/browse/JDK-8210754
>>
>> Webrev:
>> http://cr.openjdk.java.net/~mdoerr/8210754_print_location/webrev.00/
>>
>> Sometimes I get such errors when using -XX:+CrashGCForDumpingJavaThread, and sometimes when injecting crashing code into compiled methods, which I did with the following code:
>> http://cr.openjdk.java.net/~mdoerr/crash_C2_method/webrev.00/
>> I can also contribute this if it's desired. Automatic tests would certainly be nice to have. Maybe I can find some time for that.
>>
>> Please review.
>>
>> Best regards,
>> Martin
>>