RFR(M): 8210754: print_location is not reliable enough (printing register info)

Sun Sep 23 13:01:56 UTC 2018

Hi Martin,

Sorry for the delay - I'm travelling at the moment.

On 18/09/2018 3:28 AM, Doerr, Martin wrote:
> Hi David,
> 
> I guess you wanted to have a more complete description of what my change does. Here it is:
> - The main part of it is to replace the very poor is_oop check by much more reliable checks to determine if we can dump a Java object:
>    - Check if the required fields point into the expected memory regions.
>    - Check if these regions are still alive.
> - Add support for compressed oops and class pointers which need to get decoded first. (Raw decoding without assertions.)
> - Move the lengthy code blob dumping code out of os.cpp (unchanged).
> - Add some comments and line breaks to improve readability.

Thanks for the extra info. I do get concerned that we try to do far too 
much in the error handler, and that we risk losing information if we hit 
secondary errors.

The details of this need to be reviewed by others as I don't know the 
code well enough to spot any issues. I've cc'd Coleen to see if she can 
take a look (thanks Coleen!).

One thing I did spot, in vmErrorHelper::is_readable_range should the 4*K 
actually be obtained from some page-size function?

Thanks,
David
-----

> Best regards,
> Martin
> 
> 
> -----Original Message-----
> From: Doerr, Martin
> Sent: Montag, 17. September 2018 09:57
> To: 'David Holmes' <david.holmes at oracle.com>; hotspot-runtime-dev at openjdk.java.net
> Subject: RE: RFR(M): 8210754: print_location is not reliable enough (printing register info)
> 
> Hi David,
> 
> the kind of errors which I'd like to address are basically during the process of dumping Java objects.
> 
> The current "is_oop" check assumes almost every Java heap address to be a valid oop. This check is not sufficient for dumping it.
> Please note that the dumping code also accesses the metadata which may get modified concurrently. So my changes don't really make that worse.
> 
> I'm assuming that Java heap modifications happen far more often than modifications of the small piece of the metaspace which needs to get accessed in order to check a few Java objects.
> A crash in the iteration due to concurrent structural modification sounds like a very unlikely situation in which we might still get "error occurred during error reporting". IMHO nothing which sounds worse than what we currently have.
> 
> Best regards,
> Martin
> 
> 
> -----Original Message-----
> From: David Holmes <david.holmes at oracle.com>
> Sent: Montag, 17. September 2018 08:54
> To: Doerr, Martin <martin.doerr at sap.com>; hotspot-runtime-dev at openjdk.java.net
> Subject: Re: RFR(M): 8210754: print_location is not reliable enough (printing register info)
> 
> Hi Martin,
> 
> On 17/09/2018 3:41 PM, Doerr, Martin wrote:
>> Hi David,
>>
>> thanks for looking at my proposal.
>>
>> I'm aware of that the new code accesses memory which may be mutated concurrently.
>> But I'm convinced that this is far better than what we currently have. Analyzing the state of a crashed VM can never be 100% safe.
> 
> Can you summarise what the causes of the secondary errors were and how
> this additional set of checks tries to deal with that please. This looks
> like its trying to do more than just improve reliability - and some
> parts seem potentially just as unreliable (not that it may not be useful
> when it does work - though how could you tell if you walk a bad pointer
> when examining the CLDGraph?).
> 
>> I could use try_lock to improve this situation. When I get the lock, fine.
>> But what should we do when the lock is held by the code which has crashed?
>> I think we shouldn't wait for any lock. It's better to risk errors due to concurrent mutation which seems to be not so likely.
> 
> Definitely do not want to take locks. :)
> 
> My continual concern with the ever expanding error reporting code is
> that every change, whilst improving one scenario, potentially degrades
> others.
> 
> Cheers,
> David
> 
>> Best regards,
>> Martin
>>
>>
>> -----Original Message-----
>> From: David Holmes <david.holmes at oracle.com>
>> Sent: Montag, 17. September 2018 07:07
>> To: Doerr, Martin <martin.doerr at sap.com>; hotspot-runtime-dev at openjdk.java.net
>> Subject: Re: RFR(M): 8210754: print_location is not reliable enough (printing register info)
>>
>> Hi Martin,
>>
>> On 15/09/2018 12:03 AM, Doerr, Martin wrote:
>>> Hi,
>>>
>>> I'd like to make os::print_location more reliable which is used in error reporting step "printing register info". Oops and Klasses should get inspected more carefully.
>>
>> But some of what you are doing is accessing shared state that could be
>> mutated concurrently with the error reporting thread that is trying to
>> read it e.g. walking the ClassLoaderDataGraph!
>>
>> David
>> -----
>>
>>> I have seen errors like "[error occurred during error reporting (printing register info), id 0xe0000000, Internal Error (/usr/work/d056149/openjdk/jdk/src/hotspot/share/oops/klass.inline.hpp:63)]" in many hs_err files.
>>>
>>> Bug:
>>> https://bugs.openjdk.java.net/browse/JDK-8210754
>>>
>>> Webrev:
>>> http://cr.openjdk.java.net/~mdoerr/8210754_print_location/webrev.00/
>>>
>>> Sometimes, I get such errors when using -XX:+CrashGCForDumpingJavaThread, sometimes when injecting crashing code into compiled methods which I did by the following code:
>>> http://cr.openjdk.java.net/~mdoerr/crash_C2_method/webrev.00/
>>> I can also contribute this if it's desired. Automatic tests would certainly be nice to have. Maybe I can find some time for that.
>>>
>>> Please review.
>>>
>>> Best regards,
>>> Martin
>>>