[15?] RFR (S): 8249192: MonitorInfo stores raw oops across safepoints

Tue Jul 21 18:00:11 UTC 2020

Hi Coleen and David,

   thanks for your reviews.

On 21.07.20 03:29, David Holmes wrote:
> Hi Thomas,
> 
> On 21/07/2020 12:49 am, Thomas Schatzl wrote:
>> Forwarding to hotspot-dev where it belongs after wrongly sending to 
>> hotspot-gc-dev.
> 
> This touches serviceability code as well so cc'ing for good measure.
> 
> Thanks for taking this one on as it wasn't actually a GC issue!

I have been looking into strange G1 crashes with changed young gen 
sizing in tier8, which looked very similar to the crashes in that 
StackWalker/LocalsAndOperands.java test (thanks to Dean Long making us 
aware of these crashes).
I guessed correctly in hindsight, that the main difference between C2 
and Graal would be GC timing, so the problem could be related. Also we 
have some P2s for 15 with the same stack trace as in this reproducer and 
the mentioned tier8 (Kitchensink, 24h dacapo) failures that never 
reproduced... given that the issue reproduced much quicker in 
LocalsAndOperands (and given its source code gives a pretty narrow area 
where to look) it seemed easier to start with this one...

>>
>> -------- Forwarded Message --------
>> Subject: [15?] RFR (S): 8249192: MonitorInfo stores raw oops across 
>> safepoints
>> Date: Mon, 20 Jul 2020 12:07:38 +0200
>> From: Thomas Schatzl <thomas.schatzl at oracle.com>
>> To: hotspot-dev at openjdk.java.net <hotspot-gc-dev at openjdk.java.net>
>>
>> Hi all,
>>
>>    can I get some reviews to handle'ize some raw oops in the 
>> MonitorInfo class?
>>
>> (Afaiu only) in LiveFrameStream::monitors_to_object_array() we try to 
>> allocate an objArray with raw oops held in the MonitorInfo class that 
>> are passed in a GrowableArray. This allocation can lead to a garbage 
>> collection, with the usual random crashes.
> 
> Right - seems so obvious now. <sigh>
> 
> Took me a while to convince myself no such similar problem was lurking 
> in the JVM TI code.
> 
>> This change changes the raw oops in MonitorInfo to Handles, 
> 
> My main concern here was whether the MonitorInfo objects are thread 
> confined. For the StackWalker API we are always dealing with the current 
> thread so that is fine. For JVM TI, in mainline, we may be executing 
> code in the calling thread or the target thread; and in older releases 
> it will be the VMThread at a safepoint. But it seems that the 
> MonitorInfo's are confined to whichever thread that is, and so Handle 
> usage is safe.

> 
>> and adds a few HandleMarks along the way to make these handles go away 
>> asap.
> 
> That, and the ResourceMark changes, were a bit hard to follow. Basically 
> a HandleMark is now present in the scope of, or just above, the call to 
> monitors(). The need for the additional ResourceMarks is far from clear 
> though. In particular I wonder if the RM introduced in 
> Deoptimization::revoke_from_deopt_handler interacts with the special 
> DeoptResourceMark in its caller 
> Deoptimization::fetch_unroll_info_helper? (I have no idea what a 
> DeoptResourceMark is.)

The DeoptResourceMark in this case seems to act just like a regular 
ResourceMark, using the thread's resource area.

Other than acting like a ResourceMark, the DeoptResourceMark only seems 
to be an indicator used in some asserts to verify that no further 
deoptimization is running.

Looking at the called BiasedLocking::revoke_own_lock(), it adds its own 
(regular) ResourceMark quite early (further indicating that normal 
ResourceMarks and DeoptResourceMarks should be "compatible"), meaning 
that only any resource object allocated in the resource area between the 
added ResourceMark in Deoptimization::revoke_from_deopt_handler() and 
the existing one in BiasedLocking::revoke_own_lock() would have a 
different lifetime than before. There is no resource object allocation 
in there, actually the only thing that happens is unpacking the contents 
of the passed handle.

The objects_to_revoke array in Deoptimization::revoke_from_deopt_handler 
does not escape the method too.

I was much more worried about the caching of a 
GrowableArray<MonitorInfo> in Thread::cached_monitor_info going on 
during biased locking...

> 
>>
>> This issue has been introduced in JDK-8140450: Implement Stack-Walking 
>> API in jdk9.
>>
>> The CR has been triaged as P3, but I would like to ask whether it 
>> might be good to increase its priority to P2 and apply for inclusion 
>> in 15. My arguments are as follows:
>>
>> - the original issue why I started looking at this were lots of 
>> seemingly random crashes (5 or 6 were reported and the change 
>> temporarily backed out for this reason) in tier8 with a g1 change that 
>> changed young gen sizing. These crashes including that young gen 
>> sizing change are all gone now with this bugfix.
>> I.e. this suggests that so far we seem to have not encountered this 
>> issue more frequently due to pure luck wrt to generation sizing.
>>
>> - it affects all collectors (naturally).
>>
>> - there are quite a few user reported random crashes with IntelliJ and 
>> variants, which due to the nature of IDEs tending to retrieve stack 
>> traces fairly frequently would be more affected than usual. So I 
>> suspect at least some of them to be caused by this issue, these are 
>> the only raw oops I am aware of.
>>
>> My understanding of the cause and fix is fairly good, but I am no 
>> expert in this area, so I would like to defer to you about this 
>> suggestion. The change is imo important enough to be backported to 11 
>> and 15 anyway, but the question is about the risk/reward tradeoff wrt 
>> to bringing it to 15 and not 15.0.1.
> 
> I'd classify this as a P2 without doubt. As Dan noted there is no 
> workaround as such.
> 
>> CR:
>> https://bugs.openjdk.java.net/browse/JDK-8249192
>> Webrev:
>> http://cr.openjdk.java.net/~tschatzl/8249192/webrev/
> 
> src/hotspot/share/runtime/deoptimization.cpp
> 
> The code in collect_monitors takes the monitor owner oop and Handelises 
> it to add to its own GrowableArray of Handles. Is it worth exposing the 
> MonitorInfo owner() in Handle form to avoid this unwrapping and 
> re-wrapping?

We talked a bit about this internally and came to the conclusion to 
initially provide a small fix for 15 and 16, and do that suggested 
refactoring in 16 only.

> 
> src/hotspot/share/runtime/vframe.hpp
> 
> I agree with Coleen that the MonitorInfo constructor should not take a 
> Thread* but should itself materialize and use Thread::current().

Fixed.

New webrevs (jdk16):

http://cr.openjdk.java.net/~tschatzl/8249192/webrev.0_to_1/ (diff)
http://cr.openjdk.java.net/~tschatzl/8249192/webrev.1/ (full)

jdk15:

http://cr.openjdk.java.net/~tschatzl/8249192/webrev.jdk15.1/

The only difference is that JDK-8247729 in 16 changed a

ResourceMark rm;

in jdk15 (jvmtiEnvBase.cpp:1029) from

ResourceMark rm(current_thread);

in jdk16 (jvmtiEnvBase.cpp:1008)

which gives a merge error now. See also 
https://hg.openjdk.java.net/jdk/jdk/rev/f8a9be0f9e1a#l2.82 .

Started another tier1-5 run for jdk15; both versions passed 1.2k 
iterations of the LocalsAndOperands.java test again.

Thanks,
   Thomas