RFR[13]: 8224531: SEGV while collecting Klass statistics

Wed Jul 3 12:50:02 UTC 2019

On 7/3/19 8:48 AM, Erik Österlund wrote:
> Hi Coleen,
>
> Thanks for the review.
>
> On 2019-07-03 14:10, coleen.phillimore at oracle.com wrote:
>>
>> http://cr.openjdk.java.net/~eosterlund/8224531/webrev.00/src/hotspot/share/memory/heapInspection.cpp.frames.html 
>>
>>
>> There's another object_iterate() in this file with a comment to 
>> change it to safe_object_iterate().  Should you change that too?
>
> It probably should change, but it does not seem to have the same 
> behaviour and I don't know if there is a real bug in that unrelated 
> code. But yeah I would prefer that to change too, but I think that 
> should be a separate change.

That's fine.
>
>> Did you run the jvmti tests?  There used to be tests that failed if 
>> dead objects weren't found, but the tests may have been fixed.
>
> I will take it for a spin. Note though that behaviour relying on 
> always getting dead objects will fail; the caller of the API will race 
> with concurrent sweeping and either get or not get the dead objects 
> depending on whether the heap iteration happened to kick in before or 
> after sweeping. There is just no way you can rely on that. So if there 
> is a test failure because of that, the test is wrong. Nevertheless, I 
> will try to hunt down such tests.
>

Yes, please.

Thanks,
Coleen
> Thanks,
> /Erik
>
>> The rest of the change looks good.  Thank you for figuring this out!
>> Coleen
>>
>> On 7/3/19 6:15 AM, Erik Österlund wrote:
>>> Hi,
>>>
>>> The heap inspection API performs pointer chasing through dead 
>>> memory. In particular, it populates a table with classes based on a 
>>> heap walk using CollectedHeap::object_iterate(). That API can give 
>>> you dead objects as well, whereas 
>>> CollectedHeap::safe_object_iterate() gives you only live objects.
>>> The possibly dead objects have their possibly dead Klass* recorded. 
>>> Then we call Klass::collect_statistics on said possibly dead 
>>> recorded classes. In there, we follow the possibly dead mirror oop 
>>> of the possibly dead Klass, and subsequently read the Klass* of the 
>>> possibly dead mirror.
>>> Here is an attempted ASCII art explaining this
>>>
>>> | Klass*      | -> |            |
>>> | hi I am a   |    | hi I am a  |
>>> | dead object |    | dead Klass |
>>>                    |            |
>>>                    | oop mirror | -> | Klass*      | -> | Class klass |
>>>                                      | hi I am a   |
>>>                                      | dead mirror |
>>>
>>> So as you can see we pointer chase through this chain of possibly 
>>> dead memory. What could possibly go wrong though?
>>> In CMS a crash can manifest in the following way:
>>>
>>> In a concurrent collection, both the object and its class die. They 
>>> are dead once we reach final marking. But the memory is kept around 
>>> and being swept by concurrent sweeping.
>>> The sweeping yields to young collections (controllable through the 
>>> CMSYield flag). So what can happen is that the mirror is swept 
>>> (which already is a problem in debug builds because we zap the free 
>>> chunk of memory, but hold on there is a problem in in product builds 
>>> too) and gets added to a free list. In the yielded safepoint we may 
>>> perform a young collection that promotes objects to the memory of 
>>> the freed chunk (where the mirror used to be, except due to 
>>> coalescing of freed chunks, there might not be a Klass* pointer 
>>> where there used to be one for the mirror). And then, before 
>>> sweeping finishes, the heap inspection API is called. That API 
>>> sometimes tries to perform a STW GC first to get only live objects, 
>>> but that GC may fail because of the JNI gc locker. And then it just 
>>> goes ahead calling the unsafe object_iterate API anyway. The 
>>> object_iterate API will pass in the object to the closure but not 
>>> the mirror as it has been freed and reused. Buuut... since we 
>>> pointer chase through the dead object to the stale reference to the 
>>> dead mirror, we eventually find ourselves in an awkward situation 
>>> where we try to read and use a Klass* that might really be a 
>>> primitive vaue now (read: crash).
>>>
>>> The general rule of thumb is that pointer chasing through dead 
>>> memory should NOT be done. We allow it in a few rare situations with 
>>> the following constraints: 1) You have to use AS_NO_KEEPALIVE when 
>>> reading dead oops, or things can blow up, 2) You may only read dead 
>>> oops if you are the GC and hence can control that the memory it 
>>> points at has not and will not be freed until your read finishes 
>>> (...because you are the GC).
>>> Neither of these two constraints hold here. We read the mirrors 
>>> without AS_NO_KEEPALIVE in the pointer chase, and we can not control 
>>> that the memory it points at has not been freed. Therefore, this is 
>>> an invalid use of pointer chasing through dead memory. The fix is 
>>> simple: use the safe_object_iterate API instead, which only hands 
>>> out live objects. I also sprinkled in no_keepalive decorators on the 
>>> mirrors because it's good practice to not use that for such use 
>>> cases where you clobber the whole heap (causing it to be marked in 
>>> ZGC) but really just read an some int or something from the oop, 
>>> without publishing any references to the oop.
>>>
>>> I tested this with 100 kitchensink iterations without my fix (failed 
>>> 2 times) and 100 kitchensink iterations with my fix (failed 0 times).
>>>
>>> Bug:
>>> https://bugs.openjdk.java.net/browse/JDK-8224531
>>>
>>> Webrev:
>>> http://cr.openjdk.java.net/~eosterlund/8224531/webrev.00/
>>>
>>> Thanks,
>>> /Erik
>>
>