RFR[13]: 8224531: SEGV while collecting Klass statistics

Wed Jul 3 10:15:47 UTC 2019

Hi,

The heap inspection API performs pointer chasing through dead memory. In 
particular, it populates a table with classes based on a heap walk using 
CollectedHeap::object_iterate(). That API can give you dead objects as 
well, whereas CollectedHeap::safe_object_iterate() gives you only live 
objects.
The possibly dead objects have their possibly dead Klass* recorded. Then 
we call Klass::collect_statistics on said possibly dead recorded 
classes. In there, we follow the possibly dead mirror oop of the 
possibly dead Klass, and subsequently read the Klass* of the possibly 
dead mirror.
Here is an attempted ASCII art explaining this

| Klass*      | -> |            |
| hi I am a   |    | hi I am a  |
| dead object |    | dead Klass |
                    |            |
                    | oop mirror | -> | Klass*      | -> | Class klass |
                                      | hi I am a   |
                                      | dead mirror |

So as you can see we pointer chase through this chain of possibly dead 
memory. What could possibly go wrong though?
In CMS a crash can manifest in the following way:

In a concurrent collection, both the object and its class die. They are 
dead once we reach final marking. But the memory is kept around and 
being swept by concurrent sweeping.
The sweeping yields to young collections (controllable through the 
CMSYield flag). So what can happen is that the mirror is swept (which 
already is a problem in debug builds because we zap the free chunk of 
memory, but hold on there is a problem in in product builds too) and 
gets added to a free list. In the yielded safepoint we may perform a 
young collection that promotes objects to the memory of the freed chunk 
(where the mirror used to be, except due to coalescing of freed chunks, 
there might not be a Klass* pointer where there used to be one for the 
mirror). And then, before sweeping finishes, the heap inspection API is 
called. That API sometimes tries to perform a STW GC first to get only 
live objects, but that GC may fail because of the JNI gc locker. And 
then it just goes ahead calling the unsafe object_iterate API anyway. 
The object_iterate API will pass in the object to the closure but not 
the mirror as it has been freed and reused. Buuut... since we pointer 
chase through the dead object to the stale reference to the dead mirror, 
we eventually find ourselves in an awkward situation where we try to 
read and use a Klass* that might really be a primitive vaue now (read: 
crash).

The general rule of thumb is that pointer chasing through dead memory 
should NOT be done. We allow it in a few rare situations with the 
following constraints: 1) You have to use AS_NO_KEEPALIVE when reading 
dead oops, or things can blow up, 2) You may only read dead oops if you 
are the GC and hence can control that the memory it points at has not 
and will not be freed until your read finishes (...because you are the GC).
Neither of these two constraints hold here. We read the mirrors without 
AS_NO_KEEPALIVE in the pointer chase, and we can not control that the 
memory it points at has not been freed. Therefore, this is an invalid 
use of pointer chasing through dead memory. The fix is simple: use the 
safe_object_iterate API instead, which only hands out live objects. I 
also sprinkled in no_keepalive decorators on the mirrors because it's 
good practice to not use that for such use cases where you clobber the 
whole heap (causing it to be marked in ZGC) but really just read an some 
int or something from the oop, without publishing any references to the oop.

I tested this with 100 kitchensink iterations without my fix (failed 2 
times) and 100 kitchensink iterations with my fix (failed 0 times).

Bug:
https://bugs.openjdk.java.net/browse/JDK-8224531

Webrev:
http://cr.openjdk.java.net/~eosterlund/8224531/webrev.00/

Thanks,
/Erik