Need help with ZGC failure in Lilliput

Erik Österlund erik.osterlund at oracle.com
Tue Jul 20 14:48:32 UTC 2021


Hi Roman,

I might need to catch up a bit on what exactly you have done. However, 
rather than studying your changes in detail, I'm just gonna throw out a 
few things that I know would blow up when doing something like this, 
unless you have done something to fix them.

The main theme is displaced mark words. Due to issues with displaced 
mark words, I don't currently know of any (good and performant) way of 
having any stable bits in the markWord that can be read by mutators, 
when there are displaced mark words. This is why in the generational 
version of ZGC we are currently building, we do not encode any age bits 
in the markWord. Instead it is encoded as a per-region/page property, 
which is more reliable, and can use much lighter synchronization. 
Anyway, here come a few displaced mark word issues:

1) A displaced mark word can point into an ObjectMonitor. However, the 
monitor can be concurrently deflated, and subsequently freed. The safe 
memory reclamation policy for ObjectMonitor unlinks the monitors first, 
then performs a thread-local handshake with all (...?!) threads, and 
then frees them after the handshake, when it knows that surely nobody is 
looking at these monitors any longer. Except, of course, concurrent GC 
threads do not take part in such handshakes, and therefore concurrent 
GC threads are unable to safely read klasses through displaced mark 
words pointing into ObjectMonitors that are concurrently being freed. 
This can result in use-after-free. They are basically not allowed to 
dereference an ObjectMonitor without more synchronization code to 
allow that.

2) A displaced mark word can also point into a stack lock, right into 
the stack of a concurrently running thread. Naturally, this thread can 
concurrently die, and its stack be deallocated, or concurrently mutated 
after the lock is released on that thread. In other words, the memory of 
the stack on other threads is completely unreliable. The way in which 
this works regarding hashCode, which similarly needs to be read by 
various parties, is that the stack lock is concurrently inflated into an 
inflated lock, which is then a bit more stable to read through, given 
the right sync dance. Assuming, of course, that the reading thread takes 
part in the global handshake for SMR purposes.
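
To make the two cases above concrete, here is a minimal C++ sketch of 
what a reader has to do to get at klass bits once they live in the 
header. It is a toy model, not HotSpot code: ToyMonitor, the helper 
names, and the 32-bit klass shift are made up for illustration; only 
the low tag values (0 = stack-locked, 1 = unlocked, 2 = inflated) 
mirror the markWord lock bits. The two pointer-chasing branches are 
where a concurrent GC thread would touch memory it has no guarantee 
about.

```cpp
#include <cassert>
#include <cstdint>

static_assert(sizeof(std::uintptr_t) == 8, "sketch assumes a 64-bit platform");

// Low two bits of the header select how the rest is interpreted.
constexpr std::uintptr_t kStackLocked = 0;  // header is a pointer into a thread stack
constexpr std::uintptr_t kUnlocked    = 1;  // header holds the real bits
constexpr std::uintptr_t kMonitor     = 2;  // header is a pointer to a monitor
constexpr std::uintptr_t kTagMask     = 3;
constexpr int kKlassShift = 32;             // hypothetical position of the klass bits

// Hypothetical stand-in for an inflated lock; it keeps the displaced header.
struct ToyMonitor {
  std::uintptr_t displaced_header;
};

inline std::uintptr_t make_unlocked_header(std::uint32_t klass_id) {
  return (static_cast<std::uintptr_t>(klass_id) << kKlassShift) | kUnlocked;
}

inline std::uintptr_t make_stack_locked_header(const std::uintptr_t* slot) {
  auto bits = reinterpret_cast<std::uintptr_t>(slot);
  assert((bits & kTagMask) == 0);  // the tag needs the low bits to be free
  return bits | kStackLocked;
}

inline std::uintptr_t make_inflated_header(const ToyMonitor* m) {
  auto bits = reinterpret_cast<std::uintptr_t>(m);
  assert((bits & kTagMask) == 0);
  return bits | kMonitor;
}

// Reads the klass bits, following the displaced header if there is one.
inline std::uint32_t klass_id_from_header(std::uintptr_t header) {
  switch (header & kTagMask) {
    case kUnlocked:
      // Safe: everything needed is in the word that was just loaded.
      return static_cast<std::uint32_t>(header >> kKlassShift);
    case kStackLocked: {
      // Hazard 2): this slot lives on another thread's stack, which may
      // be reused or unmapped while we read it.
      auto* slot = reinterpret_cast<const std::uintptr_t*>(header);
      return static_cast<std::uint32_t>(*slot >> kKlassShift);
    }
    case kMonitor: {
      // Hazard 1): the monitor may be deflated and freed concurrently
      // unless the reader takes part in the reclamation handshake.
      auto* m = reinterpret_cast<const ToyMonitor*>(header & ~kTagMask);
      return static_cast<std::uint32_t>(m->displaced_header >> kKlassShift);
    }
    default:
      return 0;  // tag 3 is not modelled in this sketch
  }
}
```

In a single-threaded test all three branches return the same klass id. 
The problem is only that nothing in the last two branches keeps the 
pointed-to memory alive for a reader that is outside the handshake.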

So yeah, not sure if you have thought about any of this. If not, it 
might be the issue you are chasing after. It's worth mentioning that 
Robbin Ehn is currently removing displaced mark words with his Java 
monitors work. That should make this kind of exercise easier.

Thanks,
/Erik

On 2021-07-20 13:47, Roman Kennke wrote:
> Hi ZGC devs,
> 
> I am struggling with a ZGC problem in Lilliput, and would like to ask 
> for your opinion.
> 
> I'm currently working on changing runtime oopDesc::klass() to load the 
> Klass* from the object header instead of the dedicated Klass* field:
> 
> https://github.com/openjdk/lilliput/pull/12
> 
> This required some coordination in other GCs, because it's not always 
> safe to access the object header. In particular, objects may be locked, 
> at which point we need to find the displaced header, or worst case, 
> inflate the header. I believe I've solved that in all GCs.
> 
> However, I am still getting a failure with ZGC, which is kinda 
> unexpected, because it's the only GC that is *not* messing with object 
> headers (as far as I know). If you check out the above PR, the failure 
> can easily be reproduced with:
> 
> make run-test TEST=gc/z/TestGarbageCollectorMXBean.java
> 
> (and only that test is failing for me).
> 
> The crash is in ZHeap::is_object_live() because the ZPage there turns 
> out to be NULL. I've added a bunch of debug output in that location, and 
> it looks like the offending object is always inflated *and* forwarded 
> when it happens, but I fail to see how this is related to each other, 
> and to the page being NULL. I strongly suspect that inflation of the 
> object header by calling klass() on it causes the troubles. Changing 
> back to the original implementation of oopDesc::klass() (swapping the 
> commented-out code there) makes the bug disappear.
> 
> Also, the bug always seems to happen when calling through a weak 
> barrier. Not sure if that is relevant.
> 
> Any ideas? Opinions?
> 
> Thanks,
> Roman
> 

