Need help with ZGC failure in Lilliput
Erik Österlund
erik.osterlund at oracle.com
Tue Jul 20 14:48:32 UTC 2021
Hi Roman,
Might need to catch up a bit with what you have done in detail. However,
rather than studying your changes in detail, I'm just gonna throw out a
few things that I know would blow up when doing something like this,
unless you have done something to fix that.
The main theme is displaced mark words. Due to issues with displaced
mark words, I don't currently know of any (good and performant) way of
having any stable bits in the markWord that can be read by mutators,
when there are displaced mark words. This is why in the generational
version of ZGC we are currently building, we do not encode any age bits
in the markWord. Instead it is encoded as a per-region/page property,
which is more reliable, and can use much lighter synchronization.
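
To illustrate the per-page idea, here is a minimal sketch. It is not ZGC code: the PageAgeTable type, the 2 MiB page shift, and the method names are all assumptions made for the example. The point is only that the age lookup never touches the object header, so it cannot race with locking or monitor deflation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical page granularity for this sketch: 2 MiB pages.
constexpr unsigned kPageShift = 21;

// Hypothetical side table: one age value per page, indexed by heap offset.
struct PageAgeTable {
  std::vector<uint8_t> ages;

  explicit PageAgeTable(size_t page_count) : ages(page_count, 0) {}

  // Age of whatever object lives at heap_offset. Reads only the side table,
  // never the object's mark word.
  uint8_t age_of(uintptr_t heap_offset) const {
    size_t page = heap_offset >> kPageShift;
    assert(page < ages.size());
    return ages[page];
  }

  // Ageing is done once per page rather than once per object.
  void set_page_age(size_t page, uint8_t age) {
    assert(page < ages.size());
    ages[page] = age;
  }
};
```

Every object on a page shares the page's age, which is the trade-off that buys the lighter synchronization.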
Anyway, here come a few displaced mark word issues:
1) A displaced mark word can point into an ObjectMonitor. However, the
monitor can be concurrently deflated, and subsequently freed. The safe
memory reclamation policy for ObjectMonitor unlinks the monitors first,
then performs a thread-local handshake with all (...?!) threads, and
then frees them after the handshake, when it knows that surely nobody is
looking at these monitors any longer. Except of course concurrent GC
threads do not take part in such handshakes, and therefore, concurrent
GC threads are suddenly unable to safely read klasses, through displaced
mark words, pointing into concurrently freeing ObjectMonitors. It can
result in use-after-free. They are basically not allowed to dereference
ObjectMonitor, without more synchronization code to allow that.
2) A displaced mark word can also point into a stack lock, right into
the stack of a concurrently running thread. Naturally, this thread can
concurrently die, and its stack be deallocated, or concurrently mutated
after the lock is released on that thread. In other words, the memory of
the stack on other threads is completely unreliable. The way in which
this works regarding hashCode, which similarly needs to be read by
various parties, is that the stack lock is concurrently inflated into an
inflated lock, which is then a bit more stable to read through, given
the right sync dance. Assuming, of course, that the reading thread takes
part in the global handshake for SMR purposes.
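
The two cases above can be sketched as a small classifier over the low two lock bits of the mark word. This is a simplified model, not HotSpot code: the tag values match the legacy markWord layout (00 stack-locked, 01 unlocked, 10 inflated monitor, 11 marked), biased locking is ignored, and the names HeaderLocation, classify, and displaced_header_holder are made up for the example.

```cpp
#include <cassert>
#include <cstdint>

// Where the real header (and therefore the Klass* in Lilliput) lives.
enum class HeaderLocation {
  Inline,           // unlocked: header is in the object itself
  OnThreadStack,    // stack-locked: displaced into another thread's stack
  InObjectMonitor,  // inflated: displaced into an ObjectMonitor
  Marked            // GC-marked / forwarded
};

constexpr uintptr_t kLockMask = 0x3;

constexpr HeaderLocation classify(uintptr_t mark) {
  switch (mark & kLockMask) {
    case 0x0: return HeaderLocation::OnThreadStack;
    case 0x1: return HeaderLocation::Inline;
    case 0x2: return HeaderLocation::InObjectMonitor;
    default:  return HeaderLocation::Marked;
  }
}

// Address of the structure holding the displaced header. Only meaningful
// for the two displaced cases; dereferencing the result is exactly what a
// thread outside the handshake protocol cannot do safely.
inline uintptr_t displaced_header_holder(uintptr_t mark) {
  HeaderLocation loc = classify(mark);
  assert(loc == HeaderLocation::OnThreadStack ||
         loc == HeaderLocation::InObjectMonitor);
  return mark & ~kLockMask;
}

// In this model, a concurrent GC thread may trust the header bits only
// when nothing is displaced.
constexpr bool readable_without_handshake(uintptr_t mark) {
  return classify(mark) == HeaderLocation::Inline;
}
```

In the two displaced cases the pointer can be computed, but following it needs the extra synchronization described in 1) and 2).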
So yeah, not sure if you have thought about any of this. If not, it
might be the issue you are chasing after. It's worth mentioning that
Robbin Ehn is currently removing displaced mark words with his Java
monitors work. That should make this kind of exercise easier.
Thanks,
/Erik
On 2021-07-20 13:47, Roman Kennke wrote:
> Hi ZGC devs,
>
> I am struggling with a ZGC problem in Lilliput, and would like to ask
> for your opinion.
>
> I'm currently working on changing runtime oopDesc::klass() to load the
> Klass* from the object header instead of the dedicated Klass* field:
>
> https://github.com/openjdk/lilliput/pull/12
>
> This required some coordination in other GCs, because it's not always
> safe to access the object header. In particular, objects may be locked,
> at which point we need to find the displaced header, or worst case,
> inflate the header. I believe I've solved that in all GCs.
>
> However, I am still getting a failure with ZGC, which is kinda
> unexpected, because it's the only GC that is *not* messing with object
> headers (as far as I know). If you check out the above PR, the failure
> can easily be reproduced with:
>
> make run-test TEST=gc/z/TestGarbageCollectorMXBean.java
>
> (and only that test is failing for me).
>
> The crash is in ZHeap::is_object_live() because the ZPage there turns
> out to be NULL. I've added a bunch of debug output in that location, and
> it looks like the offending object is always inflated *and* forwarded
> when it happens, but I fail to see how this is related to each other,
> and to the page being NULL. I strongly suspect that inflation of the
> object header by calling klass() on it causes the troubles. Changing
> back to the original implementation of oopDesc::klass() (swap the
> commented-out code there) makes the bug disappear.
>
> Also, the bug always seems to happen when calling through a weak
> barrier. Not sure if that is relevant.
>
> Any ideas? Opinions?
>
> Thanks,
> Roman
>