Need help with ZGC failure in Lilliput
rkennke at redhat.com
Tue Jul 20 16:21:56 UTC 2021
yes I have thought about this, but I am not sure if what I do is enough.
I'm basically following the logic that is implemented for hash-code: if
we encounter a stack-lock in the current thread, we can directly load
the displaced-header from it, otherwise it inflates the lock. I guess
that this is not really going perfect with GCs, because it means that,
e.g., concurrent marking or relocation would inflate some locks.
The trouble might indeed be that GC threads would not be allowed to do
that because of concurrent deflation. Hrmpf. Is there a way to prevent
deflation during certain GC phases maybe? Or coordinate GC threads with
Robbin Ehn's work sounds promising. Can you give me more details about
what he's up to? Maybe a PR?
> Hi Roman,
> Might need to catch up a bit with what you have done in detail. However,
> rather than studying your changes in detail, I'm just gonna throw out a
> few things that I know would blow up when doing something like this,
> unless you have done something to fix that.
> The main theme is displaced mark words. Due to issues with displaced
> mark words, I don't currently know of any (good and performant) way of
> having any stable bits in the markWord that can be read by mutators,
> when there are displaced mark words. This is why in the generational
> version of ZGC we are currently building, we do not encode any age bits
> in the markWord. Instead it is encoded as a per-region/page property,
> which is more reliable, and can use much lighter synchronization.
> Anyway, here comes a few displaced mark word issues:
> 1) A displaced mark word can point into an ObjectMonitor. However, the
> monitor can be concurrently deflated, and subsequently freed. The safe
> memory reclamation policy for ObjectMonitor unlinks the monitors first,
> then performs a thread-local handshake with all (...?!) threads, and
> then frees them after the handshake, when it knows that surely nobody is
> looking at these monitors any longer. Except of course concurrent GC
> threads do not take part in such handshakes, and therefore, concurrent
> GC threads are suddenly unable to safely read klasses, through displaced
> mark words, pointing into concurrently freeing ObjectMonitors. It can
> result in use-after-free. They are basically not allowed to dereference
> ObjectMonitor, without more synchronization code to allow that.
> 2) A displaced mark word can also point into a stack lock, right into
> the stack of a concurrently running thread. Naturally, this thread can
> concurrently die, and its stack be deallocated, or concurrently mutated
> after the lock is released on that thread. In other words, the memory of
> the stack on other threads is completely unreliable. The way in which
> this works regarding hashCode, which similarly needs to be read by
> various parties, is that the stack lock is concurrently inflated into an
> inflated lock, which is then a bit more stable to read through, given
> the right sync dance. Assuming of course, that the reading thread, takes
> part in the global handshake for SMR purposes.
> So yeah, not sure if you have thought about any of this. If not, it
> might be the issue you are chasing after. It's worth mentioning that
> Robbin Ehn is currently removing displaced mark words with his Java
> monitors work. That should make this kind of exercise easier.
> On 2021-07-20 13:47, Roman Kennke wrote:
>> Hi ZGC devs,
>> I am struggling with a ZGC problem in Lilliput, and would like to ask
>> for your opinion.
>> I'm currently working on changing runtime oopDesc::klass() to load the
>> Klass* from the object header instead of the dedicated Klass* field:
>> This required some coordination in other GCs, because it's not always
>> safe to access the object header. In particular, objects may be
>> locked, at which point we need to find the displaced header, or worst
>> case, inflate the header. I believe I've solved that in all GCs.
>> However, I am still getting a failure with ZGC, which is kinda
>> unexpected, because it's the only GC that is *not* messing with object
>> headers (as far as I know. If you check out the above PR, the failure
>> can easily reproduced with:
>> make run-test TEST=gc/z/TestGarbageCollectorMXBean.java
>> (and only that test is failing for me).
>> The crash is in ZHeap::is_object_live() because the ZPage there turns
>> out to be NULL. I've added a bunch of debug output in that location,
>> and it looks like the offending object is always inflated *and*
>> forwarded when it happens, but I fail to see how this is related to
>> each other, and to the page being NULL. I strongly suspect that
>> inflation of the object header by calling klass() on it causes the
>> troubles. Changing back to original implementation of oopDesc::klass()
>> (swap commented-out-code there) makes the bug disappear.
>> Also, the bug always seems to happen when calling through a weak
>> barrier. Not sure if that is relevant.
>> Any ideas? Opinions?
More information about the zgc-dev