Need help with ZGC failure in Lilliput
rkennke at redhat.com
Tue Jul 20 19:08:47 UTC 2021
Alright, I disabled monitor deflation, and the problem persists.
I narrowed the problem down to two (closely related) calls to
ZUtils::object_size() (which in turn calls oopDesc::size() which calls
oopDesc::klass()), which seem to trigger it: ZLiveMap::iterate() and
ZRelocateClosure::relocate_object(). When I change those two calls to
use the Klass* from the dedicated field, the failure disappears. I don't
think that the actual Klass* disagrees (I have asserts to check that),
which means it can only be the side effect of that call -- most likely
the header inflation. How this inflation would affect ZGC is still not
entirely clear to me.
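
The side-effect hypothesis above can be pictured with a small toy model. This is only a sketch, not HotSpot code: ToyObject, ToyMonitor, klass_via_header() and klass_via_field() are hypothetical names, and the model assumes that reading the class through the header of a stack-locked object inflates the lock first. Both paths return the same Klass*, but only the header path mutates the object.

```cpp
#include <cassert>
#include <memory>

// Toy model, NOT HotSpot code. All names below are hypothetical.
struct Klass { const char* name; };

struct ToyMonitor {
  Klass* displaced_klass;  // class info moved out of the header on inflation
};

struct ToyObject {
  Klass* header_klass;                  // class bits kept in the header while unlocked
  Klass* field_klass;                   // dedicated Klass* field
  bool stack_locked;                    // true while another thread holds a stack lock
  std::unique_ptr<ToyMonitor> monitor;  // non-null once the lock is inflated
};

static int g_inflations = 0;  // counts the side effects of header reads

// Reads the class via the header. For a stack-locked object this model
// inflates the lock first, which changes the object's state.
Klass* klass_via_header(ToyObject& o) {
  if (o.monitor) {
    return o.monitor->displaced_klass;
  }
  if (o.stack_locked) {
    o.monitor.reset(new ToyMonitor{o.header_klass});
    o.stack_locked = false;
    ++g_inflations;
    return o.monitor->displaced_klass;
  }
  return o.header_klass;
}

// Reads the class via the dedicated field. No side effects.
Klass* klass_via_field(const ToyObject& o) {
  assert(o.field_klass != nullptr);
  return o.field_klass;
}
```

In this model, asserting that the two Klass* values agree passes either way, so such asserts alone cannot rule out the inflation as the trigger.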
The problem always happens during weak storage scan. Inflation of a lock
creates a new weak handle that points back to the object. I suspect that,
at the very least, this must not point to the old copy of that object. I
need to think this through.
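
One way to picture the suspected scenario is the toy model below. It is a sketch under strong assumptions, not ZGC code: ToyHeap, ToyPage, inflate_lock() and relocate() are hypothetical names, and the model deliberately ignores forwarding tables and barrier healing. It only shows that a weak handle recorded against the old copy of an object can end up with no page behind it once that copy has been relocated.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Toy model, NOT ZGC code. All names here are hypothetical.
struct ToyPage { int id; };

struct ToyHeap {
  std::unordered_map<std::uintptr_t, ToyPage*> page_table;  // object address -> page
  std::vector<std::uintptr_t> weak_handles;                 // addresses held weakly

  // Stand-in for lock inflation: records a weak handle back to the object.
  void inflate_lock(std::uintptr_t obj_addr) {
    weak_handles.push_back(obj_addr);
  }

  // Stand-in for relocation: the old address is no longer backed by a page.
  void relocate(std::uintptr_t from, std::uintptr_t to, ToyPage* to_page) {
    assert(page_table.count(from) == 1);
    page_table.erase(from);
    page_table[to] = to_page;
  }

  // Returns the page backing an address, or nullptr if there is none.
  ToyPage* page_for(std::uintptr_t addr) const {
    auto it = page_table.find(addr);
    return it == page_table.end() ? nullptr : it->second;
  }

  // Counts weak handles whose target address has no page anymore.
  int count_stale_weak_handles() const {
    int stale = 0;
    for (std::uintptr_t addr : weak_handles) {
      if (page_for(addr) == nullptr) {
        ++stale;
      }
    }
    return stale;
  }
};
```

If the handle had been recorded against the new address instead, the scan in this model would find a page for it.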
> Hi Erik,
> yes I have thought about this, but I am not sure if what I do is enough.
> I'm basically following the logic that is implemented for hash-code: if
> we encounter a stack-lock in the current thread, we can directly load
> the displaced-header from it, otherwise it inflates the lock. I guess
> that this does not really play well with GCs, because it means that,
> e.g., concurrent marking or relocation would inflate some locks.
> The trouble might indeed be that GC threads would not be allowed to do
> that because of concurrent deflation. Hrmpf. Is there a way to prevent
> deflation during certain GC phases maybe? Or coordinate GC threads with
> Robbin Ehn's work sounds promising. Can you give me more details about
> what he's up to? Maybe a PR?
>> Hi Roman,
>> Might need to catch up a bit with what you have done in detail.
>> However, rather than studying your changes in detail, I'm just gonna
>> throw out a few things that I know would blow up when doing something
>> like this, unless you have done something to fix that.
>> The main theme is displaced mark words. Due to issues with displaced
>> mark words, I don't currently know of any (good and performant) way of
>> having any stable bits in the markWord that can be read by mutators,
>> when there are displaced mark words. This is why in the generational
>> version of ZGC we are currently building, we do not encode any age
>> bits in the markWord. Instead it is encoded as a per-region/page
>> property, which is more reliable, and can use much lighter
>> synchronization. Anyway, here come a few displaced mark word issues:
>> 1) A displaced mark word can point into an ObjectMonitor. However, the
>> monitor can be concurrently deflated, and subsequently freed. The safe
>> memory reclamation policy for ObjectMonitor unlinks the monitors
>> first, then performs a thread-local handshake with all (...?!)
>> threads, and then frees them after the handshake, when it knows that
>> surely nobody is looking at these monitors any longer. Except of
>> course concurrent GC threads do not take part in such handshakes, and
>> therefore, concurrent GC threads are suddenly unable to safely read
>> klasses, through displaced mark words, pointing into concurrently
>> freeing ObjectMonitors. It can result in use-after-free. They are
>> basically not allowed to dereference ObjectMonitor, without more
>> synchronization code to allow that.
>> 2) A displaced mark word can also point into a stack lock, right into
>> the stack of a concurrently running thread. Naturally, this thread can
>> concurrently die, and its stack be deallocated, or concurrently
>> mutated after the lock is released on that thread. In other words, the
>> memory of the stack on other threads is completely unreliable. The way
>> in which this works regarding hashCode, which similarly needs to be
>> read by various parties, is that the stack lock is concurrently
>> inflated into an inflated lock, which is then a bit more stable to
>> read through, given the right sync dance. Assuming, of course, that the
>> reading thread takes part in the global handshake for SMR purposes.
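
The reclamation order described in point 1 (unlink, handshake, free) can be sketched as a single-threaded toy model. This is an illustration only, not the HotSpot implementation: ToyMonitor, ToyThread, handshake() and deflate_and_free() are hypothetical names, and "freeing" is modelled with a flag so that nothing is actually dereferenced after release. The point is that a thread which does not take part in the handshake can still hold a pointer when the monitor is freed.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model, NOT HotSpot code. All names here are hypothetical.
struct ToyMonitor {
  bool linked;  // still reachable from the object header
  bool freed;   // stands in for the memory having been released
};

struct ToyThread {
  std::string name;
  bool takes_part_in_handshake;
  ToyMonitor* cached;  // monitor pointer loaded before the unlink
};

// Handshake: each participating thread reaches a point where it no longer
// holds monitor pointers loaded earlier. Non-participants are not touched.
void handshake(std::vector<ToyThread>& threads) {
  for (ToyThread& t : threads) {
    if (t.takes_part_in_handshake) {
      t.cached = nullptr;
    }
  }
}

// Unlink, handshake, then free. Returns the names of threads that still
// hold a pointer to the freed monitor (potential use-after-free).
std::vector<std::string> deflate_and_free(ToyMonitor& m,
                                          std::vector<ToyThread>& threads) {
  assert(m.linked && !m.freed);
  m.linked = false;    // step 1: unlink
  handshake(threads);  // step 2: handshake with participating threads
  m.freed = true;      // step 3: free (modelled as a flag)

  std::vector<std::string> dangling;
  for (const ToyThread& t : threads) {
    if (t.cached == &m) {
      dangling.push_back(t.name);
    }
  }
  return dangling;
}
```

With one participating Java thread and one non-participating GC worker both caching the monitor, only the GC worker is reported as dangling in this model.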
>> So yeah, not sure if you have thought about any of this. If not, it
>> might be the issue you are chasing after. It's worth mentioning that
>> Robbin Ehn is currently removing displaced mark words with his Java
>> monitors work. That should make this kind of exercise easier.
>> On 2021-07-20 13:47, Roman Kennke wrote:
>>> Hi ZGC devs,
>>> I am struggling with a ZGC problem in Lilliput, and would like to ask
>>> for your opinion.
>>> I'm currently working on changing runtime oopDesc::klass() to load
>>> the Klass* from the object header instead of the dedicated Klass* field:
>>> This required some coordination in other GCs, because it's not always
>>> safe to access the object header. In particular, objects may be
>>> locked, at which point we need to find the displaced header, or worst
>>> case, inflate the header. I believe I've solved that in all GCs.
>>> However, I am still getting a failure with ZGC, which is kinda
>>> unexpected, because it's the only GC that is *not* messing with
>>> object headers (as far as I know). If you check out the above PR, the
>>> failure can easily be reproduced with:
>>> make run-test TEST=gc/z/TestGarbageCollectorMXBean.java
>>> (and only that test is failing for me).
>>> The crash is in ZHeap::is_object_live() because the ZPage there turns
>>> out to be NULL. I've added a bunch of debug output in that location,
>>> and it looks like the offending object is always inflated *and*
>>> forwarded when it happens, but I fail to see how this is related to
>>> each other, and to the page being NULL. I strongly suspect that
>>> inflation of the object header by calling klass() on it causes the
>>> troubles. Changing back to original implementation of
>>> oopDesc::klass() (swap the commented-out code there) makes the bug
>>> disappear.
>>> Also, the bug always seems to happen when calling through a weak
>>> barrier. Not sure if that is relevant.
>>> Any ideas? Opinions?