[External] : Re: Need help with ZGC failure in Lilliput

Tue Jul 20 21:23:08 UTC 2021

Hi Roman,

> Alright, I disabled monitor deflation, and the problem persists.

I see. Naturally, there may be multiple problems.

> I narrowed the problem down to two (closely related) calls to
> ZUtils::object_size() (which in turn calls oopDesc::size() which calls
> oopDesc::klass()), which seem to trigger it: ZLiveMap::iterate() and
> ZRelocateClosure::relocate_object(). When I change those two calls to use the
> Klass* from the dedicated field, the failure disappears. I don't think that the
> actual Klass* disagrees (I have asserts to check that), which means it can only be
> the side effect of that call -- most likely the header inflation. How this inflation
> would affect ZGC is still not entirely clear to me.

Okay.

> The problem always happens during weak storage scan. Inflation of a lock
> creates new weak handle that points back to the object. I suspect that, at the
> very least, this must not point to the old copy of that object. I need to think this
> through.

Is the lock inflation provoked by the GC thread itself, as a side effect of reading the klass?

/Erik

> Roman
> 
> > Hi Erik,
> >
> > yes I have thought about this, but I am not sure if what I do is enough.
> > I'm basically following the logic that is implemented for hash-code:
> > if we encounter a stack-lock in the current thread, we can directly
> > load the displaced-header from it, otherwise it inflates the lock. I
> > guess that this is not really going perfect with GCs, because it means
> > that, e.g., concurrent marking or relocation would inflate some locks.
> >
> > The trouble might indeed be that GC threads would not be allowed to do
> > that because of concurrent deflation. Hrmpf. Is there a way to prevent
> > deflation during certain GC phases maybe? Or coordinate GC threads
> > with deflation?
> >
> > Robbin Ehn's work sounds promising. Can you give me more details about
> > what he's up to? Maybe a PR?
> >
> > Thanks,
> > Roman
> >
> >> Hi Roman,
> >>
> >> Might need to catch up a bit with what you have done in detail.
> >> However, rather than studying your changes in detail, I'm just gonna
> >> throw out a few things that I know would blow up when doing something
> >> like this, unless you have done something to fix that.
> >>
> >> The main theme is displaced mark words. Due to issues with displaced
> >> mark words, I don't currently know of any (good and performant) way
> >> of having any stable bits in the markWord that can be read by
> >> mutators, when there are displaced mark words. This is why in the
> >> generational version of ZGC we are currently building, we do not
> >> encode any age bits in the markWord. Instead it is encoded as a
> >> per-region/page property, which is more reliable, and can use much
> >> lighter synchronization. Anyway, here comes a few displaced mark word
> issues:
> >>
> >> 1) A displaced mark word can point into an ObjectMonitor. However,
> >> the monitor can be concurrently deflated, and subsequently freed. The
> >> safe memory reclamation policy for ObjectMonitor unlinks the monitors
> >> first, then performs a thread-local handshake with all (...?!)
> >> threads, and then frees them after the handshake, when it knows that
> >> surely nobody is looking at these monitors any longer. Except of
> >> course concurrent GC threads do not take part in such handshakes, and
> >> therefore, concurrent GC threads are suddenly unable to safely read
> >> klasses, through displaced mark words, pointing into concurrently
> >> freeing ObjectMonitors. It can result in use-after-free. They are
> >> basically not allowed to dereference ObjectMonitor, without more
> >> synchronization code to allow that.
> >>
> >> 2) A displaced mark word can also point into a stack lock, right into
> >> the stack of a concurrently running thread. Naturally, this thread
> >> can concurrently die, and its stack be deallocated, or concurrently
> >> mutated after the lock is released on that thread. In other words,
> >> the memory of the stack on other threads is completely unreliable.
> >> The way in which this works regarding hashCode, which similarly needs
> >> to be read by various parties, is that the stack lock is concurrently
> >> inflated into an inflated lock, which is then a bit more stable to
> >> read through, given the right sync dance. Assuming of course, that
> >> the reading thread, takes part in the global handshake for SMR purposes.
> >>
> >> So yeah, not sure if you have thought about any of this. If not, it
> >> might be the issue you are chasing after. It's worth mentioning that
> >> Robbin Ehn is currently removing displaced mark words with his Java
> >> monitors work. That should make this kind of exercise easier.
> >>
> >> Thanks,
> >> /Erik
> >>
> >> On 2021-07-20 13:47, Roman Kennke wrote:
> >>> Hi ZGC devs,
> >>>
> >>> I am struggling with a ZGC problem in Lilliput, and would like to
> >>> ask for your opinion.
> >>>
> >>> I'm currently working on changing runtime oopDesc::klass() to load
> >>> the Klass* from the object header instead of the dedicated Klass* field:
> >>>
> >>> https://urldefense.com/v3/__https://github.com/openjdk/lilliput/pull
> >>>
> /12__;!!ACWV5N9M2RV99hQ!aUxkQDk7iCst4N5JxPMKOMHYitKvSbW0BvuEmk
> 62FMNs
> >>> YTbEWHPMhpcl4YbuZ7rXELA$
> >>>
> >>> This required some coordination in other GCs, because it's not
> >>> always safe to access the object header. In particular, objects may
> >>> be locked, at which point we need to find the displaced header, or
> >>> worst case, inflate the header. I believe I've solved that in all GCs.
> >>>
> >>> However, I am still getting a failure with ZGC, which is kinda
> >>> unexpected, because it's the only GC that is *not* messing with
> >>> object headers (as far as I know. If you check out the above PR, the
> >>> failure can easily reproduced with:
> >>>
> >>> make run-test TEST=gc/z/TestGarbageCollectorMXBean.java
> >>>
> >>> (and only that test is failing for me).
> >>>
> >>> The crash is in ZHeap::is_object_live() because the ZPage there
> >>> turns out to be NULL. I've added a bunch of debug output in that
> >>> location, and it looks like the offending object is always inflated
> >>> *and* forwarded when it happens, but I fail to see how this is
> >>> related to each other, and to the page being NULL. I strongly
> >>> suspect that inflation of the object header by calling klass() on it
> >>> causes the troubles. Changing back to original implementation of
> >>> oopDesc::klass() (swap commented-out-code there) makes the bug
> >>> disappear.
> >>>
> >>> Also, the bug always seems to happen when calling through a weak
> >>> barrier. Not sure if that is relevant.
> >>>
> >>> Any ideas? Opinions?
> >>>
> >>> Thanks,
> >>> Roman
> >>>
> >>