Need help with ZGC failure in Lilliput

Roman Kennke rkennke at redhat.com
Tue Jul 20 19:08:47 UTC 2021


Alright, I disabled monitor deflation, and the problem persists.

I narrowed the problem down to two (closely related) calls to 
ZUtils::object_size() (which in turn calls oopDesc::size() which calls 
oopDesc::klass()), which seem to trigger it: ZLiveMap::iterate() and 
ZRelocateClosure::relocate_object(). When I change those two calls to 
use the Klass* from the dedicated field, the failure disappears. I don't 
think that the actual Klass* disagrees (I have asserts to check that), 
which means it can only be a side effect of that call -- most likely 
the header inflation. How this inflation would affect ZGC is still not 
entirely clear to me.

The problem always happens during weak storage scan. Inflation of a lock 
creates a new weak handle that points back to the object. I suspect that, 
at the very least, this must not point to the old copy of that object. I 
need to think this through.

Roman

> Hi Erik,
> 
> yes I have thought about this, but I am not sure if what I do is enough. 
> I'm basically following the logic that is implemented for hash-code: if 
> we encounter a stack-lock in the current thread, we can directly load 
> the displaced header from it; otherwise we inflate the lock. I suspect 
> that this does not play well with GCs, because it means that, e.g., 
> concurrent marking or relocation would inflate some locks.
> 
> The trouble might indeed be that GC threads would not be allowed to do 
> that because of concurrent deflation. Hrmpf. Is there a way to prevent 
> deflation during certain GC phases maybe? Or coordinate GC threads with 
> deflation?
> 
> Robbin Ehn's work sounds promising. Can you give me more details about 
> what he's up to? Maybe a PR?
> 
> Thanks,
> Roman
> 
>> Hi Roman,
>>
>> Might need to catch up a bit with what you have done in detail. 
>> However, rather than studying your changes in detail, I'm just gonna 
>> throw out a few things that I know would blow up when doing something 
>> like this, unless you have done something to fix that.
>>
>> The main theme is displaced mark words. Due to issues with displaced 
>> mark words, I don't currently know of any (good and performant) way of 
>> having any stable bits in the markWord that can be read by mutators, 
>> when there are displaced mark words. This is why, in the generational 
>> version of ZGC we are currently building, we do not encode any age 
>> bits in the markWord. Instead, the age is encoded as a per-region/page 
>> property, which is more reliable and can use much lighter 
>> synchronization. Anyway, here come a few displaced mark word issues:
>>
>> 1) A displaced mark word can point into an ObjectMonitor. However, the 
>> monitor can be concurrently deflated, and subsequently freed. The safe 
>> memory reclamation policy for ObjectMonitor unlinks the monitors 
>> first, then performs a thread-local handshake with all (...?!) 
>> threads, and then frees them after the handshake, when it knows that 
>> surely nobody is looking at these monitors any longer. Except of 
>> course concurrent GC threads do not take part in such handshakes, and 
>> therefore, concurrent GC threads are suddenly unable to safely read 
>> klasses through displaced mark words that point into concurrently 
>> freed ObjectMonitors; doing so can result in use-after-free. They are 
>> basically not allowed to dereference an ObjectMonitor without more 
>> synchronization code to allow that.
>>
>> 2) A displaced mark word can also point into a stack lock, right into 
>> the stack of a concurrently running thread. Naturally, this thread can 
>> concurrently die, and its stack be deallocated, or concurrently 
>> mutated after the lock is released on that thread. In other words, the 
>> memory of the stack on other threads is completely unreliable. The way 
>> in which this works regarding hashCode, which similarly needs to be 
>> read by various parties, is that the stack lock is concurrently 
>> inflated into an inflated lock, which is then a bit more stable to 
>> read through, given the right sync dance -- assuming, of course, that 
>> the reading thread takes part in the global handshake for SMR purposes.
>>
>> So yeah, not sure if you have thought about any of this. If not, it 
>> might be the issue you are chasing after. It's worth mentioning that 
>> Robbin Ehn is currently removing displaced mark words with his Java 
>> monitors work. That should make this kind of exercise easier.
>>
>> Thanks,
>> /Erik
>>
>> On 2021-07-20 13:47, Roman Kennke wrote:
>>> Hi ZGC devs,
>>>
>>> I am struggling with a ZGC problem in Lilliput, and would like to ask 
>>> for your opinion.
>>>
>>> I'm currently working on changing runtime oopDesc::klass() to load 
>>> the Klass* from the object header instead of the dedicated Klass* field:
>>>
>>> https://github.com/openjdk/lilliput/pull/12
>>>
>>> This required some coordination in other GCs, because it's not always 
>>> safe to access the object header. In particular, objects may be 
>>> locked, at which point we need to find the displaced header, or worst 
>>> case, inflate the header. I believe I've solved that in all GCs.
>>>
>>> However, I am still getting a failure with ZGC, which is kinda 
>>> unexpected, because it's the only GC that is *not* messing with 
>>> object headers (as far as I know). If you check out the above PR, the 
>>> failure can easily be reproduced with:
>>>
>>> make run-test TEST=gc/z/TestGarbageCollectorMXBean.java
>>>
>>> (and only that test is failing for me).
>>>
>>> The crash is in ZHeap::is_object_live() because the ZPage there turns 
>>> out to be NULL. I've added a bunch of debug output in that location, 
>>> and it looks like the offending object is always inflated *and* 
>>> forwarded when it happens, but I fail to see how this is related to 
>>> each other, and to the page being NULL. I strongly suspect that 
>>> inflating the object header by calling klass() on it causes the 
>>> trouble. Changing back to the original implementation of 
>>> oopDesc::klass() (swap the commented-out code there) makes the bug 
>>> disappear.
>>>
>>> Also, the bug always seems to happen when calling through a weak 
>>> barrier. Not sure if that is relevant.
>>>
>>> Any ideas? Opinions?
>>>
>>> Thanks,
>>> Roman
>>>
>>



More information about the zgc-dev mailing list