RFR: 8256020: Don't resurrect objects on argument-dependency access

Tue Nov 10 16:17:06 UTC 2020

On Tue, 10 Nov 2020 14:08:53 GMT, Erik Österlund <eosterlund at openjdk.org> wrote:

>> In Shenandoah-testing, we noticed that compiler/jsr292/CallSiteDepContextTest.java fails with the following error:
>> 
>> CONF=linux-x86_64-server-fastdebug make run-test TEST=compiler/jsr292/CallSiteDepContextTest.java TEST_VM_OPTS="-XX:+UseShenandoahGC -XX:+ShenandoahVerify"
>> 
>> #  Internal Error (/home/rkennke/src/openjdk/jdk/src/hotspot/share/gc/shenandoah/shenandoahVerifier.cpp:92), pid=2318905, tid=2318938
>> #  Error: Before Updating References, Marked; Must be marked in complete bitmap
>> 
>> Referenced from:
>>   interior location: 0x00000000fff8514c
>>   0x00000000fff85140 - klass 0x000000010004ecd8 java.lang.invoke.MutableCallSite
>>         allocated after mark start
>>     not after update watermark
>>         marked strong
>>         marked weak
>>     not in collection set
>>   mark: mark(is_neutral no_hash age=0)
>>   region: | 2565|R  |BTE     fff80000,     fffc0000,     fffc0000|TAMS     fff80000|UWM     fffc0000|U   256K|T     0B|G   256K|S     0B|L     0B|CP   0
>> 
>> Object:
>>   0x00000000d80a9210 - klass 0x000000010004cf58 java.lang.invoke.DirectMethodHandle
>>     not allocated after mark start
>>     not after update watermark
>>     not marked strong
>>     not marked weak
>>         in collection set
>>   mark: mark(is_neutral no_hash age=0)
>>   region: |    9|CS |BTE     d8080000,     d80c0000,     d80c0000|TAMS     d80c0000|UWM     d80c0000|U   256K|T   256K|G     0B|S     0B|L 22464B|CP   0
>> 
>> Forwardee:
>>   (the object itself)
>> 
>> In other words, a reachable (marked) MutableCallSite references an unreachable DirectMethodHandle. That reference would subsequently become dangling and lead to crashes if accessed.
>> 
>> I narrowed it down to the access in Dependencies::DepStream::recorded_oop_at(int i) which is done as 'strong', which means that it would return the reference even if it is unreachable, e.g. during concurrent class-unloading. This resurrection of the unreachable DMH is potentially fatal: eventually the reference will become dangling (after GC) and lead to crashes when accessed. I believe that access should be 'phantom' instead which causes GCs like Shenandoah and ZGC to return NULL when encountering unreachable objects. 
>> 
>> (Notice that the bug only manifested after JDK-8255691, we accidentally applied the resurrection-preventing weak-LRB on strong access too)
>> 
>> Testing: the offending CallSiteDepContextTest.java, tier1+UseShenandoahGC+ShenandoahVerify, tier2+UseShenandoahGC+ShenandoahVerify, hotspot_gc_shenandoah
>
> So your theory is that someone calls Dependencies::DepStream::recorded_oop_at on an nmethod, after marking terminated, leaking out a dead object. For your theory to be true, you would have acquired a is_unloading() nmethod from somewhere, and called Dependencies::DepStream::recorded_oop_at on it. That immediately excludes e.g. all on-stack nmethods, all nmethods handed out through dependency contexts, and all nmethods handed out through only_alive_and_not_unloading CodeCache iterators, which is almost all of them. There are very few code cache iterations that expose is_unloading() nmethods, and what they have in common is that they are *not* poking around at oops.
> 
> So I suppose I really don't understand what path you could possibly track this to happen, where you have an is_unloading() nmethod, and start poking around at its oops. Would you mind elaborating a bit more, from what context you think Dependencies::DepStream::recorded_oop_at() is being called on an is_unloading() nmethod?

> So your theory is that someone calls Dependencies::DepStream::recorded_oop_at on an nmethod, after marking terminated, leaking out a dead object. For your theory to be true, you would have acquired a is_unloading() nmethod from somewhere, and called Dependencies::DepStream::recorded_oop_at on it. That immediately excludes e.g. all on-stack nmethods, all nmethods handed out through dependency contexts, and all nmethods handed out through only_alive_and_not_unloading CodeCache iterators, which is almost all of them. There are very few code cache iterations that expose is_unloading() nmethods, and what they have in common is that they are _not_ poking around at oops.
> 
> So I suppose I really don't understand what path you could possibly track this to happen, where you have an is_unloading() nmethod, and start poking around at its oops. Would you mind elaborating a bit more, from what context you think Dependencies::DepStream::recorded_oop_at() is being called on an is_unloading() nmethod?

I am digging a little deeper. Planting an assert in relevant place barrier shows that we are exposing an unmarked object in this code path in nmethod::flush_dependencies():
        oop call_site = deps.argument_oop(0);
        if (delete_immediately) {
          assert_locked_or_safepoint(CodeCache_lock);
          MethodHandles::remove_dependent_nmethod(call_site, this);
        } else {
          MethodHandles::clean_dependency_context(call_site);
        }

#  Internal Error (/home/rkennke/src/openjdk/jdk/src/hotspot/share/gc/shenandoah/shenandoahBarrierSet.inline.hpp:112), pid=2396340, tid=2396367
#  assert(obj == __null || !_heap->is_concurrent_weak_root_in_progress() || _heap->marking_context()->is_marked(obj)) failed: only expose marked objects

Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x626fa0]  AccessInternal::PostRuntimeDispatch<ShenandoahBarrierSet::AccessBarrier<544868ul, ShenandoahBarrierSet>, (AccessInternal::BarrierType)2, 544868ul>::oop_access_barrier(void*)+0x290
V  [libjvm.so+0x132b3ea]  nmethod::oop_at(int) const+0x4a
V  [libjvm.so+0x9e862e]  Dependencies::DepStream::argument_oop(int)+0x7e
V  [libjvm.so+0x132fe21]  nmethod::flush_dependencies(bool)+0x1f1
V  [libjvm.so+0x1580e79]  ShenandoahNMethodUnlinkClosure::do_nmethod(nmethod*)+0x3d9
V  [libjvm.so+0x16113fc]  ShenandoahNMethodTableSnapshot::concurrent_nmethods_do(NMethodClosure*)+0x8c
V  [libjvm.so+0x157f19b]  ShenandoahUnlinkTask::work(unsigned int)+0x2b
V  [libjvm.so+0x18faa54]  GangWorker::run_task(WorkData)+0x84
V  [libjvm.so+0x18fabb3]  GangWorker::loop()+0x63
V  [libjvm.so+0x17b6008]  Thread::call_run()+0xf8
V  [libjvm.so+0x13b6d2e]  thread_native_entry(Thread*)+0x10e

In other words, it gets the unmarked call_site, then does access it afterwards. TBH, I am not totally sure that we aren't doing something wrong somewhere. ?

-------------

PR: https://git.openjdk.java.net/jdk/pull/1113