RFR: 8277434: tests fail with "assert(is_forwarded()) failed: only decode when actually forwarded"

Aleksey Shipilev shade at openjdk.java.net
Mon Nov 29 08:11:04 UTC 2021

On Sun, 28 Nov 2021 03:30:25 GMT, Kim Barrett <kbarrett at openjdk.org> wrote:

> Please review this change which fixes a rare assert failure in
> G1ParScanThreadState::write_ref_field_post. The problem is that the assert is
> referencing values from the forwardee, and that isn't safe. When an object is
> copied and forwarded to the copy, the forwarding is performed with a relaxed
> cmpxchg. This means the data being copied into the forwardee might not be
> visible to a different thread that has obtained the forwardee. All uses of a
> forwardee must take care to avoid reading from it without performing
> additional synchronization. The assert in question violates that restriction.
> The incorrect assertion is
> assert(dest_attr.is_in_cset() == (obj->is_forwarded() && obj->forwardee() == obj), "...");
> obj is the forwardee of a different object.
> It's okay here to examine the contents of obj when it's in the cset.  In
> that case it's not a copy; it's the original object, self-forwarded due to
> evacuation failure.
> But it's not okay to examine the contents of obj when it's not in the cset,
> e.g. when it's a forwardee copy of the original object.  Doing so violates
> the above restriction.
> So that tells us this code is wrong, but it's still not obvious this is the
> cause of the failure in question.  The failure we're seeing appears to be
> that obj->is_forwarded() returns true, but the assert of the same thing in
> forwardee returns false.  But is_forwarded() should never be true for a
> copied forwardee.  But it turns out a copied forwardee might temporarily
> appear to be forwarded, due to the unordered copy + forward sequence.
> Consider the following scenario:
> A thread is evacuating an object and allocates a candidate forwardee, copies
> into it, and finally discovers the forwarding race was lost even before the
> copy.  The candidate forwardee appears to be marked as forwarded.  That
> candidate is deallocated and the real forwardee is used by the thread.
> That same thread evacuates a second object, reallocating some of the same
> memory as from the previous candidate.  It copies the second object into the
> new forwardee, successfully performs the forwarding, and uses that newly
> created forwardee.
> A second thread attempts to evacuate that same second object, and finds it
> already forwarded.  It goes into write_ref_field_post, reaches the assert,
> and attempts to use values from the forwardee.  Because the copy and the
> forwarding by the first thread were unordered, this second thread might see
> the old copy that appears to be forwarded, rather than the up to date copy.
> So the first call to is_forwarded() returns true.  The forwardee() assert of
> the same thing might instead get the up to date contents, resulting in the
> reported failure.  Alternatively, it might still get old data, in which case
> nothing gets noticed because the self-forward check will fail and the
> write_ref_field_post assert will (accidentally) succeed.
> So there's a very small window in which the reported failure can occur.  But
> it's all caused by the write_ref_field_post assert performing invalid
> accesses.  Stop doing that and the problem should go away.
> I think this failure couldn't have happened before JDK-8271579, but that's
> kind of accidental and doesn't change that the write_ref_field_post assert
> was performing invalid accesses.  (Those accesses were introduced later.)
> Testing:
> mach5 tier1-5.
> Ran applications/jcstress/acqrel.java 100 times without failure.
> Without the change I had a 25-30% failure rate for that test.
> I don't know why that test was such a good canary, when this failure seems to
> otherwise be pretty rare, but glad that it was.

Ooof, so a relaxed forwardee installation comes with an odd invariant like this. I linked JDK-8204524 to this issue, to track where the forwardee installation was relaxed. 

The fix for this particular issue looks fine to me, thanks.

A few follow-up questions about the general issue here. How is it not affecting the other places where G1 accesses `obj->forwardee()`, like in `G1ScanClosureBase::prefetch_and_push` or `G1ParScanThreadState::do_partial_array`? Also, it is not theoretically clear how would "All uses of a forwardee must take care to avoid reading from it without performing additional synchronization", given that the trouble had already happened on writer side, and there is little a reader can do to recover?


Marked as reviewed by shade (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/6582

More information about the hotspot-gc-dev mailing list