RFR: 8232782: Shenandoah: streamline post-LRB CAS barrier (aarch64) (version 2)

Thu Jul 2 08:49:45 UTC 2020

Hi Kelvin,

> Thanks for the careful review.  I'll make another round with it.
> 
> I wasn’t entirely sure what the preconditions for this function were
> so I documented my best guess.  Good thing you reviewed my comments
> carefully because apparently I guessed wrong.
> 
> Regarding use of the weak Boolean argument, it was not clear to me
> why a caller would want to specify weak, but I assumed if they did
> request it, I should go ahead and honor it.  Are you suggesting I
> should ignore the value of the weak argument and always behave as if
> the caller had requested not weak?

That is a good question. Let me answer the next one first, maybe it
becomes clearer then.

> Also, wil you check my understanding further?   I believe this
> service (cmpxchg_oop) is used primarily to resolve races between
> background GC threads and mutator threads.  There is no need to need
> for the JVM to resolve races between multiple mutator threads.  That
> is a programmer responsibility. The reason for cmpxchg_oop is to
> resolve the following two race conditions:
> 
>   1. A mutator thread is overwriting a field that the GC may be
> overwriting in parallel. The mutator needs to use this protocol to
> overwrite pointers in order to prevent the following race:
>         i. GC fetches from-space pointer at addr
>        ii. GC computes equivalent to-space pointer
>       iii. Mutator overwrites pointer at addr with new_val
>       iv. GC overwrites pointer at addr with to-space replacement of
> original from-space value, incorrectly clobbering the mutation!
>      The GC thread uses cmpxchg_oop to abandon its overwrite attempt
> in case the value has changed since it began its effort.
> 

That is not quite right. That race is solved by the GC thread using a
simple CAS. See here:

https://hg.openjdk.java.net/jdk/jdk/file/d886e752a7b0/src/hotspot/share/gc/shenandoah/shenandoahHeap.inline.hpp#l180

>   2. A mutator thread fetches a pointer from memory and the LRB
> discovers that the pointer refers to from-space memory.  After
> determining the to-space pointer that represents the new location of
> the object originally referenced by the fetched from-space pointer,
> the mutator attempts to heal the value held in memory by overwriting
> that memory with the to-space equivalent.  The healing effort must
> use cmpxchg_oop to resolve the exact same race condition that might
> occur when a background GC thread is healing a from-space pointer
> residing in memory.
> 

That race is also solved by a simple CAS:
https://hg.openjdk.java.net/jdk/jdk/file/d886e752a7b0/src/hotspot/share/gc/shenandoah/shenandoahBarrierSet.inline.hpp#l73

> In both of the above cases, we know that GC is active and both
> expected and new_val are not equal to null.  We also know for both of
> these cases, by the way, that expected refers to from-space, so maybe
> our implementation of cmpxchg_oop should really be starting with step
> 2!
> 
> Can you help me understand the other situations in which cmpxchg_oop
> will be called, under which expected and new_val may equal null?

The assembler cmpxchg_oop() routine is only used by the C1 and C2
compilers to implement the intrinsics for
Unsafe.compareAndSwapReference() and the whole family of related
methods. There is also a weakCompareAndSwapReference(), which is where
the weak variant comes into play. The problem that it is trying to
solve is the false negative that arises when the value in memory is a
from-space reference, but the expected value is a to-space reference.
In this case, a simple CAS would fail, even though it shouldn't. An
additional problem is that, when we retry the CAS with the from-space
ref, another thread (mutator or GC) may update the field to now point
to the to-space ref of the same object, in which case the 2nd CAS would
fail too, and we need to retry a 3rd time. That 3rd case is very rare.

It is entirely possible to see NULL refs there, both in-memory and also
for the expected-value. The compiler can optimize the case when it can
statically determine that the expected-value is NULL, and generate a
simple CAS then, but NULLs can still happen at runtime, and we need to
handle this, either by explicit short-cut or by dealing with it when
resolving the object (that is what we currently do).

Now, about the weak property. I am really not sure what to do.
Technically we'd be allowed to produce false negatives with weak-CAS,
but it is not intended to consistently produce false negatives
repeatedly. It basically caters for the platforms that can use LL/SC
pair which may fail sporadically and would only ever be used in a loop.
If we'd do weak-CAS as simple CAS without our protocol, that would be
legal, and should work, but it would send mutator threads in fairly
long retry-loops until another thread heals the reference. OTOH, our
protocol already does the retry-loop and turns any weak-CAS into a
strong-CAS. So we can just as well ignore (or remove) the weak
argument.

Does it clarify the situation for you?
Roman

> Thanks.
> 
> 
> 
> On 7/1/20, 3:56 PM, "Roman Kennke" <rkennke at redhat.com> wrote:
> 
>     CAUTION: This email originated from outside of the organization.
> Do not click links or open attachments unless you can confirm the
> sender and know the content is safe.
> 
> 
> 
>     Ah now I remember why we want to do the CAS barrier even for
> weak-CAS:
>     while it is technically true that calling code must be prepared
> to deal
>     with occasional spurious failures, we probably don't want it to
> get
>     stuck in retry-loop until GC gets to update the field - which may
> be a
>     fairly long period of time. Not sure if we can draw any other
> benefit
>     from 'weak' here?
> 
>     Another *possible* optimization (while we're at it) is to check
> if GC
>     is actually active (HAS_FORWARDED, see LRB impl) and avoid the
> whole
>     decoding of failure-witness business. This may be something to
> keep in
>     mind should we ever find out that this code affects us outside of
> GC
>     cycles.
> 
>     Roman
> 
>     On Thu, 2020-07-02 at 00:20 +0200, Roman Kennke wrote:
>     > Hi Kelvin,
>     >
>     > I had something similar in mind with x86, but Aleksey stopped
> me :-)
>     > It
>     > may be worth noting that we need a maximum of 3 consequtive
> CASes to
>     > cover all cases, it is not necessary to make a loop:
>     >
>     > 1. Fast-path CAS#1
>     > If that fails, it *may* be because of value-in-memory being a
> from-
>     > space-ref. Check that and do:
>     > 2. CAS#2 with expected = previous from CAS#1
>     > If that fails, it may be because a competing thread just wrote
> a to-
>     > space-ref of our expected object (i.e. our *original* expected
>     > value),
>     > try again:
>     > 3. CAS#3 with expected = previous from CAS#2 (== original
> expected)
>     > at that point the CAS cannot fail because of false negative
> because
>     > of
>     > to-space-invariant
>     >
>     > Correct me if I am wrong here!
>     >
>     > Ah I see you optimized for smaller code there:
>     > +  // It is extremely rare we reach this point.  For this
> reason, the
>     > +  // implementation opts for smaller rather than potentially
> faster
>     > +  // code.  Ultimately, smaller code for this rare case most
> likely
>     > +  // delivers higher overall throughput by enabling improved
> icache
>     > +  // performance.
>     > Good then.
>     >
>     > Some comments on the patch:
>     >
>     > // It is required that addr represent a
>     > +// memory location at which a non-null reference value is
> stored and
>     > that
>     > +// expected holds a non-null reference value.
>     >
>     > Is this true? I believe the memory location may hold NULL. The
>     > expected
>     > value may hold NULL too, in which case we can skip the barrier.
> The
>     > compiler optimizes the case where it *knows statically* that
> expected
>     > == NULL to not emit the barrier. We also should have run-time
> checks
>     > for these situations that skip to 'done' on first CAS-failure
> with
>     > NULL
>     > for the case when compiler cannot determine statically NULL, or
>     > alternatively don't use the _not_null() variants of
> encode/decode
>     > methods.
>     >
>     > +//  weak:    This relaxes the "strong" property so that CAS is
>     > allowed
>     > +//           to fail even when the expected value is present
> in
>     > memory.
>     > +//           This is useful for load-with-lock, store-
> conditional
>     > +//           loops where certain failures require retries.  If
> weak
>     > is
>     > +//           enabled, it is ok to return failure rather than
>     > retrying.
>     >
>     > If we are certain that the meaning of 'weak' allows us to skip
> the
>     > barrier, we can change/remove our corresponding implementation
> of the
>     > weak paths to emit a plain CAS. Right?
>     >
>     > +  // Try to CAS with given arguments using LL/SC pair.  If
>     > successful,
>     > +  // then we are done.
>     >
>     > LL/SC pair is not (generally) true anymore with this updated
> patch.
>     >
>     > Other than that, I like the impl.
>     >
>     > I'd actually like to see a similar implementation in x86, I
> believe
>     > it
>     > should be slightly more efficient. However, I'm not sure it's
> really
>     > worth - I don't think I've ever seen this code path very hot
>     > anywhere.
>     > We'd need to convince Aleksey though - this code has undergone
> a
>     > couple
>     > of changes already and it's basically always caused unforeseen
>     > troubles
>     > (weird register clashes, extremely rare corner cases like the
> above
>     > mentioned 3rd CAS, etc)
>     >
>     > Thank you!
>     > Roman
>     >
>     >
>     > On Tue, 2020-06-30 at 22:39 +0000, Nilsen, Kelvin wrote:
>     > > Thank you for feedback from previously distributed draft
>     > > patch.  This
>     > > new patch is similar to the patch distributed on June
> 24.  However,
>     > > this version uses MacroAssembler::cmpxchng() instead of hard-
> coding
>     > > the use of ldxr/stxr instructions.
>     > >
>     > > See 
> http://cr.openjdk.java.net/~kdnilsen/JDK-8232782/webrev.02/
>     > >
>     > > This patch addresses the problem described in
>     > > https://bugs.openjdk.java.net/browse/JDK-8232782
>     > >
>     > > The implementation mimics the behavior of the recently
> revised x86
>     > > implementation of cmpxchg_oop with slight refinements:
>     > >
>     > > X86 version:
>     > > Step 1: Try CAS
>     > > Step 2: if CAS fails, check if original memory holds
> equivalent
>     > > from-
>     > > space pointer
>     > > Step 3: Use CAS to overwrite memory with equivalent to-space
>     > > pointer
>     > > Step 4: Try CAS again
>     > > Step 5: Return boolean result to indicate success or failure
>     > >
>     > > AARCH64 version:
>     > > Step 1: Try CAS
>     > > Step 2: if CAS fails, check if original memory holds
> equivalent
>     > > from-
>     > > space pointer
>     > > Step 3 (differs): Do not overwrite memory with equivalent to-
> space
>     > > pointer,  Instead, run the original CAS request with from-
> space
>     > > pointer as the "expected" value. If this succeeds, we're
> done.  If
>     > > this fails, go back to step 1 and try that again.
>     > >
>     > > Step 5: Return boolean result to indicate success or failure
>     > >
>     > > This patch satisfies tier1, tier2, and hotspot_gc_shenandoah
>     > > regression tests on Ubuntu 18.04.4 LTS (GNU/Linux 5.3.0-1023-
> aws
>     > > aarch64).  I have also run an "extreme" garbage collection
> workload
>     > > for 20 minutes without problem.
>     > >
>     > > Is this ok to merge?
>     > >
>     > >
>     > >
> 
>