RFR: 8232782: Shenandoah: streamline post-LRB CAS barrier (aarch64) (version 2)

Thu Jul 2 15:34:31 UTC 2020

Hi Kelvin,

>  1. As you suggest, remove the argument from this function and fix up
> all the call points.  The x86 version does not have it for example.

Yeah, x86 doesn't have it because there's no weak CAS in x86 hardware.

>  2. Allow the argument, but ignore its value.  Always behave as if
> !weak.

Both caller and the implementation are 'ours', we can just as well get
rid of the argument.

>  3. Modify my draft patch so that I distinguish between the cause of
> cmpxchg failure if weak.  This would involve conditional branching.

Eww, I doubt that this brings any benefit.

>  4. A fourth option generalizes on the third.  The "problem" with the
> third option is that I introduce new code outside cmpxchg to
> rediscover what cmpxchg already knows.  If I were to in-line the
> behavior of cmpxchg, I could adjust the branch destinations so that I
> don't have to rediscover the lost information based on new condition
> tests.
> 5. Or, maybe there already exists a subsequent peephole optimization
> pass that achieves the effect of solution 4 with code generated
> according to solution 3.  I'm still very new to HotSpot architecture
> so am not familiar with how all the pieces fit together.
> 

I think 4 + 5 would require that we have any control over the caller of
weakCompareAndSwap(), which we haven't. This is Java code that is
compiled to IR.

I'd say we have no easy way to get any value out of weak, let's just
ignore it, treat it as strong CAS, and remove the argument altogether. 

There is one additional wrinkle afaict: MacroAssembler::cmpxchg() has a
retry loop itself, for the case when it sees !weak and !UseLSE, maybe
we can make a tiny optimization there to avoid that loop and handle it
ourselves? Not sure it would be worth the troubles, though.

Roman

> Do you have any further suggestions regarding which option is
> preferred?
> 
> On 7/2/20, 1:50 AM, "Roman Kennke" <rkennke at redhat.com> wrote:
> 
>     Hi Kelvin,
> 
>     > Thanks for the careful review.  I'll make another round with
> it.
>     >
>     > I wasn’t entirely sure what the preconditions for this function
> were
>     > so I documented my best guess.  Good thing you reviewed my
> comments
>     > carefully because apparently I guessed wrong.
>     >
>     > Regarding use of the weak Boolean argument, it was not clear to
> me
>     > why a caller would want to specify weak, but I assumed if they
> did
>     > request it, I should go ahead and honor it.  Are you suggesting
> I
>     > should ignore the value of the weak argument and always behave
> as if
>     > the caller had requested not weak?
> 
>     That is a good question. Let me answer the next one first, maybe
> it
>     becomes clearer then.
> 
>     > Also, wil you check my understanding further?   I believe this
>     > service (cmpxchg_oop) is used primarily to resolve races
> between
>     > background GC threads and mutator threads.  There is no need to
> need
>     > for the JVM to resolve races between multiple mutator
> threads.  That
>     > is a programmer responsibility. The reason for cmpxchg_oop is
> to
>     > resolve the following two race conditions:
>     >
>     >   1. A mutator thread is overwriting a field that the GC may be
>     > overwriting in parallel. The mutator needs to use this protocol
> to
>     > overwrite pointers in order to prevent the following race:
>     >         i. GC fetches from-space pointer at addr
>     >        ii. GC computes equivalent to-space pointer
>     >       iii. Mutator overwrites pointer at addr with new_val
>     >       iv. GC overwrites pointer at addr with to-space
> replacement of
>     > original from-space value, incorrectly clobbering the mutation!
>     >      The GC thread uses cmpxchg_oop to abandon its overwrite
> attempt
>     > in case the value has changed since it began its effort.
>     >
> 
>     That is not quite right. That race is solved by the GC thread
> using a
>     simple CAS. See here:
> 
>     
> https://hg.openjdk.java.net/jdk/jdk/file/d886e752a7b0/src/hotspot/share/gc/shenandoah/shenandoahHeap.inline.hpp#l180
> 
>     >   2. A mutator thread fetches a pointer from memory and the LRB
>     > discovers that the pointer refers to from-space memory.  After
>     > determining the to-space pointer that represents the new
> location of
>     > the object originally referenced by the fetched from-space
> pointer,
>     > the mutator attempts to heal the value held in memory by
> overwriting
>     > that memory with the to-space equivalent.  The healing effort
> must
>     > use cmpxchg_oop to resolve the exact same race condition that
> might
>     > occur when a background GC thread is healing a from-space
> pointer
>     > residing in memory.
>     >
> 
>     That race is also solved by a simple CAS:
>     
> https://hg.openjdk.java.net/jdk/jdk/file/d886e752a7b0/src/hotspot/share/gc/shenandoah/shenandoahBarrierSet.inline.hpp#l73
> 
>     > In both of the above cases, we know that GC is active and both
>     > expected and new_val are not equal to null.  We also know for
> both of
>     > these cases, by the way, that expected refers to from-space, so
> maybe
>     > our implementation of cmpxchg_oop should really be starting
> with step
>     > 2!
>     >
>     > Can you help me understand the other situations in which
> cmpxchg_oop
>     > will be called, under which expected and new_val may equal
> null?
> 
>     The assembler cmpxchg_oop() routine is only used by the C1 and C2
>     compilers to implement the intrinsics for
>     Unsafe.compareAndSwapReference() and the whole family of related
>     methods. There is also a weakCompareAndSwapReference(), which is
> where
>     the weak variant comes into play. The problem that it is trying
> to
>     solve is the false negative that arises when the value in memory
> is a
>     from-space reference, but the expected value is a to-space
> reference.
>     In this case, a simple CAS would fail, even though it shouldn't.
> An
>     additional problem is that, when we retry the CAS with the from-
> space
>     ref, another thread (mutator or GC) may update the field to now
> point
>     to the to-space ref of the same object, in which case the 2nd CAS
> would
>     fail too, and we need to retry a 3rd time. That 3rd case is very
> rare.
> 
>     It is entirely possible to see NULL refs there, both in-memory
> and also
>     for the expected-value. The compiler can optimize the case when
> it can
>     statically determine that the expected-value is NULL, and
> generate a
>     simple CAS then, but NULLs can still happen at runtime, and we
> need to
>     handle this, either by explicit short-cut or by dealing with it
> when
>     resolving the object (that is what we currently do).
> 
>     Now, about the weak property. I am really not sure what to do.
>     Technically we'd be allowed to produce false negatives with weak-
> CAS,
>     but it is not intended to consistently produce false negatives
>     repeatedly. It basically caters for the platforms that can use
> LL/SC
>     pair which may fail sporadically and would only ever be used in a
> loop.
>     If we'd do weak-CAS as simple CAS without our protocol, that
> would be
>     legal, and should work, but it would send mutator threads in
> fairly
>     long retry-loops until another thread heals the reference. OTOH,
> our
>     protocol already does the retry-loop and turns any weak-CAS into
> a
>     strong-CAS. So we can just as well ignore (or remove) the weak
>     argument.
> 
>     Does it clarify the situation for you?
>     Roman
> 
> 
>     > Thanks.
>     >
>     >
>     >
>     > On 7/1/20, 3:56 PM, "Roman Kennke" <rkennke at redhat.com> wrote:
>     >
>     >     CAUTION: This email originated from outside of the
> organization.
>     > Do not click links or open attachments unless you can confirm
> the
>     > sender and know the content is safe.
>     >
>     >
>     >
>     >     Ah now I remember why we want to do the CAS barrier even
> for
>     > weak-CAS:
>     >     while it is technically true that calling code must be
> prepared
>     > to deal
>     >     with occasional spurious failures, we probably don't want
> it to
>     > get
>     >     stuck in retry-loop until GC gets to update the field -
> which may
>     > be a
>     >     fairly long period of time. Not sure if we can draw any
> other
>     > benefit
>     >     from 'weak' here?
>     >
>     >     Another *possible* optimization (while we're at it) is to
> check
>     > if GC
>     >     is actually active (HAS_FORWARDED, see LRB impl) and avoid
> the
>     > whole
>     >     decoding of failure-witness business. This may be something
> to
>     > keep in
>     >     mind should we ever find out that this code affects us
> outside of
>     > GC
>     >     cycles.
>     >
>     >     Roman
>     >
>     >     On Thu, 2020-07-02 at 00:20 +0200, Roman Kennke wrote:
>     >     > Hi Kelvin,
>     >     >
>     >     > I had something similar in mind with x86, but Aleksey
> stopped
>     > me :-)
>     >     > It
>     >     > may be worth noting that we need a maximum of 3
> consequtive
>     > CASes to
>     >     > cover all cases, it is not necessary to make a loop:
>     >     >
>     >     > 1. Fast-path CAS#1
>     >     > If that fails, it *may* be because of value-in-memory
> being a
>     > from-
>     >     > space-ref. Check that and do:
>     >     > 2. CAS#2 with expected = previous from CAS#1
>     >     > If that fails, it may be because a competing thread just
> wrote
>     > a to-
>     >     > space-ref of our expected object (i.e. our *original*
> expected
>     >     > value),
>     >     > try again:
>     >     > 3. CAS#3 with expected = previous from CAS#2 (== original
>     > expected)
>     >     > at that point the CAS cannot fail because of false
> negative
>     > because
>     >     > of
>     >     > to-space-invariant
>     >     >
>     >     > Correct me if I am wrong here!
>     >     >
>     >     > Ah I see you optimized for smaller code there:
>     >     > +  // It is extremely rare we reach this point.  For this
>     > reason, the
>     >     > +  // implementation opts for smaller rather than
> potentially
>     > faster
>     >     > +  // code.  Ultimately, smaller code for this rare case
> most
>     > likely
>     >     > +  // delivers higher overall throughput by enabling
> improved
>     > icache
>     >     > +  // performance.
>     >     > Good then.
>     >     >
>     >     > Some comments on the patch:
>     >     >
>     >     > // It is required that addr represent a
>     >     > +// memory location at which a non-null reference value
> is
>     > stored and
>     >     > that
>     >     > +// expected holds a non-null reference value.
>     >     >
>     >     > Is this true? I believe the memory location may hold
> NULL. The
>     >     > expected
>     >     > value may hold NULL too, in which case we can skip the
> barrier.
>     > The
>     >     > compiler optimizes the case where it *knows statically*
> that
>     > expected
>     >     > == NULL to not emit the barrier. We also should have run-
> time
>     > checks
>     >     > for these situations that skip to 'done' on first CAS-
> failure
>     > with
>     >     > NULL
>     >     > for the case when compiler cannot determine statically
> NULL, or
>     >     > alternatively don't use the _not_null() variants of
>     > encode/decode
>     >     > methods.
>     >     >
>     >     > +//  weak:    This relaxes the "strong" property so that
> CAS is
>     >     > allowed
>     >     > +//           to fail even when the expected value is
> present
>     > in
>     >     > memory.
>     >     > +//           This is useful for load-with-lock, store-
>     > conditional
>     >     > +//           loops where certain failures require
> retries.  If
>     > weak
>     >     > is
>     >     > +//           enabled, it is ok to return failure rather
> than
>     >     > retrying.
>     >     >
>     >     > If we are certain that the meaning of 'weak' allows us to
> skip
>     > the
>     >     > barrier, we can change/remove our corresponding
> implementation
>     > of the
>     >     > weak paths to emit a plain CAS. Right?
>     >     >
>     >     > +  // Try to CAS with given arguments using LL/SC
> pair.  If
>     >     > successful,
>     >     > +  // then we are done.
>     >     >
>     >     > LL/SC pair is not (generally) true anymore with this
> updated
>     > patch.
>     >     >
>     >     > Other than that, I like the impl.
>     >     >
>     >     > I'd actually like to see a similar implementation in x86,
> I
>     > believe
>     >     > it
>     >     > should be slightly more efficient. However, I'm not sure
> it's
>     > really
>     >     > worth - I don't think I've ever seen this code path very
> hot
>     >     > anywhere.
>     >     > We'd need to convince Aleksey though - this code has
> undergone
>     > a
>     >     > couple
>     >     > of changes already and it's basically always caused
> unforeseen
>     >     > troubles
>     >     > (weird register clashes, extremely rare corner cases like
> the
>     > above
>     >     > mentioned 3rd CAS, etc)
>     >     >
>     >     > Thank you!
>     >     > Roman
>     >     >
>     >     >
>     >     > On Tue, 2020-06-30 at 22:39 +0000, Nilsen, Kelvin wrote:
>     >     > > Thank you for feedback from previously distributed
> draft
>     >     > > patch.  This
>     >     > > new patch is similar to the patch distributed on June
>     > 24.  However,
>     >     > > this version uses MacroAssembler::cmpxchng() instead of
> hard-
>     > coding
>     >     > > the use of ldxr/stxr instructions.
>     >     > >
>     >     > > See
>     > http://cr.openjdk.java.net/~kdnilsen/JDK-8232782/webrev.02/
>     >     > >
>     >     > > This patch addresses the problem described in
>     >     > > https://bugs.openjdk.java.net/browse/JDK-8232782
>     >     > >
>     >     > > The implementation mimics the behavior of the recently
>     > revised x86
>     >     > > implementation of cmpxchg_oop with slight refinements:
>     >     > >
>     >     > > X86 version:
>     >     > > Step 1: Try CAS
>     >     > > Step 2: if CAS fails, check if original memory holds
>     > equivalent
>     >     > > from-
>     >     > > space pointer
>     >     > > Step 3: Use CAS to overwrite memory with equivalent to-
> space
>     >     > > pointer
>     >     > > Step 4: Try CAS again
>     >     > > Step 5: Return boolean result to indicate success or
> failure
>     >     > >
>     >     > > AARCH64 version:
>     >     > > Step 1: Try CAS
>     >     > > Step 2: if CAS fails, check if original memory holds
>     > equivalent
>     >     > > from-
>     >     > > space pointer
>     >     > > Step 3 (differs): Do not overwrite memory with
> equivalent to-
>     > space
>     >     > > pointer,  Instead, run the original CAS request with
> from-
>     > space
>     >     > > pointer as the "expected" value. If this succeeds,
> we're
>     > done.  If
>     >     > > this fails, go back to step 1 and try that again.
>     >     > >
>     >     > > Step 5: Return boolean result to indicate success or
> failure
>     >     > >
>     >     > > This patch satisfies tier1, tier2, and
> hotspot_gc_shenandoah
>     >     > > regression tests on Ubuntu 18.04.4 LTS (GNU/Linux
> 5.3.0-1023-
>     > aws
>     >     > > aarch64).  I have also run an "extreme" garbage
> collection
>     > workload
>     >     > > for 20 minutes without problem.
>     >     > >
>     >     > > Is this ok to merge?
>     >     > >
>     >     > >
>     >     > >
>     >
>     >
> 
>