RFR: 8232782: Shenandoah: streamline post-LRB CAS barrier (aarch64) (version 2)

Thu Jul 2 15:05:14 UTC 2020

Thank you very much.  This is very helpful.

I see several options regarding how to handle the weak argument:

 1. As you suggest, remove the argument from this function and fix up all the call points.  The x86 version does not have it for example.
 2. Allow the argument, but ignore its value.  Always behave as if !weak.
 3. Modify my draft patch so that I distinguish between the cause of cmpxchg failure if weak.  This would involve conditional branching.  If EQ flag is on, indicate success, even if weak.  But if EQ flag is off, I need to check the witness value. If weak and the witness value equals expected, then cmpxchg failed due to loss of lock so we terminate early weak failure.  But if the witness value does not equal expected, then this failure requires that I proceed to the "next step".
 4. A fourth option generalizes on the third.  The "problem" with the third option is that I introduce new code outside cmpxchg to rediscover what cmpxchg already knows.  If I were to in-line the behavior of cmpxchg, I could adjust the branch destinations so that I don't have to rediscover the lost information based on new condition tests.
5. Or, maybe there already exists a subsequent peephole optimization pass that achieves the effect of solution 4 with code generated according to solution 3.  I'm still very new to HotSpot architecture so am not familiar with how all the pieces fit together.

Do you have any further suggestions regarding which option is preferred?

On 7/2/20, 1:50 AM, "Roman Kennke" <rkennke at redhat.com> wrote:

    Hi Kelvin,

    > Thanks for the careful review.  I'll make another round with it.
    >
    > I wasn’t entirely sure what the preconditions for this function were
    > so I documented my best guess.  Good thing you reviewed my comments
    > carefully because apparently I guessed wrong.
    >
    > Regarding use of the weak Boolean argument, it was not clear to me
    > why a caller would want to specify weak, but I assumed if they did
    > request it, I should go ahead and honor it.  Are you suggesting I
    > should ignore the value of the weak argument and always behave as if
    > the caller had requested not weak?

    That is a good question. Let me answer the next one first, maybe it
    becomes clearer then.

    > Also, wil you check my understanding further?   I believe this
    > service (cmpxchg_oop) is used primarily to resolve races between
    > background GC threads and mutator threads.  There is no need to need
    > for the JVM to resolve races between multiple mutator threads.  That
    > is a programmer responsibility. The reason for cmpxchg_oop is to
    > resolve the following two race conditions:
    >
    >   1. A mutator thread is overwriting a field that the GC may be
    > overwriting in parallel. The mutator needs to use this protocol to
    > overwrite pointers in order to prevent the following race:
    >         i. GC fetches from-space pointer at addr
    >        ii. GC computes equivalent to-space pointer
    >       iii. Mutator overwrites pointer at addr with new_val
    >       iv. GC overwrites pointer at addr with to-space replacement of
    > original from-space value, incorrectly clobbering the mutation!
    >      The GC thread uses cmpxchg_oop to abandon its overwrite attempt
    > in case the value has changed since it began its effort.
    >

    That is not quite right. That race is solved by the GC thread using a
    simple CAS. See here:

    https://hg.openjdk.java.net/jdk/jdk/file/d886e752a7b0/src/hotspot/share/gc/shenandoah/shenandoahHeap.inline.hpp#l180

    >   2. A mutator thread fetches a pointer from memory and the LRB
    > discovers that the pointer refers to from-space memory.  After
    > determining the to-space pointer that represents the new location of
    > the object originally referenced by the fetched from-space pointer,
    > the mutator attempts to heal the value held in memory by overwriting
    > that memory with the to-space equivalent.  The healing effort must
    > use cmpxchg_oop to resolve the exact same race condition that might
    > occur when a background GC thread is healing a from-space pointer
    > residing in memory.
    >

    That race is also solved by a simple CAS:
    https://hg.openjdk.java.net/jdk/jdk/file/d886e752a7b0/src/hotspot/share/gc/shenandoah/shenandoahBarrierSet.inline.hpp#l73

    > In both of the above cases, we know that GC is active and both
    > expected and new_val are not equal to null.  We also know for both of
    > these cases, by the way, that expected refers to from-space, so maybe
    > our implementation of cmpxchg_oop should really be starting with step
    > 2!
    >
    > Can you help me understand the other situations in which cmpxchg_oop
    > will be called, under which expected and new_val may equal null?

    The assembler cmpxchg_oop() routine is only used by the C1 and C2
    compilers to implement the intrinsics for
    Unsafe.compareAndSwapReference() and the whole family of related
    methods. There is also a weakCompareAndSwapReference(), which is where
    the weak variant comes into play. The problem that it is trying to
    solve is the false negative that arises when the value in memory is a
    from-space reference, but the expected value is a to-space reference.
    In this case, a simple CAS would fail, even though it shouldn't. An
    additional problem is that, when we retry the CAS with the from-space
    ref, another thread (mutator or GC) may update the field to now point
    to the to-space ref of the same object, in which case the 2nd CAS would
    fail too, and we need to retry a 3rd time. That 3rd case is very rare.

    It is entirely possible to see NULL refs there, both in-memory and also
    for the expected-value. The compiler can optimize the case when it can
    statically determine that the expected-value is NULL, and generate a
    simple CAS then, but NULLs can still happen at runtime, and we need to
    handle this, either by explicit short-cut or by dealing with it when
    resolving the object (that is what we currently do).

    Now, about the weak property. I am really not sure what to do.
    Technically we'd be allowed to produce false negatives with weak-CAS,
    but it is not intended to consistently produce false negatives
    repeatedly. It basically caters for the platforms that can use LL/SC
    pair which may fail sporadically and would only ever be used in a loop.
    If we'd do weak-CAS as simple CAS without our protocol, that would be
    legal, and should work, but it would send mutator threads in fairly
    long retry-loops until another thread heals the reference. OTOH, our
    protocol already does the retry-loop and turns any weak-CAS into a
    strong-CAS. So we can just as well ignore (or remove) the weak
    argument.

    Does it clarify the situation for you?
    Roman

    > Thanks.
    >
    >
    >
    > On 7/1/20, 3:56 PM, "Roman Kennke" <rkennke at redhat.com> wrote:
    >
    >     CAUTION: This email originated from outside of the organization.
    > Do not click links or open attachments unless you can confirm the
    > sender and know the content is safe.
    >
    >
    >
    >     Ah now I remember why we want to do the CAS barrier even for
    > weak-CAS:
    >     while it is technically true that calling code must be prepared
    > to deal
    >     with occasional spurious failures, we probably don't want it to
    > get
    >     stuck in retry-loop until GC gets to update the field - which may
    > be a
    >     fairly long period of time. Not sure if we can draw any other
    > benefit
    >     from 'weak' here?
    >
    >     Another *possible* optimization (while we're at it) is to check
    > if GC
    >     is actually active (HAS_FORWARDED, see LRB impl) and avoid the
    > whole
    >     decoding of failure-witness business. This may be something to
    > keep in
    >     mind should we ever find out that this code affects us outside of
    > GC
    >     cycles.
    >
    >     Roman
    >
    >     On Thu, 2020-07-02 at 00:20 +0200, Roman Kennke wrote:
    >     > Hi Kelvin,
    >     >
    >     > I had something similar in mind with x86, but Aleksey stopped
    > me :-)
    >     > It
    >     > may be worth noting that we need a maximum of 3 consequtive
    > CASes to
    >     > cover all cases, it is not necessary to make a loop:
    >     >
    >     > 1. Fast-path CAS#1
    >     > If that fails, it *may* be because of value-in-memory being a
    > from-
    >     > space-ref. Check that and do:
    >     > 2. CAS#2 with expected = previous from CAS#1
    >     > If that fails, it may be because a competing thread just wrote
    > a to-
    >     > space-ref of our expected object (i.e. our *original* expected
    >     > value),
    >     > try again:
    >     > 3. CAS#3 with expected = previous from CAS#2 (== original
    > expected)
    >     > at that point the CAS cannot fail because of false negative
    > because
    >     > of
    >     > to-space-invariant
    >     >
    >     > Correct me if I am wrong here!
    >     >
    >     > Ah I see you optimized for smaller code there:
    >     > +  // It is extremely rare we reach this point.  For this
    > reason, the
    >     > +  // implementation opts for smaller rather than potentially
    > faster
    >     > +  // code.  Ultimately, smaller code for this rare case most
    > likely
    >     > +  // delivers higher overall throughput by enabling improved
    > icache
    >     > +  // performance.
    >     > Good then.
    >     >
    >     > Some comments on the patch:
    >     >
    >     > // It is required that addr represent a
    >     > +// memory location at which a non-null reference value is
    > stored and
    >     > that
    >     > +// expected holds a non-null reference value.
    >     >
    >     > Is this true? I believe the memory location may hold NULL. The
    >     > expected
    >     > value may hold NULL too, in which case we can skip the barrier.
    > The
    >     > compiler optimizes the case where it *knows statically* that
    > expected
    >     > == NULL to not emit the barrier. We also should have run-time
    > checks
    >     > for these situations that skip to 'done' on first CAS-failure
    > with
    >     > NULL
    >     > for the case when compiler cannot determine statically NULL, or
    >     > alternatively don't use the _not_null() variants of
    > encode/decode
    >     > methods.
    >     >
    >     > +//  weak:    This relaxes the "strong" property so that CAS is
    >     > allowed
    >     > +//           to fail even when the expected value is present
    > in
    >     > memory.
    >     > +//           This is useful for load-with-lock, store-
    > conditional
    >     > +//           loops where certain failures require retries.  If
    > weak
    >     > is
    >     > +//           enabled, it is ok to return failure rather than
    >     > retrying.
    >     >
    >     > If we are certain that the meaning of 'weak' allows us to skip
    > the
    >     > barrier, we can change/remove our corresponding implementation
    > of the
    >     > weak paths to emit a plain CAS. Right?
    >     >
    >     > +  // Try to CAS with given arguments using LL/SC pair.  If
    >     > successful,
    >     > +  // then we are done.
    >     >
    >     > LL/SC pair is not (generally) true anymore with this updated
    > patch.
    >     >
    >     > Other than that, I like the impl.
    >     >
    >     > I'd actually like to see a similar implementation in x86, I
    > believe
    >     > it
    >     > should be slightly more efficient. However, I'm not sure it's
    > really
    >     > worth - I don't think I've ever seen this code path very hot
    >     > anywhere.
    >     > We'd need to convince Aleksey though - this code has undergone
    > a
    >     > couple
    >     > of changes already and it's basically always caused unforeseen
    >     > troubles
    >     > (weird register clashes, extremely rare corner cases like the
    > above
    >     > mentioned 3rd CAS, etc)
    >     >
    >     > Thank you!
    >     > Roman
    >     >
    >     >
    >     > On Tue, 2020-06-30 at 22:39 +0000, Nilsen, Kelvin wrote:
    >     > > Thank you for feedback from previously distributed draft
    >     > > patch.  This
    >     > > new patch is similar to the patch distributed on June
    > 24.  However,
    >     > > this version uses MacroAssembler::cmpxchng() instead of hard-
    > coding
    >     > > the use of ldxr/stxr instructions.
    >     > >
    >     > > See
    > http://cr.openjdk.java.net/~kdnilsen/JDK-8232782/webrev.02/
    >     > >
    >     > > This patch addresses the problem described in
    >     > > https://bugs.openjdk.java.net/browse/JDK-8232782
    >     > >
    >     > > The implementation mimics the behavior of the recently
    > revised x86
    >     > > implementation of cmpxchg_oop with slight refinements:
    >     > >
    >     > > X86 version:
    >     > > Step 1: Try CAS
    >     > > Step 2: if CAS fails, check if original memory holds
    > equivalent
    >     > > from-
    >     > > space pointer
    >     > > Step 3: Use CAS to overwrite memory with equivalent to-space
    >     > > pointer
    >     > > Step 4: Try CAS again
    >     > > Step 5: Return boolean result to indicate success or failure
    >     > >
    >     > > AARCH64 version:
    >     > > Step 1: Try CAS
    >     > > Step 2: if CAS fails, check if original memory holds
    > equivalent
    >     > > from-
    >     > > space pointer
    >     > > Step 3 (differs): Do not overwrite memory with equivalent to-
    > space
    >     > > pointer,  Instead, run the original CAS request with from-
    > space
    >     > > pointer as the "expected" value. If this succeeds, we're
    > done.  If
    >     > > this fails, go back to step 1 and try that again.
    >     > >
    >     > > Step 5: Return boolean result to indicate success or failure
    >     > >
    >     > > This patch satisfies tier1, tier2, and hotspot_gc_shenandoah
    >     > > regression tests on Ubuntu 18.04.4 LTS (GNU/Linux 5.3.0-1023-
    > aws
    >     > > aarch64).  I have also run an "extreme" garbage collection
    > workload
    >     > > for 20 minutes without problem.
    >     > >
    >     > > Is this ok to merge?
    >     > >
    >     > >
    >     > >
    >
    >