RFR: 8232782: Shenandoah: streamline post-LRB CAS barrier (aarch64) (version 2)

Wed Jul 1 22:55:27 UTC 2020

Ah now I remember why we want to do the CAS barrier even for weak-CAS:
while it is technically true that calling code must be prepared to deal
with occasional spurious failures, we probably don't want it to get
stuck in retry-loop until GC gets to update the field - which may be a
fairly long period of time. Not sure if we can draw any other benefit
from 'weak' here?

Another *possible* optimization (while we're at it) is to check if GC
is actually active (HAS_FORWARDED, see LRB impl) and avoid the whole
decoding of failure-witness business. This may be something to keep in
mind should we ever find out that this code affects us outside of GC
cycles.

Roman

On Thu, 2020-07-02 at 00:20 +0200, Roman Kennke wrote:
> Hi Kelvin,
> 
> I had something similar in mind with x86, but Aleksey stopped me :-)
> It
> may be worth noting that we need a maximum of 3 consequtive CASes to
> cover all cases, it is not necessary to make a loop:
> 
> 1. Fast-path CAS#1
> If that fails, it *may* be because of value-in-memory being a from-
> space-ref. Check that and do:
> 2. CAS#2 with expected = previous from CAS#1
> If that fails, it may be because a competing thread just wrote a to-
> space-ref of our expected object (i.e. our *original* expected
> value),
> try again:
> 3. CAS#3 with expected = previous from CAS#2 (== original expected)
> at that point the CAS cannot fail because of false negative because
> of
> to-space-invariant
> 
> Correct me if I am wrong here!
> 
> Ah I see you optimized for smaller code there:
> +  // It is extremely rare we reach this point.  For this reason, the
> +  // implementation opts for smaller rather than potentially faster
> +  // code.  Ultimately, smaller code for this rare case most likely
> +  // delivers higher overall throughput by enabling improved icache
> +  // performance.
> Good then.
> 
> Some comments on the patch:
> 
> // It is required that addr represent a
> +// memory location at which a non-null reference value is stored and
> that
> +// expected holds a non-null reference value.
> 
> Is this true? I believe the memory location may hold NULL. The
> expected
> value may hold NULL too, in which case we can skip the barrier. The
> compiler optimizes the case where it *knows statically* that expected
> == NULL to not emit the barrier. We also should have run-time checks
> for these situations that skip to 'done' on first CAS-failure with
> NULL
> for the case when compiler cannot determine statically NULL, or
> alternatively don't use the _not_null() variants of encode/decode
> methods.
> 
> +//  weak:    This relaxes the "strong" property so that CAS is
> allowed
> +//           to fail even when the expected value is present in
> memory.
> +//           This is useful for load-with-lock, store-conditional
> +//           loops where certain failures require retries.  If weak
> is
> +//           enabled, it is ok to return failure rather than
> retrying.
> 
> If we are certain that the meaning of 'weak' allows us to skip the
> barrier, we can change/remove our corresponding implementation of the
> weak paths to emit a plain CAS. Right?
> 
> +  // Try to CAS with given arguments using LL/SC pair.  If
> successful,
> +  // then we are done.
> 
> LL/SC pair is not (generally) true anymore with this updated patch.
> 
> Other than that, I like the impl.
> 
> I'd actually like to see a similar implementation in x86, I believe
> it
> should be slightly more efficient. However, I'm not sure it's really
> worth - I don't think I've ever seen this code path very hot
> anywhere.
> We'd need to convince Aleksey though - this code has undergone a
> couple
> of changes already and it's basically always caused unforeseen
> troubles
> (weird register clashes, extremely rare corner cases like the above
> mentioned 3rd CAS, etc)
> 
> Thank you!
> Roman
> 
> 
> On Tue, 2020-06-30 at 22:39 +0000, Nilsen, Kelvin wrote:
> > Thank you for feedback from previously distributed draft
> > patch.  This
> > new patch is similar to the patch distributed on June 24.  However,
> > this version uses MacroAssembler::cmpxchng() instead of hard-coding
> > the use of ldxr/stxr instructions.
> > 
> > See http://cr.openjdk.java.net/~kdnilsen/JDK-8232782/webrev.02/
> > 
> > This patch addresses the problem described in 
> > https://bugs.openjdk.java.net/browse/JDK-8232782
> > 
> > The implementation mimics the behavior of the recently revised x86
> > implementation of cmpxchg_oop with slight refinements:
> > 
> > X86 version:
> > Step 1: Try CAS
> > Step 2: if CAS fails, check if original memory holds equivalent
> > from-
> > space pointer
> > Step 3: Use CAS to overwrite memory with equivalent to-space
> > pointer
> > Step 4: Try CAS again
> > Step 5: Return boolean result to indicate success or failure
> > 
> > AARCH64 version:
> > Step 1: Try CAS
> > Step 2: if CAS fails, check if original memory holds equivalent
> > from-
> > space pointer
> > Step 3 (differs): Do not overwrite memory with equivalent to-space
> > pointer,  Instead, run the original CAS request with from-space
> > pointer as the "expected" value. If this succeeds, we're done.  If
> > this fails, go back to step 1 and try that again.
> > 
> > Step 5: Return boolean result to indicate success or failure
> > 
> > This patch satisfies tier1, tier2, and hotspot_gc_shenandoah
> > regression tests on Ubuntu 18.04.4 LTS (GNU/Linux 5.3.0-1023-aws
> > aarch64).  I have also run an "extreme" garbage collection workload
> > for 20 minutes without problem.
> > 
> > Is this ok to merge?
> > 
> > 
> >