RFR: 8232782: Shenandoah: streamline post-LRB CAS barrier (aarch64) (version 2)
Nilsen, Kelvin
kdnilsen at amazon.com
Thu Jul 2 00:35:21 UTC 2020
Thanks for the careful review. I'll make another round with it.
I wasn’t entirely sure what the preconditions for this function were so I documented my best guess. Good thing you reviewed my comments carefully because apparently I guessed wrong.
Regarding use of the weak Boolean argument, it was not clear to me why a caller would want to specify weak, but I assumed if they did request it, I should go ahead and honor it. Are you suggesting I should ignore the value of the weak argument and always behave as if the caller had requested not weak?
Also, wil you check my understanding further? I believe this service (cmpxchg_oop) is used primarily to resolve races between background GC threads and mutator threads. There is no need to need for the JVM to resolve races between multiple mutator threads. That is a programmer responsibility. The reason for cmpxchg_oop is to resolve the following two race conditions:
1. A mutator thread is overwriting a field that the GC may be overwriting in parallel. The mutator needs to use this protocol to overwrite pointers in order to prevent the following race:
i. GC fetches from-space pointer at addr
ii. GC computes equivalent to-space pointer
iii. Mutator overwrites pointer at addr with new_val
iv. GC overwrites pointer at addr with to-space replacement of original from-space value, incorrectly clobbering the mutation!
The GC thread uses cmpxchg_oop to abandon its overwrite attempt in case the value has changed since it began its effort.
2. A mutator thread fetches a pointer from memory and the LRB discovers that the pointer refers to from-space memory. After determining the to-space pointer that represents the new location of the object originally referenced by the fetched from-space pointer, the mutator attempts to heal the value held in memory by overwriting that memory with the to-space equivalent. The healing effort must use cmpxchg_oop to resolve the exact same race condition that might occur when a background GC thread is healing a from-space pointer residing in memory.
In both of the above cases, we know that GC is active and both expected and new_val are not equal to null. We also know for both of these cases, by the way, that expected refers to from-space, so maybe our implementation of cmpxchg_oop should really be starting with step 2!
Can you help me understand the other situations in which cmpxchg_oop will be called, under which expected and new_val may equal null?
Thanks.
On 7/1/20, 3:56 PM, "Roman Kennke" <rkennke at redhat.com> wrote:
CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
Ah now I remember why we want to do the CAS barrier even for weak-CAS:
while it is technically true that calling code must be prepared to deal
with occasional spurious failures, we probably don't want it to get
stuck in retry-loop until GC gets to update the field - which may be a
fairly long period of time. Not sure if we can draw any other benefit
from 'weak' here?
Another *possible* optimization (while we're at it) is to check if GC
is actually active (HAS_FORWARDED, see LRB impl) and avoid the whole
decoding of failure-witness business. This may be something to keep in
mind should we ever find out that this code affects us outside of GC
cycles.
Roman
On Thu, 2020-07-02 at 00:20 +0200, Roman Kennke wrote:
> Hi Kelvin,
>
> I had something similar in mind with x86, but Aleksey stopped me :-)
> It
> may be worth noting that we need a maximum of 3 consequtive CASes to
> cover all cases, it is not necessary to make a loop:
>
> 1. Fast-path CAS#1
> If that fails, it *may* be because of value-in-memory being a from-
> space-ref. Check that and do:
> 2. CAS#2 with expected = previous from CAS#1
> If that fails, it may be because a competing thread just wrote a to-
> space-ref of our expected object (i.e. our *original* expected
> value),
> try again:
> 3. CAS#3 with expected = previous from CAS#2 (== original expected)
> at that point the CAS cannot fail because of false negative because
> of
> to-space-invariant
>
> Correct me if I am wrong here!
>
> Ah I see you optimized for smaller code there:
> + // It is extremely rare we reach this point. For this reason, the
> + // implementation opts for smaller rather than potentially faster
> + // code. Ultimately, smaller code for this rare case most likely
> + // delivers higher overall throughput by enabling improved icache
> + // performance.
> Good then.
>
> Some comments on the patch:
>
> // It is required that addr represent a
> +// memory location at which a non-null reference value is stored and
> that
> +// expected holds a non-null reference value.
>
> Is this true? I believe the memory location may hold NULL. The
> expected
> value may hold NULL too, in which case we can skip the barrier. The
> compiler optimizes the case where it *knows statically* that expected
> == NULL to not emit the barrier. We also should have run-time checks
> for these situations that skip to 'done' on first CAS-failure with
> NULL
> for the case when compiler cannot determine statically NULL, or
> alternatively don't use the _not_null() variants of encode/decode
> methods.
>
> +// weak: This relaxes the "strong" property so that CAS is
> allowed
> +// to fail even when the expected value is present in
> memory.
> +// This is useful for load-with-lock, store-conditional
> +// loops where certain failures require retries. If weak
> is
> +// enabled, it is ok to return failure rather than
> retrying.
>
> If we are certain that the meaning of 'weak' allows us to skip the
> barrier, we can change/remove our corresponding implementation of the
> weak paths to emit a plain CAS. Right?
>
> + // Try to CAS with given arguments using LL/SC pair. If
> successful,
> + // then we are done.
>
> LL/SC pair is not (generally) true anymore with this updated patch.
>
> Other than that, I like the impl.
>
> I'd actually like to see a similar implementation in x86, I believe
> it
> should be slightly more efficient. However, I'm not sure it's really
> worth - I don't think I've ever seen this code path very hot
> anywhere.
> We'd need to convince Aleksey though - this code has undergone a
> couple
> of changes already and it's basically always caused unforeseen
> troubles
> (weird register clashes, extremely rare corner cases like the above
> mentioned 3rd CAS, etc)
>
> Thank you!
> Roman
>
>
> On Tue, 2020-06-30 at 22:39 +0000, Nilsen, Kelvin wrote:
> > Thank you for feedback from previously distributed draft
> > patch. This
> > new patch is similar to the patch distributed on June 24. However,
> > this version uses MacroAssembler::cmpxchng() instead of hard-coding
> > the use of ldxr/stxr instructions.
> >
> > See http://cr.openjdk.java.net/~kdnilsen/JDK-8232782/webrev.02/
> >
> > This patch addresses the problem described in
> > https://bugs.openjdk.java.net/browse/JDK-8232782
> >
> > The implementation mimics the behavior of the recently revised x86
> > implementation of cmpxchg_oop with slight refinements:
> >
> > X86 version:
> > Step 1: Try CAS
> > Step 2: if CAS fails, check if original memory holds equivalent
> > from-
> > space pointer
> > Step 3: Use CAS to overwrite memory with equivalent to-space
> > pointer
> > Step 4: Try CAS again
> > Step 5: Return boolean result to indicate success or failure
> >
> > AARCH64 version:
> > Step 1: Try CAS
> > Step 2: if CAS fails, check if original memory holds equivalent
> > from-
> > space pointer
> > Step 3 (differs): Do not overwrite memory with equivalent to-space
> > pointer, Instead, run the original CAS request with from-space
> > pointer as the "expected" value. If this succeeds, we're done. If
> > this fails, go back to step 1 and try that again.
> >
> > Step 5: Return boolean result to indicate success or failure
> >
> > This patch satisfies tier1, tier2, and hotspot_gc_shenandoah
> > regression tests on Ubuntu 18.04.4 LTS (GNU/Linux 5.3.0-1023-aws
> > aarch64). I have also run an "extreme" garbage collection workload
> > for 20 minutes without problem.
> >
> > Is this ok to merge?
> >
> >
> >
More information about the shenandoah-dev
mailing list