RFR: 8232782: Shenandoah: streamline post-LRB CAS barrier (aarch64) (version 2)
Roman Kennke
rkennke at redhat.com
Thu Jul 2 08:49:45 UTC 2020
Hi Kelvin,
> Thanks for the careful review. I'll make another round with it.
>
> I wasn’t entirely sure what the preconditions for this function were
> so I documented my best guess. Good thing you reviewed my comments
> carefully because apparently I guessed wrong.
>
> Regarding use of the weak Boolean argument, it was not clear to me
> why a caller would want to specify weak, but I assumed if they did
> request it, I should go ahead and honor it. Are you suggesting I
> should ignore the value of the weak argument and always behave as if
> the caller had requested not weak?
That is a good question. Let me answer the next one first, maybe it
becomes clearer then.
> Also, wil you check my understanding further? I believe this
> service (cmpxchg_oop) is used primarily to resolve races between
> background GC threads and mutator threads. There is no need to need
> for the JVM to resolve races between multiple mutator threads. That
> is a programmer responsibility. The reason for cmpxchg_oop is to
> resolve the following two race conditions:
>
> 1. A mutator thread is overwriting a field that the GC may be
> overwriting in parallel. The mutator needs to use this protocol to
> overwrite pointers in order to prevent the following race:
> i. GC fetches from-space pointer at addr
> ii. GC computes equivalent to-space pointer
> iii. Mutator overwrites pointer at addr with new_val
> iv. GC overwrites pointer at addr with to-space replacement of
> original from-space value, incorrectly clobbering the mutation!
> The GC thread uses cmpxchg_oop to abandon its overwrite attempt
> in case the value has changed since it began its effort.
>
That is not quite right. That race is solved by the GC thread using a
simple CAS. See here:
https://hg.openjdk.java.net/jdk/jdk/file/d886e752a7b0/src/hotspot/share/gc/shenandoah/shenandoahHeap.inline.hpp#l180
> 2. A mutator thread fetches a pointer from memory and the LRB
> discovers that the pointer refers to from-space memory. After
> determining the to-space pointer that represents the new location of
> the object originally referenced by the fetched from-space pointer,
> the mutator attempts to heal the value held in memory by overwriting
> that memory with the to-space equivalent. The healing effort must
> use cmpxchg_oop to resolve the exact same race condition that might
> occur when a background GC thread is healing a from-space pointer
> residing in memory.
>
That race is also solved by a simple CAS:
https://hg.openjdk.java.net/jdk/jdk/file/d886e752a7b0/src/hotspot/share/gc/shenandoah/shenandoahBarrierSet.inline.hpp#l73
> In both of the above cases, we know that GC is active and both
> expected and new_val are not equal to null. We also know for both of
> these cases, by the way, that expected refers to from-space, so maybe
> our implementation of cmpxchg_oop should really be starting with step
> 2!
>
> Can you help me understand the other situations in which cmpxchg_oop
> will be called, under which expected and new_val may equal null?
The assembler cmpxchg_oop() routine is only used by the C1 and C2
compilers to implement the intrinsics for
Unsafe.compareAndSwapReference() and the whole family of related
methods. There is also a weakCompareAndSwapReference(), which is where
the weak variant comes into play. The problem that it is trying to
solve is the false negative that arises when the value in memory is a
from-space reference, but the expected value is a to-space reference.
In this case, a simple CAS would fail, even though it shouldn't. An
additional problem is that, when we retry the CAS with the from-space
ref, another thread (mutator or GC) may update the field to now point
to the to-space ref of the same object, in which case the 2nd CAS would
fail too, and we need to retry a 3rd time. That 3rd case is very rare.
It is entirely possible to see NULL refs there, both in-memory and also
for the expected-value. The compiler can optimize the case when it can
statically determine that the expected-value is NULL, and generate a
simple CAS then, but NULLs can still happen at runtime, and we need to
handle this, either by explicit short-cut or by dealing with it when
resolving the object (that is what we currently do).
Now, about the weak property. I am really not sure what to do.
Technically we'd be allowed to produce false negatives with weak-CAS,
but it is not intended to consistently produce false negatives
repeatedly. It basically caters for the platforms that can use LL/SC
pair which may fail sporadically and would only ever be used in a loop.
If we'd do weak-CAS as simple CAS without our protocol, that would be
legal, and should work, but it would send mutator threads in fairly
long retry-loops until another thread heals the reference. OTOH, our
protocol already does the retry-loop and turns any weak-CAS into a
strong-CAS. So we can just as well ignore (or remove) the weak
argument.
Does it clarify the situation for you?
Roman
> Thanks.
>
>
>
> On 7/1/20, 3:56 PM, "Roman Kennke" <rkennke at redhat.com> wrote:
>
> CAUTION: This email originated from outside of the organization.
> Do not click links or open attachments unless you can confirm the
> sender and know the content is safe.
>
>
>
> Ah now I remember why we want to do the CAS barrier even for
> weak-CAS:
> while it is technically true that calling code must be prepared
> to deal
> with occasional spurious failures, we probably don't want it to
> get
> stuck in retry-loop until GC gets to update the field - which may
> be a
> fairly long period of time. Not sure if we can draw any other
> benefit
> from 'weak' here?
>
> Another *possible* optimization (while we're at it) is to check
> if GC
> is actually active (HAS_FORWARDED, see LRB impl) and avoid the
> whole
> decoding of failure-witness business. This may be something to
> keep in
> mind should we ever find out that this code affects us outside of
> GC
> cycles.
>
> Roman
>
> On Thu, 2020-07-02 at 00:20 +0200, Roman Kennke wrote:
> > Hi Kelvin,
> >
> > I had something similar in mind with x86, but Aleksey stopped
> me :-)
> > It
> > may be worth noting that we need a maximum of 3 consequtive
> CASes to
> > cover all cases, it is not necessary to make a loop:
> >
> > 1. Fast-path CAS#1
> > If that fails, it *may* be because of value-in-memory being a
> from-
> > space-ref. Check that and do:
> > 2. CAS#2 with expected = previous from CAS#1
> > If that fails, it may be because a competing thread just wrote
> a to-
> > space-ref of our expected object (i.e. our *original* expected
> > value),
> > try again:
> > 3. CAS#3 with expected = previous from CAS#2 (== original
> expected)
> > at that point the CAS cannot fail because of false negative
> because
> > of
> > to-space-invariant
> >
> > Correct me if I am wrong here!
> >
> > Ah I see you optimized for smaller code there:
> > + // It is extremely rare we reach this point. For this
> reason, the
> > + // implementation opts for smaller rather than potentially
> faster
> > + // code. Ultimately, smaller code for this rare case most
> likely
> > + // delivers higher overall throughput by enabling improved
> icache
> > + // performance.
> > Good then.
> >
> > Some comments on the patch:
> >
> > // It is required that addr represent a
> > +// memory location at which a non-null reference value is
> stored and
> > that
> > +// expected holds a non-null reference value.
> >
> > Is this true? I believe the memory location may hold NULL. The
> > expected
> > value may hold NULL too, in which case we can skip the barrier.
> The
> > compiler optimizes the case where it *knows statically* that
> expected
> > == NULL to not emit the barrier. We also should have run-time
> checks
> > for these situations that skip to 'done' on first CAS-failure
> with
> > NULL
> > for the case when compiler cannot determine statically NULL, or
> > alternatively don't use the _not_null() variants of
> encode/decode
> > methods.
> >
> > +// weak: This relaxes the "strong" property so that CAS is
> > allowed
> > +// to fail even when the expected value is present
> in
> > memory.
> > +// This is useful for load-with-lock, store-
> conditional
> > +// loops where certain failures require retries. If
> weak
> > is
> > +// enabled, it is ok to return failure rather than
> > retrying.
> >
> > If we are certain that the meaning of 'weak' allows us to skip
> the
> > barrier, we can change/remove our corresponding implementation
> of the
> > weak paths to emit a plain CAS. Right?
> >
> > + // Try to CAS with given arguments using LL/SC pair. If
> > successful,
> > + // then we are done.
> >
> > LL/SC pair is not (generally) true anymore with this updated
> patch.
> >
> > Other than that, I like the impl.
> >
> > I'd actually like to see a similar implementation in x86, I
> believe
> > it
> > should be slightly more efficient. However, I'm not sure it's
> really
> > worth - I don't think I've ever seen this code path very hot
> > anywhere.
> > We'd need to convince Aleksey though - this code has undergone
> a
> > couple
> > of changes already and it's basically always caused unforeseen
> > troubles
> > (weird register clashes, extremely rare corner cases like the
> above
> > mentioned 3rd CAS, etc)
> >
> > Thank you!
> > Roman
> >
> >
> > On Tue, 2020-06-30 at 22:39 +0000, Nilsen, Kelvin wrote:
> > > Thank you for feedback from previously distributed draft
> > > patch. This
> > > new patch is similar to the patch distributed on June
> 24. However,
> > > this version uses MacroAssembler::cmpxchng() instead of hard-
> coding
> > > the use of ldxr/stxr instructions.
> > >
> > > See
> http://cr.openjdk.java.net/~kdnilsen/JDK-8232782/webrev.02/
> > >
> > > This patch addresses the problem described in
> > > https://bugs.openjdk.java.net/browse/JDK-8232782
> > >
> > > The implementation mimics the behavior of the recently
> revised x86
> > > implementation of cmpxchg_oop with slight refinements:
> > >
> > > X86 version:
> > > Step 1: Try CAS
> > > Step 2: if CAS fails, check if original memory holds
> equivalent
> > > from-
> > > space pointer
> > > Step 3: Use CAS to overwrite memory with equivalent to-space
> > > pointer
> > > Step 4: Try CAS again
> > > Step 5: Return boolean result to indicate success or failure
> > >
> > > AARCH64 version:
> > > Step 1: Try CAS
> > > Step 2: if CAS fails, check if original memory holds
> equivalent
> > > from-
> > > space pointer
> > > Step 3 (differs): Do not overwrite memory with equivalent to-
> space
> > > pointer, Instead, run the original CAS request with from-
> space
> > > pointer as the "expected" value. If this succeeds, we're
> done. If
> > > this fails, go back to step 1 and try that again.
> > >
> > > Step 5: Return boolean result to indicate success or failure
> > >
> > > This patch satisfies tier1, tier2, and hotspot_gc_shenandoah
> > > regression tests on Ubuntu 18.04.4 LTS (GNU/Linux 5.3.0-1023-
> aws
> > > aarch64). I have also run an "extreme" garbage collection
> workload
> > > for 20 minutes without problem.
> > >
> > > Is this ok to merge?
> > >
> > >
> > >
>
>
More information about the shenandoah-dev
mailing list