Atomic operations: your thoughts are welcome

Thu Feb 11 18:29:42 UTC 2021

Hi,

I appreciate this investigation.
PPC64 has optimized implementations for the _relaxed, _acquire, _release and _acq_rel memory orders, which are substantially faster than the other memory order modes.
So we should be able to observe performance improvements when any of these are used in hot code.
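
To make the idea concrete, here is a minimal sketch of a parallel set-bit operation whose memory order is chosen by the caller. It is only an illustration under stated assumptions: it uses std::atomic rather than HotSpot's Atomic API, and SketchBitMap, par_set_bit and at are hypothetical names here, not the actual BitMap code. The failure path of the compare-exchange is relaxed, which is the kind of weakening discussed in the quoted thread.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a parallel bitmap; NOT HotSpot's BitMap class.
class SketchBitMap {
  std::vector<std::atomic<uint64_t>> _words;  // value-initialized to zero

public:
  explicit SketchBitMap(size_t bits) : _words((bits + 63) / 64) {}

  // Atomically set one bit. Returns true if this call changed it from 0 to 1.
  // 'order' applies to a successful update only; a failed compare-exchange
  // uses relaxed ordering, so no full barrier is paid on the failure path.
  bool par_set_bit(size_t bit,
                   std::memory_order order = std::memory_order_seq_cst) {
    std::atomic<uint64_t>& word = _words[bit / 64];
    const uint64_t mask = uint64_t(1) << (bit % 64);
    uint64_t old_val = word.load(std::memory_order_relaxed);
    while ((old_val & mask) == 0) {
      // On failure, old_val is refreshed with the current word contents.
      if (word.compare_exchange_weak(old_val, old_val | mask,
                                     order, std::memory_order_relaxed)) {
        return true;
      }
    }
    return false;  // some thread (possibly this one, earlier) already set it
  }

  // Read one bit with acquire ordering.
  bool at(size_t bit) const {
    const uint64_t mask = uint64_t(1) << (bit % 64);
    return (_words[bit / 64].load(std::memory_order_acquire) & mask) != 0;
  }
};
```

A caller in hot code could then pass std::memory_order_relaxed or std::memory_order_acq_rel where that is known to be safe, and keep the default elsewhere. Whether a given call site can be weakened still depends on what the surrounding code relies on.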

Best regards,
Martin

> -----Original Message-----
> From: hotspot-dev <hotspot-dev-retn at openjdk.java.net> On Behalf Of Andrew Haley
> Sent: Thursday, 11 February 2021 14:34
> To: Kim Barrett <kim.barrett at oracle.com>
> Cc: hotspot-gc-dev openjdk.java.net <hotspot-gc-dev at openjdk.java.net>;
> hotspot-dev at openjdk.java.net
> Subject: Re: Atomic operations: your thoughts are welcome
> 
> On 11/02/2021 03:59, Kim Barrett wrote:
> >> On Feb 8, 2021, at 1:14 PM, Andrew Haley <aph at redhat.com> wrote:
> >>
> >> I've been looking at the hottest Atomic operations in HotSpot, with a view to
> >> finding out if the default memory_order_conservative (which is very expensive
> >> on some architectures) can be weakened to something less. It's impossible to
> >> fix all of them, but perhaps we can fix some of the most frequent.
> >
> > Is there any information about the possible performance improvement from
> > such changes?  1.5-3M occurrences doesn't mean much without context.
> >
> > We don't presently have support for sequentially consistent semantics, only
> > "conservative". My recollection is that this is in part because there might
> > be code that is assuming the possibly stronger "conservative" semantics, and
> > in part because there are different and incompatible approaches to
> > implementing sequentially consistent semantics on some hardware platforms
> > and we didn't want to make assumptions there.
> >
> > We also don't presently have any cmpxchg implementation that really supports
> > anything between conservative and relaxed, nor do we support different order
> > constraints for the success vs failure cases. Things can be complicated
> > enough as is; while we *could* fill some of that in, I'm not sure we should.
> 
> OK. However, even though we don't implement any of them, we do have an
> API that includes acq, rel, and seq_cst. The fact that we don't have
> anything behind them is, I thought, To Be Done rather than Won't Do.
> 
> >>
> >> <OopOopIterateDispatch<G1CMOopClosure>::Table::oop_oop_iterate<InstanceKlass, narrowOop>(G1CMOopClosure*, oopDesc*, Klass*)+336>:       :: = 3903178
> >>
> >> This is actually MarkBitMap::par_mark calling BitMap::par_set_bit. Does this
> >> need to be memory_order_conservative, or would something weaker do? Even
> >> acq_rel or seq_cst would be better.
> >
> > I think for setting bits in a bitmap the thing to do would be to identify
> > places that are safe and useful (impacts performance) to do so first. Then
> > add a weaker variant for use in those places, assuming any are found.
> 
> I see. I'm assuming that frequency of use is a useful proxy for impact.
> Aleksey has already, very helpfully, measured how significant these are
> for Shenandoah, and I suspect all concurrent GCs would benefit in a
> similar fashion.
> 
> >> <Symbol::decrement_refcount()+44>:      :: = 2376632
> >> <Symbol::increment_refcount()+44>:      :: = 2003895
> >>
> >> I can't imagine that either of these actually need memory_order_conservative,
> >> they're just reference counts.
> >
> > The "usual" refcount implementation involves relaxed increment and stronger
> > ordering for decrement. (If I'm remembering correctly, dec-acquire and a
> > release fence on the zero value path before deleting. But I've not thought
> > about what one might want for this CAS-based variant that handles boundary
> > cases specially.) And as you say, whether any of these could be weakened
> > depends on whether there is any code surrounding a use that depends on the
> > stronger ordering semantics. At a guess I suspect increment could be changed
> > to relaxed, but I've not looked. This one is probably a question for runtime
> > folks.
> 
> OK, this makes sense. I'm thinking of the long road to getting this stuff
> documented so that we can see what side effects of atomic operations are
> actually required.
> 
> >> <OtherRegionsTable::add_reference(void*, unsigned int)+248>:    :: = 1719614
> >>
> >> BitMap::par_set_bit again.
> >>
> >>
> >> <G1ParScanThreadState::steal_and_trim_queue(GenericTaskQueueSet<OverflowTaskQueue<ScannerTask, (MEMFLAGS)5, 131072u>, (MEMFLAGS)5>*)+432>:      :: = 1617659
> >>
> >> This one is GenericTaskQueue::pop_global calling cmpxchg_age().
> >> Again, do we need conservative here?
> >
> > This needs at least sequentially consistent semantics on the success path.
> 
> Yep. That's easy, it's the full barrier in the failure path that
> I'd love to eliminate.
> 
> > See the referenced paper by Le, et al.
> >
> > There is also a cmpxchg_age in pop_local_slow.  The Le, et al paper doesn't
> > deal with that path.  But it's also not in your list, which is good since
> > this is supposed to be infrequently taken.
> 
> Right. I'm trying to concentrate on the low-hanging fruit.
> 
> Thank you for the very detailed and informative reply.
> 
> --
> Andrew Haley  (he/him)
> Java Platform Lead Engineer
> Red Hat UK Ltd. <https://www.redhat.com>
> https://keybase.io/andrewhaley
> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671