Atomic operations: your thoughts are welocme

Kim Barrett kim.barrett at oracle.com
Thu Feb 11 03:59:27 UTC 2021


> On Feb 8, 2021, at 1:14 PM, Andrew Haley <aph at redhat.com> wrote:
> 
> I've been looking at the hottest Atomic operations in HotSpot, with a view to
> finding out if the default memory_order_conservative (which is very expensive
> on some architectures) can be weakened to something less. It's impossible to
> fix all of them, but perhaps we can fix some of the most frequent.

Is there any information about the possible performance improvement from
such changes?  1.5-3M occurrences doesn't mean much without context.

We don't presently have support for sequentially consistent semantics, only
"conservative". My recollection is that this is in part because there might
be code that is assuming the possibly stronger "conservative" semantics, and
in part because there are different and incompatible approaches to
implementing sequentially consistent semantics on some hardware platforms
and we didn't want to make assumptions there.

We also don't presently have any cmpxchg implementation that really supports
anything between conservative and relaxed, nor do we support different order
constraints for the success vs failure cases. Things can be complicated
enough as is; while we *could* fill some of that in, I'm not sure we should.

> These are the hottest compare-and-swap uses in HotSpot, with the count
> at the end of each line.
> 
> <G1ParScanThreadState::trim_queue_to_threshold(unsigned int)+7284>:     :: = 16406757
> 
> This one is already memory_order_relaxed, so no problem.

Right.  Although I’m now wondering why this doesn’t need to do anything on the
failure side, similar to what is needed in the similar place in ParallelGC when that
was changed to use a relaxed cmpxchg.

> <OopOopIterateDispatch<G1CMOopClosure>::Table::oop_oop_iterate<InstanceKlass, narrowOop>(G1CMOopClosure*, oopDesc*, Klass*)+336>:       :: = 3903178
> 
> This is actually MarkBitMap::par_mark calling BitMap::par_set_bit. Does this
> need to be memory_order_conservative, or would something weaker do? Even
> acq_rel or seq_cst would be better.

I think for setting bits in a bitmap the thing to do would be to identify
places that are safe and useful (impacts performance) to do so first. Then
add a weaker variant for use in those places, assuming any are found.

> <Symbol::decrement_refcount()+44>:      :: = 2376632
> <Symbol::increment_refcount()+44>:      :: = 2003895
> 
> I can't imagine that either of these actually need memory_order_conservative,
> they're just reference counts.

The "usual" refcount implementation involves relaxed increment and stronger
ordering for decrement. (If I'm remembering correctly, dec-acquire and a
release fence on the zero value path before deleting. But I've not thought
about what one might want for this CAS-based variant that handles boundary
cases specially.) And as you say, whether any of these could be weakened
depends on whether there is any code surrounding a use that depends on the
stronger ordering semantics. At a guess I suspect increment could be changed
to relaxed, but I've not looked. This one is probably a question for runtime
folks.

> <OtherRegionsTable::add_reference(void*, unsigned int)+248>:    :: = 1719614
> 
> BitMap::par_set_bit again.
> 
> <G1ParScanThreadState::steal_and_trim_queue(GenericTaskQueueSet<OverflowTaskQueue<ScannerTask, (MEMFLAGS)5, 131072u>, (MEMFLAGS)5>*)+432>:      :: = 1617659
> 
> This one is GenericTaskQueue::pop_global calling cmpxchg_age().
> Again, do we need conservative here?

This needs at least sequentially consistent semantics on the success path.
See the referenced paper by Le, et al.

There is also a cmpxchg_age in pop_local_slow.  The Le, et al paper doesn't
deal with that path.  But it's also not in your list, which is good since
this is supposed to be infrequently taken.




More information about the hotspot-gc-dev mailing list