Atomic operations: your thoughts are welocme
Kim Barrett
kim.barrett at oracle.com
Fri Feb 12 09:58:37 UTC 2021
> On Feb 11, 2021, at 8:33 AM, Andrew Haley <aph at redhat.com> wrote:
>
> On 11/02/2021 03:59, Kim Barrett wrote:
>>> On Feb 8, 2021, at 1:14 PM, Andrew Haley <aph at redhat.com> wrote:
>>>
>>> I've been looking at the hottest Atomic operations in HotSpot, with a view to
>>> finding out if the default memory_order_conservative (which is very expensive
>>> on some architectures) can be weakened to something less. It's impossible to
>>> fix all of them, but perhaps we can fix some of the most frequent.
>>
>> Is there any information about the possible performance improvement from
>> such changes? 1.5-3M occurrences doesn't mean much without context.
>>
>> We don't presently have support for sequentially consistent semantics, only
>> "conservative". My recollection is that this is in part because there might
>> be code that is assuming the possibly stronger "conservative" semantics, and
>> in part because there are different and incompatible approaches to
>> implementing sequentially consistent semantics on some hardware platforms
>> and we didn't want to make assumptions there.
>>
>> We also don't presently have any cmpxchg implementation that really supports
>> anything between conservative and relaxed, nor do we support different order
>> constraints for the success vs failure cases. Things can be complicated
>> enough as is; while we *could* fill some of that in, I'm not sure we should.
>
> OK. However, even though we don't implement any of them, we do have an
> API that includes acq, rel, and seq_cst. The fact that we don't have
> anything behind them is, I thought, To Be Done rather than Won't Do.
My inclination is to be pretty conservative in this area. (No pun intended.)
I'm not eager to have a lot of reviews like that for JDK-8154736. (And in
looking back at that, I see we ended up not addressing non-ppc platforms,
even though there was specific concern at the time that by not dealing with
them (particularly arm/aarch64) that we might be fobbing off some really
hard debugging on some poor future person.)
>>> <OopOopIterateDispatch<G1CMOopClosure>::Table::oop_oop_iterate<InstanceKlass, narrowOop>(G1CMOopClosure*, oopDesc*, Klass*)+336>: :: = 3903178
>>>
>>> This is actually MarkBitMap::par_mark calling BitMap::par_set_bit. Does this
>>> need to be memory_order_conservative, or would something weaker do? Even
>>> acq_rel or seq_cst would be better.
>>
>> I think for setting bits in a bitmap the thing to do would be to identify
>> places that are safe and useful (impacts performance) to do so first. Then
>> add a weaker variant for use in those places, assuming any are found.
>
> I see. I'm assuming that frequency of use is a useful proxy for impact.
> Aleksey has already, very helpfully, measured how significant these are
> for Shenandoah, and I suspect all concurrent GCs would benefit in a
> similar fashion.
Absolute counts don't say much without context. So what if there are a
million of these, if they are swamped by the 100 bazillion not-these?
Aleksey's measurements turned out to be less informative to me than they
seemed at first reading. Many of the proposed changes involve simple
counters or accumulators. Changing such to use relaxed atomic addition
operations is likely an easy improvement. But even that can suffer badly
from contention. If one is serious about reducing the cost of multi-threaded
accumulators, much better would be something like
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p0261r4.html
>>> <G1ParScanThreadState::steal_and_trim_queue(GenericTaskQueueSet<OverflowTaskQueue<ScannerTask, (MEMFLAGS)5, 131072u>, (MEMFLAGS)5>*)+432>: :: = 1617659
>>>
>>> This one is GenericTaskQueue::pop_global calling cmpxchg_age().
>>> Again, do we need conservative here?
>>
>> This needs at least sequentially consistent semantics on the success path.
>
> Yep. That's easy, it's the full barrier in the failure path that
> I'd love to eliminate.
Why does the failure path matter here?
It should be rare [*], since it only fails when either there is contention
between a thief and the owner for the sole entry in the queue, or there is
contention between multiple thieves. The former should be rare because
non-empty queues usually contain more than one element. The latter should be
rare because of the random selection of queues the steal from. And in both
cases a losing thief will look for a new queue to steal from.
[*] The age/top (where pop_global takes from) and bottom (where push adds
and pop_local takes from) used to be adjacent members, so local operations
might induce false-sharing failures for the age/top CAS. These members were
separated in JDK 15.
More information about the hotspot-gc-dev
mailing list