Atomic operations: your thoughts are welcome

Fri Feb 12 10:25:42 UTC 2021

On 12/02/2021 09:58, Kim Barrett wrote:
>> On Feb 11, 2021, at 8:33 AM, Andrew Haley <aph at redhat.com> wrote:
>>
>> On 11/02/2021 03:59, Kim Barrett wrote:
>>>
>>> We also don't presently have any cmpxchg implementation that really supports
>>> anything between conservative and relaxed, nor do we support different order
>>> constraints for the success vs failure cases. Things can be complicated
>>> enough as is; while we *could* fill some of that in, I'm not sure we should.
>>
>> OK. However, even though we don't implement any of them, we do have an
>> API that includes acq, rel, and seq_cst. The fact that we don't have
>> anything behind them is, I thought, To Be Done rather than Won't Do.
> 
> My inclination is to be pretty conservative in this area. (No pun intended.)
> I'm not eager to have a lot of reviews like that for JDK-8154736. (And in
> looking back at that, I see we ended up not addressing non-ppc platforms,
> even though there was specific concern at the time that, by not dealing
> with them (particularly arm/aarch64), we might be fobbing off some really
> hard debugging on some poor future person.)

Sure, and as you are probably aware I've had to do that, more than
once, on dusty old GC code that didn't follow the memory model.

IMVHO, there are not many places where seq_cst won't be adequate.

>> I see. I'm assuming that frequency of use is a useful proxy for impact.
>> Aleksey has already, very helpfully, measured how significant these are
>> for Shenandoah, and I suspect all concurrent GCs would benefit in a
>> similar fashion.
> 
> Absolute counts don't say much without context. So what if there are a
> million of these, if they are swamped by the 100 bazillion not-these?
> 
> Aleksey's measurements turned out to be less informative to me than they
> seemed at first reading. Many of the proposed changes involve simple
> counters or accumulators. Changing such to use relaxed atomic addition
> operations is likely an easy improvement. But even that can suffer badly
> from contention. If one is serious about reducing the cost of multi-threaded
> accumulators, much better would be something like
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p0261r4.html

I very strongly disagree. Aleksey managed to prove a substantial
gain with only a couple of hours' work. We're talking about
low-hanging fruit here.
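
For illustration, a minimal sketch of the kind of change under
discussion: a statistics counter whose increments need atomicity but
no ordering. It uses standard C++ std::atomic rather than HotSpot's
Atomic class, and the names RelaxedCounter and relaxed_counter_demo
are hypothetical, not taken from the JDK sources.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical statistics counter; not a HotSpot class.
class RelaxedCounter {
  std::atomic<uint64_t> _value{0};
public:
  // Atomic read-modify-write, but no ordering is imposed on the
  // surrounding loads and stores, so no extra barriers are required.
  void add(uint64_t n) { _value.fetch_add(n, std::memory_order_relaxed); }

  // A relaxed load is enough when the value is only reported, not
  // used to publish or consume other data.
  uint64_t get() const { return _value.load(std::memory_order_relaxed); }
};

// Single-threaded smoke test of the sketch; returns the final count.
inline uint64_t relaxed_counter_demo() {
  RelaxedCounter steals;
  for (int i = 0; i < 1000; i++) {
    steals.add(1);
  }
  assert(steals.get() == 1000);
  return steals.get();
}
```

Several threads may call add() concurrently without losing updates.
This does nothing about contention on the shared cache line, which is
the separate concern raised in the quoted text above.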

>>>> <G1ParScanThreadState::steal_and_trim_queue(GenericTaskQueueSet<OverflowTaskQueue<ScannerTask, (MEMFLAGS)5, 131072u>, (MEMFLAGS)5>*)+432>:      :: = 1617659
>>>>
>>>> This one is GenericTaskQueue::pop_global calling cmpxchg_age().
>>>> Again, do we need conservative here?
>>>
>>> This needs at least sequentially consistent semantics on the success path.
>>
>> Yep. That's easy, it's the full barrier in the failure path that
>> I'd love to eliminate.
> 
> Why does the failure path matter here?
> 
> It should be rare [*], since it only fails when either there is contention
> between a thief and the owner for the sole entry in the queue, or there is
> contention between multiple thieves.

OK, so that's useful guidance for an implementer: full barriers for CAS
failures should be wrapped in a conditional. That is a pain, because it
complexifies the code, but OK.
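
One possible reading of that guidance, sketched with standard C++
std::atomic rather than HotSpot's Atomic::cmpxchg: request seq_cst
ordering on success, relaxed ordering on failure, and issue the full
barrier inside a branch that is taken only when the CAS fails. The
helper name cmpxchg_fence_on_failure and the demo function are
hypothetical, and this is an illustration of the shape of the code,
not a proposed implementation for any particular port.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper, not HotSpot's Atomic::cmpxchg. Returns the value
// observed in dest, which equals `expected` exactly when the exchange
// took place.
inline uint32_t cmpxchg_fence_on_failure(std::atomic<uint32_t>& dest,
                                         uint32_t expected,
                                         uint32_t desired) {
  uint32_t observed = expected;
  bool ok = dest.compare_exchange_strong(observed, desired,
                                         std::memory_order_seq_cst,   // success
                                         std::memory_order_relaxed);  // failure
  if (!ok) {
    // Rare path: the full barrier is paid for only when the CAS lost.
    std::atomic_thread_fence(std::memory_order_seq_cst);
  }
  return observed;
}

// Small smoke test: one successful and one failing exchange.
// Returns the final stored value.
inline uint32_t cmpxchg_demo() {
  std::atomic<uint32_t> age{5};
  uint32_t first = cmpxchg_fence_on_failure(age, 5, 6);   // succeeds
  uint32_t second = cmpxchg_fence_on_failure(age, 5, 7);  // fails, sees 6
  assert(first == 5 && second == 6);
  return age.load();
}
```

On failure, compare_exchange_strong writes the current value into
`observed`, so the caller can tell success from failure by comparing
the return value with `expected`.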

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671