[concurrency-interest] RFR: 8065804: JEP171:Clarifications/corrections for fence intrinsics
Oleksandr Otenko
oleksandr.otenko at oracle.com
Tue Dec 9 22:58:40 UTC 2014
I see it differently. The issue is ordering: the inability of non-TSO
platforms to enforce a total order of independent stores. The first loads are
also independent, and their ordering can be neither enforced nor
detected. But the following load can detect the lack of a total order
of stores and loads, so the order is enforced through a heavyweight barrier.
But I understand now why other barriers won't work. Thank you.
Alex
On 09/12/2014 21:59, David Holmes wrote:
> In this case the issue is not ordering per se (which is what
> dependencies help with) but global visibility. After performing the
> first read, each thread must ensure that its second read will return
> what the other thread saw for the first read, hence a full dmb/sync
> between the reads; or, generalizing, a full dmb/sync after every
> volatile read.
> David
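The reader-side recipe described above can be sketched in Java 9+ with `VarHandle` fences. This is a minimal sketch, assuming that `VarHandle.fullFence()` is lowered to a full hardware barrier (sync on Power, dmb on ARM); the class `FencedReader` and its method are hypothetical, and no claim is made that the Java memory model formally forbids the IRIW outcome for opaque accesses separated by fences. Portable Java code would normally just declare the fields volatile.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

/** Hypothetical IRIW reader (T3) with a full fence between its two loads. */
public class FencedReader {
    int foo; // written by T1
    int bar; // written by T2

    static final VarHandle FOO;
    static final VarHandle BAR;

    static {
        try {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            FOO = lookup.findVarHandle(FencedReader.class, "foo", int.class);
            BAR = lookup.findVarHandle(FencedReader.class, "bar", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    /** Returns {r1, r2}: read bar, full fence, then read foo. */
    int[] readBarThenFoo() {
        int r1 = (int) BAR.getOpaque(this);
        // Stands in for the "full dmb/sync between the reads". Per the thread,
        // an address dependency or lightweight fence would only order the two
        // loads, not make the other threads' stores visible in a consistent order.
        VarHandle.fullFence();
        int r2 = (int) FOO.getOpaque(this);
        return new int[] {r1, r2};
    }
}
```

Run single-threaded after both fields are set to 1, the method simply returns {1, 1}; the fence only matters when T1 and T2 are storing concurrently.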
>
> -----Original Message-----
> *From:* Oleksandr Otenko [mailto:oleksandr.otenko at oracle.com]
> *Sent:* Wednesday, 10 December 2014 7:54 AM
> *To:* dholmes at ieee.org; Hans Boehm
> *Cc:* core-libs-dev; concurrency-interest at cs.oswego.edu
> *Subject:* Re: [concurrency-interest] RFR: 8065804:
> JEP171:Clarifications/corrections for fence intrinsics
>
> Yes, I do understand the reader needs barriers, too. I guess I was
> wondering more why the reader would need something stronger than
> what dependencies etc. could enforce. I guess I'll read what Martin
> forwarded first.
>
> Alex
>
>
> On 09/12/2014 21:37, David Holmes wrote:
>> See my earlier response to Martin. The reader has to force a
>> consistent view of memory - the writer can't as the write escapes
>> before it can issue the barrier.
>> David
>>
>> -----Original Message-----
>> *From:* concurrency-interest-bounces at cs.oswego.edu
>> [mailto:concurrency-interest-bounces at cs.oswego.edu] *On Behalf
>> Of* Oleksandr Otenko
>> *Sent:* Wednesday, 10 December 2014 6:04 AM
>> *To:* Hans Boehm; dholmes at ieee.org
>> *Cc:* core-libs-dev; concurrency-interest at cs.oswego.edu
>> *Subject:* Re: [concurrency-interest] RFR: 8065804:
>> JEP171:Clarifications/corrections for fence intrinsics
>>
>> On 26/11/2014 02:04, Hans Boehm wrote:
>>> To be concrete here, on Power, loads can normally be ordered
>>> by an address dependency or a lightweight fence (lwsync).
>>> However, neither is enough to prevent the questionable
>>> outcome for IRIW, since neither ensures that the stores in
>>> T1 and T2 will be made visible to other threads in a
>>> consistent order. That outcome can be prevented by using
>>> heavyweight fence (sync) instructions between the loads
>>> instead.
>>
>> Why would they need fences between loads instead of syncing
>> the order of stores?
>>
>>
>> Alex
>>
>>
>>> Peter Sewell's group concluded that to enforce correct
>>> volatile behavior on Power, you essentially need a
>>> heavyweight fence between every pair of volatile operations.
>>> That cannot be understood based on simple ordering
>>> constraints.
>>>
>>> As Stephan pointed out, there are similar issues on ARM, but
>>> they're less commonly encountered in a Java implementation.
>>> If you're lucky, you can get to the right implementation
>>> recipe by looking at only reordering, I think.
>>>
>>>
>>> On Tue, Nov 25, 2014 at 4:36 PM, David Holmes
>>> <davidcholmes at aapt.net.au> wrote:
>>>
>>> Stephan Diestelhorst writes:
>>> >
>>> > David Holmes wrote:
>>> > > Stephan Diestelhorst writes:
>>> > > > On Tuesday, 25 November 2014, 11:15:36, Hans Boehm wrote:
>>> > > > > I'm no hardware architect, but fundamentally it seems to me that
>>> > > > >
>>> > > > > load x
>>> > > > > acquire_fence
>>> > > > >
>>> > > > > imposes a much more stringent constraint than
>>> > > > >
>>> > > > > load_acquire x
>>> > > > >
>>> > > > > Consider the case in which the load from x is an L1 hit, but a
>>> > > > > preceding load (from say y) is a long-latency miss. If we enforce
>>> > > > > ordering by just waiting for completion of prior operations, the
>>> > > > > former has to wait for the load from y to complete, while the
>>> > > > > latter doesn't. I find it hard to believe that this doesn't leave
>>> > > > > an appreciable amount of performance on the table, at least for
>>> > > > > some interesting microarchitectures.
>>> > > >
>>> > > > I agree, Hans, that this is a reasonable assumption. Load_acquire x
>>> > > > allows "roach motel" reordering (earlier loads, such as the one
>>> > > > from y, may be performed after it), whereas the acquire fence
>>> > > > does not.
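The contrast between `load x; acquire_fence` and `load_acquire x` maps roughly onto `VarHandle.acquireFence()` versus `VarHandle.getAcquire` in Java 9+. This is a minimal sketch with hypothetical class and field names; the comments paraphrase the documented ordering guarantees and say nothing about how any particular JVM lowers them on a given CPU.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

/** Hypothetical holder contrasting the two acquire styles. */
public class AcquireStyles {
    int x; // the variable being acquired (e.g. a flag), maybe an L1 hit
    int y; // an earlier, unrelated load, maybe a long-latency miss
    int z; // a later load that must not move above the acquire

    static final VarHandle X;

    static {
        try {
            X = MethodHandles.lookup().findVarHandle(AcquireStyles.class, "x", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    /** load x; acquire_fence */
    int fenceStyle() {
        int ry = y;
        int rx = x;
        // Loads before the fence (of y AND of x) are not reordered with
        // loads and stores after it, so the read of z waits for both.
        VarHandle.acquireFence();
        int rz = z;
        return rx + ry + rz;
    }

    /** load_acquire x */
    int loadAcquireStyle() {
        int ry = y;
        // Only this access is ordered before later loads and stores; the
        // earlier read of y may still complete after it ("roach motel").
        int rx = (int) X.getAcquire(this);
        int rz = z;
        return rx + ry + rz;
    }
}
```

Single-threaded, both methods return the same value; the difference is only in which earlier loads a later access is obliged to wait for.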
>>> > > >
>>> > > > > In addition, for better or worse, fencing requirements on at least
>>> > > > > Power are actually driven as much by store atomicity issues as by
>>> > > > > the ordering issues discussed in the cookbook. This was not
>>> > > > > understood in 2005, and unfortunately doesn't seem to be amenable to
>>> > > > > the kind of straightforward explanation as in Doug's cookbook.
>>> > > >
>>> > > > Coming from a strongly ordered architecture to a weakly ordered one
>>> > > > myself, I also needed some mental adjustment about store (multi-copy)
>>> > > > atomicity. I can imagine others will be unaware of this difference,
>>> > > > too, even in 2014.
>>> > >
>>> > > Sorry, I'm missing the connection between fences and multi-copy
>>> > > atomicity.
>>> >
>>> > One example is the classic IRIW. With non-multi-copy-atomic stores, but
>>> > ordered (say through a dependency) loads, in the following example:
>>> >
>>> > Memory: foo = bar = 0
>>> > _T1_         _T2_         _T3_                              _T4_
>>> > st (foo),1   st (bar),1   ld r1, (bar)                      ld r3, (foo)
>>> >                           <addr dep / local "fence" here>   <addr dep>
>>> >                           ld r2, (foo)                      ld r4, (bar)
>>> >
>>> > You may observe r1 = 1, r2 = 0, r3 = 1, r4 = 0 on non-multi-copy-atomic
>>> > machines. On TSO boxes, this is not possible. That means that the
>>> > memory fence that will prevent such a behaviour (DMB on ARM) needs to
>>> > carry some additional oomph in ensuring multi-copy atomicity, or rather
>>> > prevent you from seeing it (which is the same thing).
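A Java rendering of the IRIW example above, as a minimal sketch assuming a JVM that conforms to the Java memory model. Because `foo` and `bar` are volatile, all four threads' accesses fall into a single synchronization order consistent with program order, so the outcome r1 = 1, r2 = 0, r3 = 1, r4 = 0 must never appear; per this thread, that is what obliges a JVM on Power or ARM to emit heavyweight fences for volatile accesses. The class and method names are hypothetical, and a zero count from a short run is evidence rather than proof (OpenJDK's jcstress harness is the usual tool for litmus tests like this).

```java
import java.util.concurrent.CountDownLatch;

/** Hypothetical IRIW harness over volatile fields. */
public class IriwLitmus {
    static volatile int foo;
    static volatile int bar;
    // Plain result fields: Thread.join() makes the readers' writes visible here.
    static int r1, r2, r3, r4;

    /** Runs IRIW repeatedly; counts the non-SC outcome r1=1, r2=0, r3=1, r4=0. */
    static int run(int iterations) {
        int nonSc = 0;
        for (int i = 0; i < iterations; i++) {
            foo = 0;
            bar = 0;
            CountDownLatch start = new CountDownLatch(1);
            Thread[] ts = {
                new Thread(() -> { awaitQuietly(start); foo = 1; }),            // T1
                new Thread(() -> { awaitQuietly(start); bar = 1; }),            // T2
                new Thread(() -> { awaitQuietly(start); r1 = bar; r2 = foo; }), // T3
                new Thread(() -> { awaitQuietly(start); r3 = foo; r4 = bar; })  // T4
            };
            for (Thread t : ts) t.start();
            start.countDown(); // release all four threads at roughly the same time
            for (Thread t : ts) joinQuietly(t);
            if (r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0) nonSc++;
        }
        return nonSc;
    }

    static void awaitQuietly(CountDownLatch latch) {
        try { latch.await(); } catch (InterruptedException e) { throw new IllegalStateException(e); }
    }

    static void joinQuietly(Thread t) {
        try { t.join(); } catch (InterruptedException e) { throw new IllegalStateException(e); }
    }

    public static void main(String[] args) {
        System.out.println("non-SC outcomes: " + run(1_000));
    }
}
```

Removing `volatile` from `foo` and `bar` turns this into a data race, and the forbidden outcome becomes legal under the JMM (though it may still be rare or unobservable on a TSO machine such as x86).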
>>>
>>> I take it as given that any code for which you may have ordering
>>> constraints must first have basic atomicity properties for loads and
>>> stores. I would not expect any kind of fence to add multi-copy atomicity
>>> where there was none.
>>>
>>> David
>>>
>>> > Stephan
>>> >
>>> > _______________________________________________
>>> > Concurrency-interest mailing list
>>> > Concurrency-interest at cs.oswego.edu
>>> > http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>>>
>>
>