[concurrency-interest] RFR: 8065804: JEP171:Clarifications/corrections for fence intrinsics
David Holmes
davidcholmes at aapt.net.au
Wed Dec 10 00:00:14 UTC 2014
The "no known useful benefit" is based on the paper which states "we are not
aware of any cases where IRIW arises as a natural programming idiom".
I think your example would be written:
Thread 1:
x = 1; storestore; y = 1;
Thread 2:
r1 = y; r2 = x;
Or more clearly, the most common pattern would be:
Thread 1:
data = 1; storestore; dataReady = true;
Thread 2:
if dataReady
r2 = data
The above does not require IRIW. Conversely, if you have IRIW, you don't
need the storestore.
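For concreteness, a rough Java sketch of that pattern using the JEP 171
Unsafe fence intrinsics (class and field names are made up, and Unsafe is
obtained via reflection purely for illustration; a reader-side loadFence is
shown as well, since on weakly ordered hardware the reader needs ordering
too):

import java.lang.reflect.Field;
import sun.misc.Unsafe;

class PublishSketch {
    static final Unsafe U;
    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            U = (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new Error(e);
        }
    }

    static int data;
    static boolean dataReady;

    // Thread 1: publish data, then set the flag.
    static void writer() {
        data = 1;
        U.storeFence();   // at least storestore: the store to data cannot
                          // be reordered after the store to dataReady
        dataReady = true;
    }

    // Thread 2: check the flag, then read data.
    static void reader() {
        if (dataReady) {
            U.loadFence();    // loadload: the read of data cannot float
                              // above the read of dataReady
            int r2 = data;    // sees 1 whenever dataReady was seen as true
        }
    }
}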
David
-----Original Message-----
From: concurrency-interest-bounces at cs.oswego.edu
[mailto:concurrency-interest-bounces at cs.oswego.edu]On Behalf Of Oleksandr
Otenko
Sent: Wednesday, 10 December 2014 8:21 AM
Cc: concurrency-interest at cs.oswego.edu; core-libs-dev
Subject: Re: [concurrency-interest] RFR: 8065804:
JEP171:Clarifications/corrections for fence intrinsics
In that case I must say I can't see why you mentioned "no known useful
benefit". The known useful benefit from ordering reads can be seen here:
store in one order:
Thread 1:
x = 1;
y = 1;
load in reverse order:
Thread 2:
r1 = y;
r2 = x;
This is a common pattern, so ordering loads is already useful. Here, even
though the JMM talks about a total order of all volatile operations, in
practice the order of loads is weaker, as long as this weakening cannot be
observed - e.g. on x86, enforcing the order of loads among themselves is an
entirely local matter.
IRIW extends the store part to many threads, thus guaranteeing total store
order for volatiles. I thought the total ordering of stores would be a more
contentious point (but I agree with the point Hans makes about easier
reasoning).
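As a purely illustrative litmus test (not anything from the webrev; class
and field names are made up): with x and y volatile, the JMM's single total
order over volatile accesses forbids the classic IRIW outcome.

class IRIWSketch {
    static volatile int x, y;
    static int r1, r2, r3, r4;

    // The outcome r1 == 1, r2 == 0, r3 == 1, r4 == 0 is forbidden: the two
    // readers may not disagree about the order in which the two stores
    // became visible.
    static void t1() { x = 1; }            // writer 1
    static void t2() { y = 1; }            // writer 2
    static void t3() { r1 = y; r2 = x; }   // reader 1: y first, then x
    static void t4() { r3 = x; r4 = y; }   // reader 2: x first, then y
}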
Alex
On 09/12/2014 21:36, David Holmes wrote:
The "thorn" is the need for the barriers in the readers not the writers.
(or perhaps as well as the writers in some cases - that is part of the
problem.)
David
-----Original Message-----
From: concurrency-interest-bounces at cs.oswego.edu
[mailto:concurrency-interest-bounces at cs.oswego.edu]On Behalf Of Oleksandr
Otenko
Sent: Wednesday, 10 December 2014 6:34 AM
To: dholmes at ieee.org; Hans Boehm
Cc: core-libs-dev; concurrency-interest at cs.oswego.edu
Subject: Re: [concurrency-interest] RFR: 8065804:
JEP171:Clarifications/corrections for fence intrinsics
Is the thorn the many allowed outcomes, or the single disallowed
outcome? (e.g. is order consistency too strict for stores with no
synchronizes-with between them?)
Alex
On 26/11/2014 02:10, David Holmes wrote:
Hi Hans,
Given IRIW is a thorn in everyone's side, has no known useful benefit, and
can hopefully be killed off in the future, let's not get bogged down in
IRIW. But none of what you say below relates to multi-copy-atomicity.
Cheers,
David
-----Original Message-----
From: hjkhboehm at gmail.com [mailto:hjkhboehm at gmail.com]On Behalf Of
Hans Boehm
Sent: Wednesday, 26 November 2014 12:04 PM
To: dholmes at ieee.org
Cc: Stephan Diestelhorst; concurrency-interest at cs.oswego.edu;
core-libs-dev
Subject: Re: [concurrency-interest] RFR: 8065804:
JEP171:Clarifications/corrections for fence intrinsics
To be concrete here, on Power, loads can normally be ordered by an
address dependency or a light-weight fence (lwsync). However, neither is
enough to prevent the questionable outcome for IRIW, since neither ensures
that the stores in T1 and T2 will be made visible to other threads in a
consistent order. That outcome can be prevented by using heavyweight fence
(sync) instructions between the loads instead. Peter Sewell's group
concluded that to enforce correct volatile behavior on Power, you
essentially need a heavyweight fence between every pair of volatile
operations. That cannot be understood based on simple ordering
constraints.
As Stephan pointed out, there are similar issues on ARM, but
they're less commonly encountered in a Java implementation. If you're
lucky, you can get to the right implementation recipe by looking at only
reordering, I think.
On Tue, Nov 25, 2014 at 4:36 PM, David Holmes
<davidcholmes at aapt.net.au> wrote:
Stephan Diestelhorst writes:
>
> David Holmes wrote:
> > Stephan Diestelhorst writes:
> > > On Tuesday, 25 November 2014 at 11:15:36, Hans Boehm wrote:
> > > > I'm no hardware architect, but fundamentally it seems to me that
> > > >
> > > > load x
> > > > acquire_fence
> > > >
> > > > imposes a much more stringent constraint than
> > > >
> > > > load_acquire x
> > > >
> > > > Consider the case in which the load from x is an L1 hit, but a
> > > > preceding load (from say y) is a long-latency miss. If we enforce
> > > > ordering by just waiting for completion of the prior operation, the
> > > > former has to wait for the load from y to complete, while the
> > > > latter doesn't. I find it hard to believe that this doesn't leave
> > > > an appreciable amount of performance on the table, at least for
> > > > some interesting microarchitectures.
> > >
> > > I agree, Hans, that this is a reasonable assumption. Load_acquire x
> > > does allow roach motel, whereas the acquire fence does not.
> > >
> > > > In addition, for better or worse, fencing requirements on at least
> > > > Power are actually driven as much by store atomicity issues, as by
> > > > the ordering issues discussed in the cookbook. This was not
> > > > understood in 2005, and unfortunately doesn't seem to be amenable to
> > > > the kind of straightforward explanation as in Doug's cookbook.
> > >
> > > Coming from a strongly ordered architecture to a weakly ordered one
> > > myself, I also needed some mental adjustment about store (multi-copy)
> > > atomicity. I can imagine others will be unaware of this difference,
> > > too, even in 2014.
> >
> > Sorry I'm missing the connection between fences and multi-copy
> > atomicity.
>
> One example is the classic IRIW. With non-multi-copy-atomic stores, but
> ordered (say through a dependency) loads in the following example:
>
> Memory: foo = bar = 0
>
>   _T1_          _T2_          _T3_                             _T4_
>   st (foo),1    st (bar),1    ld r1, (bar)                     ld r3, (foo)
>                               <addr dep / local "fence" here>  <addr dep>
>                               ld r2, (foo)                     ld r4, (bar)
>
> You may observe r1 = 1, r2 = 0, r3 = 1, r4 = 0 on non-multi-copy-atomic
> machines. On TSO boxes, this is not possible. That means that the
> memory fence that will prevent such a behaviour (DMB on ARM) needs to
> carry some additional oomph in ensuring multi-copy atomicity, or rather
> prevent you from seeing it (which is the same thing).
I take it as given that any code for which you may have ordering
constraints must first have basic atomicity properties for loads and
stores. I would not expect any kind of fence to add multi-copy-atomicity
where there was none.
David
> Stephan
>