[jmm-dev] jdk9 APIs [Fences specifically]

Thu Aug 13 21:04:48 UTC 2015

On Thu, Aug 13, 2015 at 5:19 AM, Doug Lea <dl at cs.oswego.edu> wrote:
>
> On 08/12/2015 06:33 PM, Hans Boehm wrote:
>>
>>
>> Let me argue once more against LoadLoad, and at least dampen the
>> enthusiasm for StoreStore.
>
>
> Thanks for the critiques! (Even though I remain unconvinced.)
>
> I should have noted that ARM mappings are only part of the motivation
> for loadLoadFence and storeStoreFence.  Another is protection against
> loop "optimizations" that are highly non-optimal.  This is not
> strictly a compiler issue, but easier to illustrate as one.  Suppose
> for example you have a method that writes several variables, along
> with reader methods that can handle all ordering races among the
> writes. But you still want to ensure that the variables are actually
> written if the method is called in a loop. A trailing
> storeStoreFence() seems to be the cheapest and conceptually most
> appropriate way to reduce communication latency.  (In other words, it
> is "correct" but undesirable for method c() here to only use the
> final (x, y) values.)  Symmetrical arguments apply to using
> leading loadLoadFences on the complementary reader methods
> (that is otherwise similar to RCU-like constructions).
>
> class C {
>    int x = 0, y = 0; // relaxed
>
>    void p() {      // called in producer thread
>       for (int i = 0; i < 1000000; ++i)
>         writes(heavyPureComputation(i));
>    }
>
>    void c() {      // called in consumer thread
>      for (;;) {
>        if (occasionally)
>          reads();
>        // ...
>      }
>    }
>
>    void writes(int k) {
>       x = k;
>       y = k + 17;
>       storeStoreFence(); // please actually store x and y if in a loop
>    }
>
>    void reads() {
>      loadLoadFence();   // please actually load x and y if in a loop
>      if (y == x + 17)
>        something();
>    }
> }
>
> This is not a hypothetical example. It's abstracted from cases I've
> encountered. Like the RCU-like examples mentioned yesterday, these effects
> arise only when you are writing racy performance-critical code.  But
> that's what low-level concurrent algorithm and data structure
> designers do!
No disagreement about the existence of this problem.  There was a recent
long discussion of this on a C++ mailing list.  There is not yet agreement
there about the correct solution.  But we didn't have any advocates for this
approach.

I think this is fundamentally a completely different problem that has
nothing to do with restricting order to either only loads or only stores.
You are trying to instead dissuade the compiler from drastic code movement
in certain cases.

I don't think a fence-based approach works.  Deferring all the stores to
the end of the loop fundamentally remains correct, even with the StoreStore
fence, since it's consistent with the producer just running very fast for a
while.  The constraint you're trying to enforce has nothing to do with
ordering.
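
To make that concrete, here is a minimal sketch, assuming the fence is
spelled as the static VarHandle.storeStoreFence() in java.lang.invoke.  The
class and method names are hypothetical, and heavyPureComputation(i) is
simplified to i.  The second method is one form the loop could be reduced
to; the claim is only that a fence, which constrains order and not timing,
does not obviously forbid it.

```java
import java.lang.invoke.VarHandle;

// Hypothetical illustration; none of these names come from the example above.
final class FenceDemo {
    int x = 0, y = 0; // plain (relaxed) fields

    // What the writer wants: every intermediate (x, y) pair is stored.
    void writesEachIteration(int n) {
        for (int i = 0; i < n; ++i) {
            x = i;
            y = i + 17;
            VarHandle.storeStoreFence(); // orders prior stores before later stores
        }
    }

    // A reduced form that stores only the final pair.  A consumer that sees
    // only these values cannot distinguish this from a very fast producer.
    void writesFinalOnly(int n) {
        if (n > 0) {
            x = n - 1;
            y = (n - 1) + 17;
            VarHandle.storeStoreFence();
        }
    }
}
```

Both methods leave the object in the same final state, and each state a
consumer can observe from the second is also a state it could have observed
from the first.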

Aside from not working correctly, you end up slowing down ARM code in ways
that are entirely unnecessary, by inserting "dmb ishld" or "dmb ishst"
fences everywhere.  (How expensive they are varies.  On a number of
implementations they basically seem to be full fences.)

My personal favorite solution to this problem is to add an annotation for
fields that are used as relaxed atomics, and to agree that high quality
compilers should basically leave those alone.  Optimizing those using
conventional rules for sequential performance may lead to disastrous
performance for the whole multithreaded system.  If you don't understand it,
leave it alone.

Peter Dimov pointed out that there are cases, e.g. consecutive C++
reference count updates, where you probably do want the compiler to
aggressively optimize in spite of concurrent access.  You may need a second
annotation for those.
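
A rough sketch of what "fields used as relaxed atomics" might look like in
Java.  No such field annotation exists; as an assumption, this stands in
for it with the per-access opaque mode on AtomicInteger (getOpaque and
setOpaque, as shipped in JDK 9).  The class name RelaxedPair is
hypothetical.  Whether opaque accesses give exactly the "leave those alone"
treatment is itself part of the assumption.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical rewrite of the x/y pair with explicitly marked relaxed accesses.
final class RelaxedPair {
    private final AtomicInteger x = new AtomicInteger();
    private final AtomicInteger y = new AtomicInteger();

    // Each store is explicitly marked as a racy, relaxed access.
    // Opaque mode gives no ordering guarantee between x and y.
    void writes(int k) {
        x.setOpaque(k);
        y.setOpaque(k + 17);
    }

    // Loads y first, then x, matching the evaluation order of "y == x + 17".
    // May return false under concurrent writes; readers must tolerate that.
    boolean reads() {
        int yy = y.getOpaque();
        int xx = x.getOpaque();
        return yy == xx + 17;
    }
}
```

No fence appears anywhere, so nothing here asks for "dmb ishst" or
"dmb ishld"; the accesses are simply marked as ones to be left alone.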

>
> Back to ..
>
>>
>> I know of no hardware instructions, except on SPARC, that correspond
>> to a LoadLoad fence.  And my impression is that it's not very useful on
>> SPARC.  The ARM DMB xLD fence instruction, if I understand correctly,
>> is essentially a C++ acquire fence.
>
>
> But I think that pseudo-fences (load; compare to self; ...) need not be?
Those are fundamentally LoadStore fences.  On Power you can also turn
them into a LoadLoad fence by adding an isync.  I think the ARM situation
is essentially identical.

>
>>
>> However, I think it difficult to specify correctly outside of that specific
>> essentially final-field-initialization scenario.
>
>
> It doesn't seem hard at all to specify in isolation.
> The interactions with base ordering rules can be non-obvious though.
> (Especially since, in the absence of a revised base model,
> those rules might as well say that anything goes.)
> So, like any fence method, it should be used when nothing
> simpler applies. And surely not in:
>
>
>>
>> x++; // Increment zero initialized field
>> storeStoreFence();
>> x_init = true;
My problem is that this looks a lot like a constructor fence, or maybe the
writer side of a seqlock, which are the only use cases I know of for
StoreStore fences.

And the harder I think about constructor fences, the more nervous I get
about using StoreStore fences there without fully understanding the
transitivity issues.
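
For reference, the constructor-fence shape in question looks roughly like
the sketch below, again assuming the static VarHandle.storeStoreFence()
spelling.  Box, shared, publish and readOrDefault are hypothetical names.
This shows the pattern only; whether the fence is sufficient once the
reference travels through further threads is exactly the open question.

```java
import java.lang.invoke.VarHandle;

// Hypothetical names throughout; a sketch of the pattern, not a recommendation.
final class Box {
    int value;          // deliberately not final
    static Box shared;  // published through a data race

    Box(int v) {
        value = v;
        // Intended to keep the initializing store ahead of any later store
        // that publishes the reference to this object.
        VarHandle.storeStoreFence();
    }

    static void publish(int v) {
        shared = new Box(v); // racy publication, no volatile or lock
    }

    static int readOrDefault(int dflt) {
        Box b = shared;      // racy read of the reference
        return (b == null) ? dflt : b.value;
    }
}
```

Replace the single store in the constructor with the x++ above and it is
hard to tell the two cases apart by inspection.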

Hans

>
>
> -Doug
>