RFR: 7143664: Clean up OrderAccess implementations and usage

Wed Jan 28 22:48:44 UTC 2015

Updated webrev:

http://cr.openjdk.java.net/~dholmes/7143664/webrev.v2/

Thanks,
David

On 28/01/2015 4:09 PM, David Holmes wrote:
> Hi Erik,
>
> Sorry for the delay, it was a long weekend here and I'm playing catch
> up. :)
>
> On 24/01/2015 9:03 AM, Erik Österlund wrote:
>> Hi Dean,
>>
>> On 23/01/15 22:50, Dean Long wrote:
>>> On 1/23/2015 7:16 AM, Erik Österlund wrote:
>>>> Hi Dean and David,
>>>>
>>>> Thanks for reviewing this change.
>>>>
>>>> On 23/01/15 08:12, David Holmes wrote:
>>>>> On 23/01/2015 4:36 PM, Dean Long wrote:
>>>>>>       147 // Ordering a load relative to preceding stores requires
>>>>>> a fence,
>>>>>>       148 // which implies a membar #StoreLoad between the store
>>>>>> and load under
>>>>>>       149 // sparc-TSO.  A fence is required by x86.  On x86, we
>>>>>> use explicitly
>>>>>>       150 // locked add.
>>>>>>       151 //
>>>>>>
>>>>>> It sounds like the above is saying that fence is the same as
>>>>>> StoreLoad ...
>>>>> That may be a s/store_fence/fence/ typo. The original text was:
>>>>>
>>>>> // Ordering a load relative to preceding stores requires a
>>>>> store_fence,
>>>>> // which implies a membar #StoreLoad between the store and load under
>>>>> // sparc-TSO.  A fence is required by ia64.  On x86, we use locked
>>>>> xchg.
>>>>>
>>>>> Actually seems like a couple of typos there: ia64 became x86, then we
>>>>> have x86 again.
>>>> I tried to remove ia64 from the comments as it is not in the
>>>> repository,
>>>> which in this case became a bit redundant. Are we okay if I remove the
>>>> "A fence is required by x86" sentence?
>>>>
>>>> Oh and Dean about membar #StoreLoad on SPARC and locked add on x86...
>>>> are you saying they are not equivalent?
>>>
>>> I wasn't trying to say anything abou SPARC or x86, only that "fence" !=
>>> "StoreLoad".
>>
>> Agreed.
>>
>>>> New version of that paragraph with that sentence removed:
>>>>
>>>> 147 // Ordering a load relative to preceding stores requires a fence,
>>>> 148 // which implies a membar #StoreLoad between the store and load
>>>> under
>>>> 149 // sparc-TSO. On x86, we use explicitly locked add.
>>>
>>> I would replace "fence" with "StoreLoad" above.
>>
>> Again, I just changed any store_fence descriptions to use fence instead.
>> But you are certainly right, it is more accurate to talk about StoreLoad
>> here and not mention fence in the first place, I'll change this, thanks
>> for the feedback. :)
>>
>> Changed to:
>>
>> 147 // Ordering a load relative to preceding stores requires a StoreLoad,
>> 148 // which implies a membar #StoreLoad between the store and load under
>> 149 // sparc-TSO. On x86, we use explicitly locked add.
>
> The confusion here is what "store-fence" was intended to mean. In this
> part we're implying it is a storeLoad, but in the discussion below it
> indicates store-fence orders subsequent loads and stores - which makes
> it a storeStore|storeLoad barrier. Of course storeLoad subsumes
> storeStore so the above is accurate - though I think the original text
> would have been more clearly stated as "Ordering a load relative to
> preceding stores _can be achieved using a_ store-fence, ..."
>
> So in summary: update above is okay.
>
>>>>>>       152 // 4. store, load  <= is constrained by => store, fence,
>>>>>> load
>>>>>>       153 //
>>>>>>       154 // Use store, fence to make sure all stores done in an
>>>>>> 'interesting'
>>>>>>       155 // region are made visible prior to both subsequent
>>>>>> loads and stores.
>>>>>>
>>>>>> ... and this is like saying to use fence because StoreStore |
>>>>>> StoreLoad
>>>>>> isn't available.
>>>>> Again this seems an issue with store_fence being changed to "store,
>>>>> fence"  which doesn't really make sense.
>>>> The thing is that I removed store_fence (not store_release as it
>>>> says in
>>>> the bug ID, sorry about that - there obviously is no store_release). So
>>>> I wanted to remove references in the comments to store_fence as well.
>>>> Therefore I changed the example to use store fence load instead of
>>>> store_fence load. Perhaps it's better if I remove that sample all
>>>> together instead?
>>>
>>> That works for me.
>>
>> Fixed.
>
> Removal is an option but again replacing store-fence with what it really
> means "store; membar<storeStore|StoreLoad>" would also suffice. But if
> we don't need the combined  membar<storeStore|StoreLoad> such that
> abstracting it into something like store_fence() makes sense, then
> getting rid of everything pertaining to store-fence makes sense. I see
> there are no uses of store_fence() other than a commented out reference
> in ./cpu/x86/vm/c1_LIRAssembler_x86.cpp, which also contains a similar
> commented out reference to the non-existent load_fence() - both of which
> should be removed.
>
> If you'd like to send me an updated patch I will generate a new webrev.
> Then I can get on with the functional part of the review.
>
> Thanks,
> David
>
>> Thanks,
>> /Erik
>>
>>> dl
>>>
>>>> Thanks for reviewing my changes!
>>>>
>>>> /Erik
>>>>
>>>>> David
>>>>>
>>>>>>> Furthermore, store_release was declared private and marked as
>>>>>> deprecated.
>>>>>>
>>>>>> I can't find where this was done.
>>>>>>
>>>>>> dl
>>>>>>
>>>>>> On 1/22/2015 5:19 PM, Erik Österlund wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> == Summary of Changes ==
>>>>>>>
>>>>>>> This changeset fixes issues in OrderAccess on multiple levels
>>>>>>> from the
>>>>>>> memory model semantics to compiler reorderings, to addressing
>>>>>>> maintainability/portability issues which (almost) had to be fixed in
>>>>>>> order to fix the correctness issues. It is the result of discussions
>>>>>>> found in the previous "OrderAccess Refactoring" thread:
>>>>>>> http://openjdk.5641.n7.nabble.com/OrderAccess-Refactoring-td212050.html
>>>>>>>
>>>>>>>
>>>>>>> Bug ID:
>>>>>>> https://bugs.openjdk.java.net/browse/JDK-7143664
>>>>>>> (updated to reflect these related changes)
>>>>>>>
>>>>>>> Webrev:
>>>>>>> http://cr.openjdk.java.net/~dholmes/7143664/webrev/
>>>>>>>
>>>>>>> Before I describe more I would like to give special thanks to David
>>>>>>> Holmes for long discussions leading up to the currently proposed
>>>>>>> changes. I would also like to thank Jesper Wilhelmsson for
>>>>>>> helping me
>>>>>>> run my changes through JPRT.
>>>>>>>
>>>>>>> == Motivation ==
>>>>>>>
>>>>>>> This change directly fixes a reported OrderAccess bug due to
>>>>>>> compiler
>>>>>>> reorderings which is still a vulnerability on almost all TSO
>>>>>>> platforms:
>>>>>>> https://bugs.openjdk.java.net/browse/JDK-806196
>>>>>>>
>>>>>>> And directly fixes confusions like release_store() != release()
>>>>>>> store()
>>>>>>> due to memory model issues previously described in this bug ID.
>>>>>>>
>>>>>>> At the same time it provides clearer design with separation of
>>>>>>> concerns
>>>>>>> and generalization/specialization, removing a whole bunch of
>>>>>>> platform
>>>>>>> specific code which could be generalized. The platform specific
>>>>>>> files
>>>>>>> now only have a few LoC requirements (~7) to conform to the
>>>>>>> memory model
>>>>>>> by specifying what the stand alone barriers do. Then optionally
>>>>>>> optimizations to the general approach are possible if platforms
>>>>>>> support
>>>>>>> it. This also makes it much easier to port to new platforms.
>>>>>>>
>>>>>>> == Memory Model ==
>>>>>>>
>>>>>>> The current definitions of acquire/release semantics are a bit fishy
>>>>>>> leading to problems previously described in the bug ID
>>>>>>> (release_store()
>>>>>>> != release() store()) and some other correctness issues. It has
>>>>>>> therefore been replaced with a new model. I would like to thank
>>>>>>> David
>>>>>>> Holmes for the long discussions leading up to the newly proposed
>>>>>>> model.
>>>>>>>
>>>>>>> The new model is formally defined like this:
>>>>>>>
>>>>>>> // T1: access_shared_data
>>>>>>> // T1: ]release
>>>>>>> // T1: (...)
>>>>>>> // T1: store(X)
>>>>>>> //
>>>>>>> // T2: load(X)
>>>>>>> // T2: (...)
>>>>>>> // T2: acquire[
>>>>>>> // T2: access_shared_data
>>>>>>> //
>>>>>>> // It is guaranteed that if T2: load(X) synchronizes with
>>>>>>> (observes the
>>>>>>> // value written by) T1: store(X), then the memory accesses
>>>>>>> before the
>>>>>>> // T1: ]release happen before the memory accesses after the T2:
>>>>>>> acquire[.
>>>>>>>
>>>>>>> The orderAccess.hpp file and bug ID also has a few additional
>>>>>>> explanations making it more intuitive to the user when to use
>>>>>>> acquire/release and the resemblance to TSO abstract machines. Big
>>>>>>> thanks
>>>>>>> goes to David Holmes for discussing the memory model with me, which
>>>>>>> helped a lot in deriving it.
>>>>>>>
>>>>>>> Now it holds that release() store() == release_store(), and the
>>>>>>> other
>>>>>>> correctness issues are fixed as well.
>>>>>>>
>>>>>>> The new model is also closer to C++11 definitions which could
>>>>>>> give us
>>>>>>> more relaxed compiler reordering constraints in the future when
>>>>>>> compiler
>>>>>>> support for C++11 is there and ready.
>>>>>>>
>>>>>>> == Reliance on C++ Volatile Semantics ==
>>>>>>>
>>>>>>> The C++ standard section 1.9 "Program Execution" is very vague about
>>>>>>> what the keyword volatile can actually do for us. It gives clear
>>>>>>> constraints in terms of volatile-volatile accesses but says
>>>>>>> little about
>>>>>>> nonvolatile-volatile accesses. Yet the current implementation
>>>>>>> heavily
>>>>>>> relies upon volatile to in terms of compiler reordering. But GCC
>>>>>>> explicitly declares that volatiles and non-volatiles may reorder
>>>>>>> freely
>>>>>>> ( https://gcc.gnu.org/onlinedocs/gcc/Volatiles.html ). The only
>>>>>>> compiler
>>>>>>> known to explicitly provide the wanted semantics with volatile is
>>>>>>> MSVC
>>>>>>> 2010 for windows (
>>>>>>> https://msdn.microsoft.com/en-us/library/12a04hfd(v=vs.100).aspx ).
>>>>>>> Compilers not giving explicit guarantees, must be considered
>>>>>>> unsafe and
>>>>>>> revert to conservative behaviour.
>>>>>>>
>>>>>>> This was brought to attention after causing bugs, but was only
>>>>>>> fixed for
>>>>>>> x86 linux. This is a fundamental issue inherent to all TSO platforms
>>>>>>> except windows, and has to be fixed on all of them.
>>>>>>>
>>>>>>> Several barriers are unsafe to use because they lack compiler
>>>>>>> reordering
>>>>>>> constraints (e.g. fence and acquire on linux_SPARC). For TSO
>>>>>>> platforms
>>>>>>> they are typically implemented using dummy loads and stores. This
>>>>>>> seems
>>>>>>> to be another old volatile reliance that I fixed. These barriers
>>>>>>> sometimes have omitted compiler barriers (which is what we really
>>>>>>> want).
>>>>>>> This seems to be another example on incorrect reliance on the
>>>>>>> volatile
>>>>>>> semantics to help us. Therefore dummy loads/stores have been
>>>>>>> replaced
>>>>>>> with compiler barriers on TSO platforms.
>>>>>>>
>>>>>>> It is also worth noting that compilers like sun studio did
>>>>>>> previously
>>>>>>> not support inline asm syntax. Therefore, barriers were
>>>>>>> implemented in
>>>>>>> .il-files. However, using them does not give explicit compiler
>>>>>>> constraints for reordering AFAIK. Therefore, they have been
>>>>>>> reimplemented using inline asm with explicit compiler reordering
>>>>>>> constraints, as even sun (solaris?) studio now supports this.
>>>>>>>
>>>>>>> The windows variants have added a windows-style _ReadWriteBarrier()
>>>>>>> compiler barrier similarly.
>>>>>>>
>>>>>>> == Strange Hardware Reorderings ==
>>>>>>>
>>>>>>> Fixed a weird inconsistency where acquire, loadstore and loadlaod
>>>>>>> would
>>>>>>> use isync instead of lwsync for PPC on linux_zero, but not in any
>>>>>>> other
>>>>>>> PPC platform in the repo. I assumed this is wrong and changed it to
>>>>>>> lwsync instead.
>>>>>>>
>>>>>>> == Code Redundancy and Refactoring ==
>>>>>>>
>>>>>>> The OrderAccess code looks like it has been developed over a long
>>>>>>> period
>>>>>>> of time, with small incremental changes. This seems to have led
>>>>>>> to a lot
>>>>>>> of code duplication over time. For example, store_fence variants
>>>>>>> are not
>>>>>>> referenced from anywhere, yet contribute to a lot of the code
>>>>>>> base and a
>>>>>>> lot of awkwardness (such as being the one only exception not using
>>>>>>> volatiles for some reason). Moreover, store_fence is not used
>>>>>>> anywhere
>>>>>>> in hotspot because it is not a good fit for either the
>>>>>>> acquire/release
>>>>>>> semantics or the Java volatile semantics, leaving a big question
>>>>>>> mark on
>>>>>>> when it should ever be used. I took the liberty of removing it.
>>>>>>>
>>>>>>> Another redundancy issue is that most of the semantics is exactly
>>>>>>> the
>>>>>>> same for all platforms, yet all that default boilerplate such as
>>>>>>> how to
>>>>>>> make atomic accesses, where acquire/release is supposed to be placed
>>>>>>> w.r.t. the memory access, what the different barriers should do
>>>>>>> etc. is
>>>>>>> copied in redundantly for each os_cpu and each type of memory
>>>>>>> access for
>>>>>>> each os_cpu. This makes it extremely painful 1) to understand what
>>>>>>> actually defines a certain platform compared to the defaults and
>>>>>>> 2) to
>>>>>>> correct bugs like those discovered here 3) prevent silly mistakes
>>>>>>> and
>>>>>>> bugs, by simply having a lot less code defining the behaviour of
>>>>>>> OrderAccess that could go wrong.
>>>>>>>
>>>>>>> A new architecture/design for OrderAccess is proposed, using a
>>>>>>> generalization/specialization approach.
>>>>>>>
>>>>>>> A general implementation in /share/ defines how things work and
>>>>>>> splits
>>>>>>> into different concerns: 1) how to make an atomic memory access, 2)
>>>>>>> where to but barriers w.r.t. the memory access for things like
>>>>>>> load_acquire, release_store and release_store_fence, 3) how these
>>>>>>> barriers are defined.
>>>>>>>
>>>>>>> This allows a clear split between what is required for following the
>>>>>>> specifications, and optimizations, which become much more
>>>>>>> readable and
>>>>>>> only optimizations need to be reviewed in depth as the defaults can
>>>>>>> always be trusted given correct standalone barriers.
>>>>>>>
>>>>>>> The only thing a platform is required to specify, is what an
>>>>>>> implementation of acquire(), release() and fence() should do. If
>>>>>>> they
>>>>>>> are implemented properly, everything in OrderAccess is guaranteed to
>>>>>>> work according to specification using the generalized code. This
>>>>>>> makes
>>>>>>> it very easy to support new ports. ALL the other code in the os_cpu
>>>>>>> files is used /only/ for optimization purposes offered for specific
>>>>>>> configurations.
>>>>>>>
>>>>>>> However, it is highly customizable so that specific platform can
>>>>>>> perform
>>>>>>> any desired optimizations. For instance this load_acquire on PPC is
>>>>>>> optimized:
>>>>>>>
>>>>>>> template<> inline jbyte
>>>>>>> OrderAccess::specialized_load_acquire<jbyte>
>>>>>>> (volatile jbyte*  p) { register jbyte t = load(p);
>>>>>>> inlasm_acquire_reg(t); return t; }
>>>>>>>
>>>>>>> This overrides the whole load_acquire implementation to do something
>>>>>>> custom. Platforms like x86 extensively use this for joined fencing
>>>>>>> variants to optimize.
>>>>>>>
>>>>>>> The default implementation of load_acquire() will make an atomic
>>>>>>> load()
>>>>>>> followed by acquire() since the semantics is generalized. The
>>>>>>> generalized semantics are defined using inlined postfix/prefix calls
>>>>>>> after/before the atomic access, as defined here:
>>>>>>>
>>>>>>> template<> inline void
>>>>>>> ScopedFenceGeneral<X_ACQUIRE>::postfix()       {
>>>>>>> OrderAccess::acquire(); }
>>>>>>> template<> inline void
>>>>>>> ScopedFenceGeneral<RELEASE_X>::prefix()        {
>>>>>>> OrderAccess::release(); }
>>>>>>> template<> inline void
>>>>>>> ScopedFenceGeneral<RELEASE_X_FENCE>::prefix()  {
>>>>>>> OrderAccess::release(); }
>>>>>>> template<> inline void
>>>>>>> ScopedFenceGeneral<RELEASE_X_FENCE>::postfix() {
>>>>>>> OrderAccess::fence();   }
>>>>>>>
>>>>>>> For platforms that simply wish to override what e.g. acquire
>>>>>>> means for a
>>>>>>> joined ordered memory access in general, as different to calling
>>>>>>> stand
>>>>>>> alone acquire(), the semantics can be easily overridden for a
>>>>>>> platform
>>>>>>> such as windows like on windows:
>>>>>>>
>>>>>>> template<> inline void ScopedFence<X_ACQUIRE>::postfix()       { }
>>>>>>> template<> inline void ScopedFence<RELEASE_X>::prefix()        { }
>>>>>>> template<> inline void ScopedFence<RELEASE_X_FENCE>::prefix()  { }
>>>>>>> template<> inline void ScopedFence<RELEASE_X_FENCE>::postfix() {
>>>>>>> OrderAccess::fence(); }
>>>>>>>
>>>>>>> In this example, since Windows (now) has a compiler barrier for
>>>>>>> acquire,
>>>>>>> but does not need it for joined accesses since volatile has stronger
>>>>>>> guarantees on windows, this is enough to specialize that for joined
>>>>>>> memory accesses, no extra protection is needed.
>>>>>>>
>>>>>>> == Backward Compatibility and Transitioning ==
>>>>>>>
>>>>>>> Since the newly proposed code is structured differently to before, a
>>>>>>> #define was added for backward compatibility so that external
>>>>>>> repositories not adhering to this new architecture do not break.
>>>>>>> Furthermore, store_release was declared private and marked as
>>>>>>> deprecated. This allows for a smooth transition into the new
>>>>>>> style of
>>>>>>> OrderAccess. When the full transition is made in all known repos,
>>>>>>> the
>>>>>>> #define and store_fence may be safely removed, eventually.
>>>>>>>
>>>>>>> == Documentation ==
>>>>>>>
>>>>>>> The documentation seems old and outdated, describing how it works on
>>>>>>> SPARC RMO and IA64, which are nowhere to be found in the
>>>>>>> repository. It
>>>>>>> also describes things about C++ volatiles which cannot be relied
>>>>>>> upon.
>>>>>>> The documentation has been cleaned up to match the current state
>>>>>>> of the
>>>>>>> implementation better, with architectures actually found in the
>>>>>>> repository.
>>>>>>>
>>>>>>> == Testing ==
>>>>>>>
>>>>>>> JPRT. Big thanks to Jesper Wilhelmsson for helping me test these
>>>>>>> changes.
>>>>>>> Ran some DaCapo benchmarks (I know okay :p) for performance
>>>>>>> regression
>>>>>>> and there was no perceivable difference.
>>>>>>>
>>>>>>> Looking forward to feedback on this, and hope to get some
>>>>>>> reviews. :)
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Erik
>>>
>>>