RFR: 7143664: Clean up OrderAccess implementations and usage

Fri Jan 23 15:16:55 UTC 2015

Hi Dean and David,

Thanks for reviewing this change.

On 23/01/15 08:12, David Holmes wrote:
> On 23/01/2015 4:36 PM, Dean Long wrote:
>>    147 // Ordering a load relative to preceding stores requires a fence,
>>    148 // which implies a membar #StoreLoad between the store and load under
>>    149 // sparc-TSO.  A fence is required by x86.  On x86, we use explicitly
>>    150 // locked add.
>>    151 //
>>
>> It sounds like the above is saying that fence is the same as StoreLoad ...
>
> That may be a s/store_fence/fence/ typo. The original text was:
>
> // Ordering a load relative to preceding stores requires a store_fence,
> // which implies a membar #StoreLoad between the store and load under
> // sparc-TSO.  A fence is required by ia64.  On x86, we use locked xchg.
>
> Actually seems like a couple of typos there: ia64 became x86, then we
> have x86 again.

I tried to remove ia64 from the comments as it is not in the repository, 
which in this case became a bit redundant. Are we okay if I remove the 
"A fence is required by x86" sentence?

Oh and Dean about membar #StoreLoad on SPARC and locked add on x86... 
are you saying they are not equivalent?

New version of that paragraph with that sentence removed:

147 // Ordering a load relative to preceding stores requires a fence,
148 // which implies a membar #StoreLoad between the store and load under
149 // sparc-TSO. On x86, we use explicitly locked add.

>>    152 // 4. store, load  <= is constrained by => store, fence, load
>>    153 //
>>    154 // Use store, fence to make sure all stores done in an 'interesting'
>>    155 // region are made visible prior to both subsequent loads and stores.
>>
>> ... and this is like saying to use fence because StoreStore | StoreLoad
>> isn't available.
>
> Again this seems an issue with store_fence being changed to "store,
> fence"  which doesn't really make sense.

The thing is that I removed store_fence (not store_release as it says in 
the bug ID, sorry about that - there obviously is no store_release). So 
I wanted to remove references in the comments to store_fence as well.
Therefore I changed the example to use store fence load instead of 
store_fence load. Perhaps it's better if I remove that sample all 
together instead?

Thanks for reviewing my changes!

/Erik

> David
>
>>
>>> Furthermore, store_release was declared private and marked as
>> deprecated.
>>
>> I can't find where this was done.
>>
>> dl
>>
>> On 1/22/2015 5:19 PM, Erik Österlund wrote:
>>> Hi all,
>>>
>>> == Summary of Changes ==
>>>
>>> This changeset fixes issues in OrderAccess on multiple levels from the
>>> memory model semantics to compiler reorderings, to addressing
>>> maintainability/portability issues which (almost) had to be fixed in
>>> order to fix the correctness issues. It is the result of discussions
>>> found in the previous "OrderAccess Refactoring" thread:
>>> http://openjdk.5641.n7.nabble.com/OrderAccess-Refactoring-td212050.html
>>>
>>> Bug ID:
>>> https://bugs.openjdk.java.net/browse/JDK-7143664
>>> (updated to reflect these related changes)
>>>
>>> Webrev:
>>> http://cr.openjdk.java.net/~dholmes/7143664/webrev/
>>>
>>> Before I describe more I would like to give special thanks to David
>>> Holmes for long discussions leading up to the currently proposed
>>> changes. I would also like to thank Jesper Wilhelmsson for helping me
>>> run my changes through JPRT.
>>>
>>> == Motivation ==
>>>
>>> This change directly fixes a reported OrderAccess bug due to compiler
>>> reorderings which is still a vulnerability on almost all TSO platforms:
>>> https://bugs.openjdk.java.net/browse/JDK-806196
>>>
>>> And directly fixes confusions like release_store() != release() store()
>>> due to memory model issues previously described in this bug ID.
>>>
>>> At the same time it provides clearer design with separation of concerns
>>> and generalization/specialization, removing a whole bunch of platform
>>> specific code which could be generalized. The platform specific files
>>> now only have a few LoC requirements (~7) to conform to the memory model
>>> by specifying what the stand alone barriers do. Then optionally
>>> optimizations to the general approach are possible if platforms support
>>> it. This also makes it much easier to port to new platforms.
>>>
>>> == Memory Model ==
>>>
>>> The current definitions of acquire/release semantics are a bit fishy
>>> leading to problems previously described in the bug ID (release_store()
>>> != release() store()) and some other correctness issues. It has
>>> therefore been replaced with a new model. I would like to thank David
>>> Holmes for the long discussions leading up to the newly proposed model.
>>>
>>> The new model is formally defined like this:
>>>
>>> // T1: access_shared_data
>>> // T1: ]release
>>> // T1: (...)
>>> // T1: store(X)
>>> //
>>> // T2: load(X)
>>> // T2: (...)
>>> // T2: acquire[
>>> // T2: access_shared_data
>>> //
>>> // It is guaranteed that if T2: load(X) synchronizes with (observes the
>>> // value written by) T1: store(X), then the memory accesses before the
>>> // T1: ]release happen before the memory accesses after the T2: acquire[.
>>>
>>> The orderAccess.hpp file and bug ID also has a few additional
>>> explanations making it more intuitive to the user when to use
>>> acquire/release and the resemblance to TSO abstract machines. Big thanks
>>> goes to David Holmes for discussing the memory model with me, which
>>> helped a lot in deriving it.
>>>
>>> Now it holds that release() store() == release_store(), and the other
>>> correctness issues are fixed as well.
>>>
>>> The new model is also closer to C++11 definitions which could give us
>>> more relaxed compiler reordering constraints in the future when compiler
>>> support for C++11 is there and ready.
>>>
>>> == Reliance on C++ Volatile Semantics ==
>>>
>>> The C++ standard section 1.9 "Program Execution" is very vague about
>>> what the keyword volatile can actually do for us. It gives clear
>>> constraints in terms of volatile-volatile accesses but says little about
>>> nonvolatile-volatile accesses. Yet the current implementation heavily
>>> relies upon volatile to in terms of compiler reordering. But GCC
>>> explicitly declares that volatiles and non-volatiles may reorder freely
>>> ( https://gcc.gnu.org/onlinedocs/gcc/Volatiles.html ). The only compiler
>>> known to explicitly provide the wanted semantics with volatile is MSVC
>>> 2010 for windows (
>>> https://msdn.microsoft.com/en-us/library/12a04hfd(v=vs.100).aspx ).
>>> Compilers not giving explicit guarantees, must be considered unsafe and
>>> revert to conservative behaviour.
>>>
>>> This was brought to attention after causing bugs, but was only fixed for
>>> x86 linux. This is a fundamental issue inherent to all TSO platforms
>>> except windows, and has to be fixed on all of them.
>>>
>>> Several barriers are unsafe to use because they lack compiler reordering
>>> constraints (e.g. fence and acquire on linux_SPARC). For TSO platforms
>>> they are typically implemented using dummy loads and stores. This seems
>>> to be another old volatile reliance that I fixed. These barriers
>>> sometimes have omitted compiler barriers (which is what we really want).
>>> This seems to be another example on incorrect reliance on the volatile
>>> semantics to help us. Therefore dummy loads/stores have been replaced
>>> with compiler barriers on TSO platforms.
>>>
>>> It is also worth noting that compilers like sun studio did previously
>>> not support inline asm syntax. Therefore, barriers were implemented in
>>> .il-files. However, using them does not give explicit compiler
>>> constraints for reordering AFAIK. Therefore, they have been
>>> reimplemented using inline asm with explicit compiler reordering
>>> constraints, as even sun (solaris?) studio now supports this.
>>>
>>> The windows variants have added a windows-style _ReadWriteBarrier()
>>> compiler barrier similarly.
>>>
>>> == Strange Hardware Reorderings ==
>>>
>>> Fixed a weird inconsistency where acquire, loadstore and loadlaod would
>>> use isync instead of lwsync for PPC on linux_zero, but not in any other
>>> PPC platform in the repo. I assumed this is wrong and changed it to
>>> lwsync instead.
>>>
>>> == Code Redundancy and Refactoring ==
>>>
>>> The OrderAccess code looks like it has been developed over a long period
>>> of time, with small incremental changes. This seems to have led to a lot
>>> of code duplication over time. For example, store_fence variants are not
>>> referenced from anywhere, yet contribute to a lot of the code base and a
>>> lot of awkwardness (such as being the one only exception not using
>>> volatiles for some reason). Moreover, store_fence is not used anywhere
>>> in hotspot because it is not a good fit for either the acquire/release
>>> semantics or the Java volatile semantics, leaving a big question mark on
>>> when it should ever be used. I took the liberty of removing it.
>>>
>>> Another redundancy issue is that most of the semantics is exactly the
>>> same for all platforms, yet all that default boilerplate such as how to
>>> make atomic accesses, where acquire/release is supposed to be placed
>>> w.r.t. the memory access, what the different barriers should do etc. is
>>> copied in redundantly for each os_cpu and each type of memory access for
>>> each os_cpu. This makes it extremely painful 1) to understand what
>>> actually defines a certain platform compared to the defaults and 2) to
>>> correct bugs like those discovered here 3) prevent silly mistakes and
>>> bugs, by simply having a lot less code defining the behaviour of
>>> OrderAccess that could go wrong.
>>>
>>> A new architecture/design for OrderAccess is proposed, using a
>>> generalization/specialization approach.
>>>
>>> A general implementation in /share/ defines how things work and splits
>>> into different concerns: 1) how to make an atomic memory access, 2)
>>> where to but barriers w.r.t. the memory access for things like
>>> load_acquire, release_store and release_store_fence, 3) how these
>>> barriers are defined.
>>>
>>> This allows a clear split between what is required for following the
>>> specifications, and optimizations, which become much more readable and
>>> only optimizations need to be reviewed in depth as the defaults can
>>> always be trusted given correct standalone barriers.
>>>
>>> The only thing a platform is required to specify, is what an
>>> implementation of acquire(), release() and fence() should do. If they
>>> are implemented properly, everything in OrderAccess is guaranteed to
>>> work according to specification using the generalized code. This makes
>>> it very easy to support new ports. ALL the other code in the os_cpu
>>> files is used /only/ for optimization purposes offered for specific
>>> configurations.
>>>
>>> However, it is highly customizable so that specific platform can perform
>>> any desired optimizations. For instance this load_acquire on PPC is
>>> optimized:
>>>
>>> template<> inline jbyte  OrderAccess::specialized_load_acquire<jbyte>
>>> (volatile jbyte*  p) { register jbyte t = load(p);
>>> inlasm_acquire_reg(t); return t; }
>>>
>>> This overrides the whole load_acquire implementation to do something
>>> custom. Platforms like x86 extensively use this for joined fencing
>>> variants to optimize.
>>>
>>> The default implementation of load_acquire() will make an atomic load()
>>> followed by acquire() since the semantics is generalized. The
>>> generalized semantics are defined using inlined postfix/prefix calls
>>> after/before the atomic access, as defined here:
>>>
>>> template<> inline void ScopedFenceGeneral<X_ACQUIRE>::postfix()       {
>>> OrderAccess::acquire(); }
>>> template<> inline void ScopedFenceGeneral<RELEASE_X>::prefix()        {
>>> OrderAccess::release(); }
>>> template<> inline void ScopedFenceGeneral<RELEASE_X_FENCE>::prefix()  {
>>> OrderAccess::release(); }
>>> template<> inline void ScopedFenceGeneral<RELEASE_X_FENCE>::postfix() {
>>> OrderAccess::fence();   }
>>>
>>> For platforms that simply wish to override what e.g. acquire means for a
>>> joined ordered memory access in general, as different to calling stand
>>> alone acquire(), the semantics can be easily overridden for a platform
>>> such as windows like on windows:
>>>
>>> template<> inline void ScopedFence<X_ACQUIRE>::postfix()       { }
>>> template<> inline void ScopedFence<RELEASE_X>::prefix()        { }
>>> template<> inline void ScopedFence<RELEASE_X_FENCE>::prefix()  { }
>>> template<> inline void ScopedFence<RELEASE_X_FENCE>::postfix() {
>>> OrderAccess::fence(); }
>>>
>>> In this example, since Windows (now) has a compiler barrier for acquire,
>>> but does not need it for joined accesses since volatile has stronger
>>> guarantees on windows, this is enough to specialize that for joined
>>> memory accesses, no extra protection is needed.
>>>
>>> == Backward Compatibility and Transitioning ==
>>>
>>> Since the newly proposed code is structured differently to before, a
>>> #define was added for backward compatibility so that external
>>> repositories not adhering to this new architecture do not break.
>>> Furthermore, store_release was declared private and marked as
>>> deprecated. This allows for a smooth transition into the new style of
>>> OrderAccess. When the full transition is made in all known repos, the
>>> #define and store_fence may be safely removed, eventually.
>>>
>>> == Documentation ==
>>>
>>> The documentation seems old and outdated, describing how it works on
>>> SPARC RMO and IA64, which are nowhere to be found in the repository. It
>>> also describes things about C++ volatiles which cannot be relied upon.
>>> The documentation has been cleaned up to match the current state of the
>>> implementation better, with architectures actually found in the
>>> repository.
>>>
>>> == Testing ==
>>>
>>> JPRT. Big thanks to Jesper Wilhelmsson for helping me test these changes.
>>> Ran some DaCapo benchmarks (I know okay :p) for performance regression
>>> and there was no perceivable difference.
>>>
>>> Looking forward to feedback on this, and hope to get some reviews. :)
>>>
>>> Thanks,
>>> Erik
>>
>