RFR: 7143664: Clean up OrderAccess implementations and usage

Fri Jan 23 23:03:26 UTC 2015

Hi Dean,

On 23/01/15 22:50, Dean Long wrote:
> On 1/23/2015 7:16 AM, Erik Österlund wrote:
>> Hi Dean and David,
>>
>> Thanks for reviewing this change.
>>
>> On 23/01/15 08:12, David Holmes wrote:
>>> On 23/01/2015 4:36 PM, Dean Long wrote:
>>>>      147 // Ordering a load relative to preceding stores requires a fence,
>>>>      148 // which implies a membar #StoreLoad between the store and load under
>>>>      149 // sparc-TSO.  A fence is required by x86.  On x86, we use explicitly
>>>>      150 // locked add.
>>>>      151 //
>>>>
>>>> It sounds like the above is saying that fence is the same as StoreLoad ...
>>> That may be a s/store_fence/fence/ typo. The original text was:
>>>
>>> // Ordering a load relative to preceding stores requires a store_fence,
>>> // which implies a membar #StoreLoad between the store and load under
>>> // sparc-TSO.  A fence is required by ia64.  On x86, we use locked xchg.
>>>
>>> Actually seems like a couple of typos there: ia64 became x86, then we
>>> have x86 again.
>> I tried to remove ia64 from the comments as it is not in the repository,
>> which in this case became a bit redundant. Are we okay if I remove the
>> "A fence is required by x86" sentence?
>>
>> Oh and Dean about membar #StoreLoad on SPARC and locked add on x86...
>> are you saying they are not equivalent?
>
> I wasn't trying to say anything abou SPARC or x86, only that "fence" !=
> "StoreLoad".

Agreed.

>> New version of that paragraph with that sentence removed:
>>
>> 147 // Ordering a load relative to preceding stores requires a fence,
>> 148 // which implies a membar #StoreLoad between the store and load under
>> 149 // sparc-TSO. On x86, we use explicitly locked add.
>
> I would replace "fence" with "StoreLoad" above.

Again, I just changed any store_fence descriptions to use fence instead. 
But you are certainly right, it is more accurate to talk about StoreLoad 
here and not mention fence in the first place, I'll change this, thanks 
for the feedback. :)

Changed to:

147 // Ordering a load relative to preceding stores requires a StoreLoad,
148 // which implies a membar #StoreLoad between the store and load under
149 // sparc-TSO. On x86, we use explicitly locked add.

>>>>      152 // 4. store, load  <= is constrained by => store, fence, load
>>>>      153 //
>>>>      154 // Use store, fence to make sure all stores done in an 'interesting'
>>>>      155 // region are made visible prior to both subsequent loads and stores.
>>>>
>>>> ... and this is like saying to use fence because StoreStore | StoreLoad
>>>> isn't available.
>>> Again this seems an issue with store_fence being changed to "store,
>>> fence"  which doesn't really make sense.
>> The thing is that I removed store_fence (not store_release as it says in
>> the bug ID, sorry about that - there obviously is no store_release). So
>> I wanted to remove references in the comments to store_fence as well.
>> Therefore I changed the example to use store fence load instead of
>> store_fence load. Perhaps it's better if I remove that sample all
>> together instead?
>
> That works for me.

Fixed.

Thanks,
/Erik

> dl
>
>> Thanks for reviewing my changes!
>>
>> /Erik
>>
>>> David
>>>
>>>>> Furthermore, store_release was declared private and marked as
>>>> deprecated.
>>>>
>>>> I can't find where this was done.
>>>>
>>>> dl
>>>>
>>>> On 1/22/2015 5:19 PM, Erik Österlund wrote:
>>>>> Hi all,
>>>>>
>>>>> == Summary of Changes ==
>>>>>
>>>>> This changeset fixes issues in OrderAccess on multiple levels from the
>>>>> memory model semantics to compiler reorderings, to addressing
>>>>> maintainability/portability issues which (almost) had to be fixed in
>>>>> order to fix the correctness issues. It is the result of discussions
>>>>> found in the previous "OrderAccess Refactoring" thread:
>>>>> http://openjdk.5641.n7.nabble.com/OrderAccess-Refactoring-td212050.html
>>>>>
>>>>> Bug ID:
>>>>> https://bugs.openjdk.java.net/browse/JDK-7143664
>>>>> (updated to reflect these related changes)
>>>>>
>>>>> Webrev:
>>>>> http://cr.openjdk.java.net/~dholmes/7143664/webrev/
>>>>>
>>>>> Before I describe more I would like to give special thanks to David
>>>>> Holmes for long discussions leading up to the currently proposed
>>>>> changes. I would also like to thank Jesper Wilhelmsson for helping me
>>>>> run my changes through JPRT.
>>>>>
>>>>> == Motivation ==
>>>>>
>>>>> This change directly fixes a reported OrderAccess bug due to compiler
>>>>> reorderings which is still a vulnerability on almost all TSO platforms:
>>>>> https://bugs.openjdk.java.net/browse/JDK-806196
>>>>>
>>>>> And directly fixes confusions like release_store() != release() store()
>>>>> due to memory model issues previously described in this bug ID.
>>>>>
>>>>> At the same time it provides clearer design with separation of concerns
>>>>> and generalization/specialization, removing a whole bunch of platform
>>>>> specific code which could be generalized. The platform specific files
>>>>> now only have a few LoC requirements (~7) to conform to the memory model
>>>>> by specifying what the stand alone barriers do. Then optionally
>>>>> optimizations to the general approach are possible if platforms support
>>>>> it. This also makes it much easier to port to new platforms.
>>>>>
>>>>> == Memory Model ==
>>>>>
>>>>> The current definitions of acquire/release semantics are a bit fishy
>>>>> leading to problems previously described in the bug ID (release_store()
>>>>> != release() store()) and some other correctness issues. It has
>>>>> therefore been replaced with a new model. I would like to thank David
>>>>> Holmes for the long discussions leading up to the newly proposed model.
>>>>>
>>>>> The new model is formally defined like this:
>>>>>
>>>>> // T1: access_shared_data
>>>>> // T1: ]release
>>>>> // T1: (...)
>>>>> // T1: store(X)
>>>>> //
>>>>> // T2: load(X)
>>>>> // T2: (...)
>>>>> // T2: acquire[
>>>>> // T2: access_shared_data
>>>>> //
>>>>> // It is guaranteed that if T2: load(X) synchronizes with (observes the
>>>>> // value written by) T1: store(X), then the memory accesses before the
>>>>> // T1: ]release happen before the memory accesses after the T2: acquire[.
>>>>>
>>>>> The orderAccess.hpp file and bug ID also has a few additional
>>>>> explanations making it more intuitive to the user when to use
>>>>> acquire/release and the resemblance to TSO abstract machines. Big thanks
>>>>> goes to David Holmes for discussing the memory model with me, which
>>>>> helped a lot in deriving it.
>>>>>
>>>>> Now it holds that release() store() == release_store(), and the other
>>>>> correctness issues are fixed as well.
>>>>>
>>>>> The new model is also closer to C++11 definitions which could give us
>>>>> more relaxed compiler reordering constraints in the future when compiler
>>>>> support for C++11 is there and ready.
>>>>>
>>>>> == Reliance on C++ Volatile Semantics ==
>>>>>
>>>>> The C++ standard section 1.9 "Program Execution" is very vague about
>>>>> what the keyword volatile can actually do for us. It gives clear
>>>>> constraints in terms of volatile-volatile accesses but says little about
>>>>> nonvolatile-volatile accesses. Yet the current implementation heavily
>>>>> relies upon volatile to in terms of compiler reordering. But GCC
>>>>> explicitly declares that volatiles and non-volatiles may reorder freely
>>>>> ( https://gcc.gnu.org/onlinedocs/gcc/Volatiles.html ). The only compiler
>>>>> known to explicitly provide the wanted semantics with volatile is MSVC
>>>>> 2010 for windows (
>>>>> https://msdn.microsoft.com/en-us/library/12a04hfd(v=vs.100).aspx ).
>>>>> Compilers not giving explicit guarantees, must be considered unsafe and
>>>>> revert to conservative behaviour.
>>>>>
>>>>> This was brought to attention after causing bugs, but was only fixed for
>>>>> x86 linux. This is a fundamental issue inherent to all TSO platforms
>>>>> except windows, and has to be fixed on all of them.
>>>>>
>>>>> Several barriers are unsafe to use because they lack compiler reordering
>>>>> constraints (e.g. fence and acquire on linux_SPARC). For TSO platforms
>>>>> they are typically implemented using dummy loads and stores. This seems
>>>>> to be another old volatile reliance that I fixed. These barriers
>>>>> sometimes have omitted compiler barriers (which is what we really want).
>>>>> This seems to be another example on incorrect reliance on the volatile
>>>>> semantics to help us. Therefore dummy loads/stores have been replaced
>>>>> with compiler barriers on TSO platforms.
>>>>>
>>>>> It is also worth noting that compilers like sun studio did previously
>>>>> not support inline asm syntax. Therefore, barriers were implemented in
>>>>> .il-files. However, using them does not give explicit compiler
>>>>> constraints for reordering AFAIK. Therefore, they have been
>>>>> reimplemented using inline asm with explicit compiler reordering
>>>>> constraints, as even sun (solaris?) studio now supports this.
>>>>>
>>>>> The windows variants have added a windows-style _ReadWriteBarrier()
>>>>> compiler barrier similarly.
>>>>>
>>>>> == Strange Hardware Reorderings ==
>>>>>
>>>>> Fixed a weird inconsistency where acquire, loadstore and loadlaod would
>>>>> use isync instead of lwsync for PPC on linux_zero, but not in any other
>>>>> PPC platform in the repo. I assumed this is wrong and changed it to
>>>>> lwsync instead.
>>>>>
>>>>> == Code Redundancy and Refactoring ==
>>>>>
>>>>> The OrderAccess code looks like it has been developed over a long period
>>>>> of time, with small incremental changes. This seems to have led to a lot
>>>>> of code duplication over time. For example, store_fence variants are not
>>>>> referenced from anywhere, yet contribute to a lot of the code base and a
>>>>> lot of awkwardness (such as being the one only exception not using
>>>>> volatiles for some reason). Moreover, store_fence is not used anywhere
>>>>> in hotspot because it is not a good fit for either the acquire/release
>>>>> semantics or the Java volatile semantics, leaving a big question mark on
>>>>> when it should ever be used. I took the liberty of removing it.
>>>>>
>>>>> Another redundancy issue is that most of the semantics is exactly the
>>>>> same for all platforms, yet all that default boilerplate such as how to
>>>>> make atomic accesses, where acquire/release is supposed to be placed
>>>>> w.r.t. the memory access, what the different barriers should do etc. is
>>>>> copied in redundantly for each os_cpu and each type of memory access for
>>>>> each os_cpu. This makes it extremely painful 1) to understand what
>>>>> actually defines a certain platform compared to the defaults and 2) to
>>>>> correct bugs like those discovered here 3) prevent silly mistakes and
>>>>> bugs, by simply having a lot less code defining the behaviour of
>>>>> OrderAccess that could go wrong.
>>>>>
>>>>> A new architecture/design for OrderAccess is proposed, using a
>>>>> generalization/specialization approach.
>>>>>
>>>>> A general implementation in /share/ defines how things work and splits
>>>>> into different concerns: 1) how to make an atomic memory access, 2)
>>>>> where to but barriers w.r.t. the memory access for things like
>>>>> load_acquire, release_store and release_store_fence, 3) how these
>>>>> barriers are defined.
>>>>>
>>>>> This allows a clear split between what is required for following the
>>>>> specifications, and optimizations, which become much more readable and
>>>>> only optimizations need to be reviewed in depth as the defaults can
>>>>> always be trusted given correct standalone barriers.
>>>>>
>>>>> The only thing a platform is required to specify, is what an
>>>>> implementation of acquire(), release() and fence() should do. If they
>>>>> are implemented properly, everything in OrderAccess is guaranteed to
>>>>> work according to specification using the generalized code. This makes
>>>>> it very easy to support new ports. ALL the other code in the os_cpu
>>>>> files is used /only/ for optimization purposes offered for specific
>>>>> configurations.
>>>>>
>>>>> However, it is highly customizable so that specific platform can perform
>>>>> any desired optimizations. For instance this load_acquire on PPC is
>>>>> optimized:
>>>>>
>>>>> template<> inline jbyte  OrderAccess::specialized_load_acquire<jbyte>
>>>>> (volatile jbyte*  p) { register jbyte t = load(p);
>>>>> inlasm_acquire_reg(t); return t; }
>>>>>
>>>>> This overrides the whole load_acquire implementation to do something
>>>>> custom. Platforms like x86 extensively use this for joined fencing
>>>>> variants to optimize.
>>>>>
>>>>> The default implementation of load_acquire() will make an atomic load()
>>>>> followed by acquire() since the semantics is generalized. The
>>>>> generalized semantics are defined using inlined postfix/prefix calls
>>>>> after/before the atomic access, as defined here:
>>>>>
>>>>> template<> inline void ScopedFenceGeneral<X_ACQUIRE>::postfix()       {
>>>>> OrderAccess::acquire(); }
>>>>> template<> inline void ScopedFenceGeneral<RELEASE_X>::prefix()        {
>>>>> OrderAccess::release(); }
>>>>> template<> inline void ScopedFenceGeneral<RELEASE_X_FENCE>::prefix()  {
>>>>> OrderAccess::release(); }
>>>>> template<> inline void ScopedFenceGeneral<RELEASE_X_FENCE>::postfix() {
>>>>> OrderAccess::fence();   }
>>>>>
>>>>> For platforms that simply wish to override what e.g. acquire means for a
>>>>> joined ordered memory access in general, as different to calling stand
>>>>> alone acquire(), the semantics can be easily overridden for a platform
>>>>> such as windows like on windows:
>>>>>
>>>>> template<> inline void ScopedFence<X_ACQUIRE>::postfix()       { }
>>>>> template<> inline void ScopedFence<RELEASE_X>::prefix()        { }
>>>>> template<> inline void ScopedFence<RELEASE_X_FENCE>::prefix()  { }
>>>>> template<> inline void ScopedFence<RELEASE_X_FENCE>::postfix() {
>>>>> OrderAccess::fence(); }
>>>>>
>>>>> In this example, since Windows (now) has a compiler barrier for acquire,
>>>>> but does not need it for joined accesses since volatile has stronger
>>>>> guarantees on windows, this is enough to specialize that for joined
>>>>> memory accesses, no extra protection is needed.
>>>>>
>>>>> == Backward Compatibility and Transitioning ==
>>>>>
>>>>> Since the newly proposed code is structured differently to before, a
>>>>> #define was added for backward compatibility so that external
>>>>> repositories not adhering to this new architecture do not break.
>>>>> Furthermore, store_release was declared private and marked as
>>>>> deprecated. This allows for a smooth transition into the new style of
>>>>> OrderAccess. When the full transition is made in all known repos, the
>>>>> #define and store_fence may be safely removed, eventually.
>>>>>
>>>>> == Documentation ==
>>>>>
>>>>> The documentation seems old and outdated, describing how it works on
>>>>> SPARC RMO and IA64, which are nowhere to be found in the repository. It
>>>>> also describes things about C++ volatiles which cannot be relied upon.
>>>>> The documentation has been cleaned up to match the current state of the
>>>>> implementation better, with architectures actually found in the
>>>>> repository.
>>>>>
>>>>> == Testing ==
>>>>>
>>>>> JPRT. Big thanks to Jesper Wilhelmsson for helping me test these changes.
>>>>> Ran some DaCapo benchmarks (I know okay :p) for performance regression
>>>>> and there was no perceivable difference.
>>>>>
>>>>> Looking forward to feedback on this, and hope to get some reviews. :)
>>>>>
>>>>> Thanks,
>>>>> Erik
>
>