RFR: 7143664: Clean up OrderAccess implementations and usage

Thu Jan 29 02:16:01 UTC 2015

Hi David,

Thank you for uploading the latest webrev. :)

/Erik

On 28/01/15 23:48, David Holmes wrote:
> Updated webrev:
>
> http://cr.openjdk.java.net/~dholmes/7143664/webrev.v2/
>
> Thanks,
> David
>
> On 28/01/2015 4:09 PM, David Holmes wrote:
>> Hi Erik,
>>
>> Sorry for the delay, it was a long weekend here and I'm playing catch
>> up. :)
>>
>> On 24/01/2015 9:03 AM, Erik Österlund wrote:
>>> Hi Dean,
>>>
>>> On 23/01/15 22:50, Dean Long wrote:
>>>> On 1/23/2015 7:16 AM, Erik Österlund wrote:
>>>>> Hi Dean and David,
>>>>>
>>>>> Thanks for reviewing this change.
>>>>>
>>>>> On 23/01/15 08:12, David Holmes wrote:
>>>>>> On 23/01/2015 4:36 PM, Dean Long wrote:
>>>>>>>        147 // Ordering a load relative to preceding stores requires
>>>>>>> a fence,
>>>>>>>        148 // which implies a membar #StoreLoad between the store
>>>>>>> and load under
>>>>>>>        149 // sparc-TSO.  A fence is required by x86.  On x86, we
>>>>>>> use explicitly
>>>>>>>        150 // locked add.
>>>>>>>        151 //
>>>>>>>
>>>>>>> It sounds like the above is saying that fence is the same as
>>>>>>> StoreLoad ...
>>>>>> That may be a s/store_fence/fence/ typo. The original text was:
>>>>>>
>>>>>> // Ordering a load relative to preceding stores requires a
>>>>>> store_fence,
>>>>>> // which implies a membar #StoreLoad between the store and load under
>>>>>> // sparc-TSO.  A fence is required by ia64.  On x86, we use locked
>>>>>> xchg.
>>>>>>
>>>>>> Actually seems like a couple of typos there: ia64 became x86, then we
>>>>>> have x86 again.
>>>>> I tried to remove ia64 from the comments as it is not in the
>>>>> repository,
>>>>> which in this case became a bit redundant. Are we okay if I remove the
>>>>> "A fence is required by x86" sentence?
>>>>>
>>>>> Oh and Dean about membar #StoreLoad on SPARC and locked add on x86...
>>>>> are you saying they are not equivalent?
>>>>
>>>> I wasn't trying to say anything abou SPARC or x86, only that "fence" !=
>>>> "StoreLoad".
>>>
>>> Agreed.
>>>
>>>>> New version of that paragraph with that sentence removed:
>>>>>
>>>>> 147 // Ordering a load relative to preceding stores requires a fence,
>>>>> 148 // which implies a membar #StoreLoad between the store and load
>>>>> under
>>>>> 149 // sparc-TSO. On x86, we use explicitly locked add.
>>>>
>>>> I would replace "fence" with "StoreLoad" above.
>>>
>>> Again, I just changed any store_fence descriptions to use fence instead.
>>> But you are certainly right, it is more accurate to talk about StoreLoad
>>> here and not mention fence in the first place, I'll change this, thanks
>>> for the feedback. :)
>>>
>>> Changed to:
>>>
>>> 147 // Ordering a load relative to preceding stores requires a StoreLoad,
>>> 148 // which implies a membar #StoreLoad between the store and load under
>>> 149 // sparc-TSO. On x86, we use explicitly locked add.
>>
>> The confusion here is what "store-fence" was intended to mean. In this
>> part we're implying it is a storeLoad, but in the discussion below it
>> indicates store-fence orders subsequent loads and stores - which makes
>> it a storeStore|storeLoad barrier. Of course storeLoad subsumes
>> storeStore so the above is accurate - though I think the original text
>> would have been more clearly stated as "Ordering a load relative to
>> preceding stores _can be achieved using a_ store-fence, ..."
>>
>> So in summary: update above is okay.
>>
>>>>>>>        152 // 4. store, load  <= is constrained by => store, fence,
>>>>>>> load
>>>>>>>        153 //
>>>>>>>        154 // Use store, fence to make sure all stores done in an
>>>>>>> 'interesting'
>>>>>>>        155 // region are made visible prior to both subsequent
>>>>>>> loads and stores.
>>>>>>>
>>>>>>> ... and this is like saying to use fence because StoreStore |
>>>>>>> StoreLoad
>>>>>>> isn't available.
>>>>>> Again this seems an issue with store_fence being changed to "store,
>>>>>> fence"  which doesn't really make sense.
>>>>> The thing is that I removed store_fence (not store_release as it
>>>>> says in
>>>>> the bug ID, sorry about that - there obviously is no store_release). So
>>>>> I wanted to remove references in the comments to store_fence as well.
>>>>> Therefore I changed the example to use store fence load instead of
>>>>> store_fence load. Perhaps it's better if I remove that sample all
>>>>> together instead?
>>>>
>>>> That works for me.
>>>
>>> Fixed.
>>
>> Removal is an option but again replacing store-fence with what it really
>> means "store; membar<storeStore|StoreLoad>" would also suffice. But if
>> we don't need the combined  membar<storeStore|StoreLoad> such that
>> abstracting it into something like store_fence() makes sense, then
>> getting rid of everything pertaining to store-fence makes sense. I see
>> there are no uses of store_fence() other than a commented out reference
>> in ./cpu/x86/vm/c1_LIRAssembler_x86.cpp, which also contains a similar
>> commented out reference to the non-existent load_fence() - both of which
>> should be removed.
>>
>> If you'd like to send me an updated patch I will generate a new webrev.
>> Then I can get on with the functional part of the review.
>>
>> Thanks,
>> David
>>
>>> Thanks,
>>> /Erik
>>>
>>>> dl
>>>>
>>>>> Thanks for reviewing my changes!
>>>>>
>>>>> /Erik
>>>>>
>>>>>> David
>>>>>>
>>>>>>>> Furthermore, store_release was declared private and marked as
>>>>>>> deprecated.
>>>>>>>
>>>>>>> I can't find where this was done.
>>>>>>>
>>>>>>> dl
>>>>>>>
>>>>>>> On 1/22/2015 5:19 PM, Erik Österlund wrote:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> == Summary of Changes ==
>>>>>>>>
>>>>>>>> This changeset fixes issues in OrderAccess on multiple levels
>>>>>>>> from the
>>>>>>>> memory model semantics to compiler reorderings, to addressing
>>>>>>>> maintainability/portability issues which (almost) had to be fixed in
>>>>>>>> order to fix the correctness issues. It is the result of discussions
>>>>>>>> found in the previous "OrderAccess Refactoring" thread:
>>>>>>>> http://openjdk.5641.n7.nabble.com/OrderAccess-Refactoring-td212050.html
>>>>>>>>
>>>>>>>>
>>>>>>>> Bug ID:
>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-7143664
>>>>>>>> (updated to reflect these related changes)
>>>>>>>>
>>>>>>>> Webrev:
>>>>>>>> http://cr.openjdk.java.net/~dholmes/7143664/webrev/
>>>>>>>>
>>>>>>>> Before I describe more I would like to give special thanks to David
>>>>>>>> Holmes for long discussions leading up to the currently proposed
>>>>>>>> changes. I would also like to thank Jesper Wilhelmsson for
>>>>>>>> helping me
>>>>>>>> run my changes through JPRT.
>>>>>>>>
>>>>>>>> == Motivation ==
>>>>>>>>
>>>>>>>> This change directly fixes a reported OrderAccess bug due to
>>>>>>>> compiler
>>>>>>>> reorderings which is still a vulnerability on almost all TSO
>>>>>>>> platforms:
>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-806196
>>>>>>>>
>>>>>>>> And directly fixes confusions like release_store() != release()
>>>>>>>> store()
>>>>>>>> due to memory model issues previously described in this bug ID.
>>>>>>>>
>>>>>>>> At the same time it provides clearer design with separation of
>>>>>>>> concerns
>>>>>>>> and generalization/specialization, removing a whole bunch of
>>>>>>>> platform
>>>>>>>> specific code which could be generalized. The platform specific
>>>>>>>> files
>>>>>>>> now only have a few LoC requirements (~7) to conform to the
>>>>>>>> memory model
>>>>>>>> by specifying what the stand alone barriers do. Then optionally
>>>>>>>> optimizations to the general approach are possible if platforms
>>>>>>>> support
>>>>>>>> it. This also makes it much easier to port to new platforms.
>>>>>>>>
>>>>>>>> == Memory Model ==
>>>>>>>>
>>>>>>>> The current definitions of acquire/release semantics are a bit fishy
>>>>>>>> leading to problems previously described in the bug ID
>>>>>>>> (release_store()
>>>>>>>> != release() store()) and some other correctness issues. It has
>>>>>>>> therefore been replaced with a new model. I would like to thank
>>>>>>>> David
>>>>>>>> Holmes for the long discussions leading up to the newly proposed
>>>>>>>> model.
>>>>>>>>
>>>>>>>> The new model is formally defined like this:
>>>>>>>>
>>>>>>>> // T1: access_shared_data
>>>>>>>> // T1: ]release
>>>>>>>> // T1: (...)
>>>>>>>> // T1: store(X)
>>>>>>>> //
>>>>>>>> // T2: load(X)
>>>>>>>> // T2: (...)
>>>>>>>> // T2: acquire[
>>>>>>>> // T2: access_shared_data
>>>>>>>> //
>>>>>>>> // It is guaranteed that if T2: load(X) synchronizes with
>>>>>>>> (observes the
>>>>>>>> // value written by) T1: store(X), then the memory accesses
>>>>>>>> before the
>>>>>>>> // T1: ]release happen before the memory accesses after the T2:
>>>>>>>> acquire[.
>>>>>>>>
>>>>>>>> The orderAccess.hpp file and bug ID also has a few additional
>>>>>>>> explanations making it more intuitive to the user when to use
>>>>>>>> acquire/release and the resemblance to TSO abstract machines. Big
>>>>>>>> thanks
>>>>>>>> goes to David Holmes for discussing the memory model with me, which
>>>>>>>> helped a lot in deriving it.
>>>>>>>>
>>>>>>>> Now it holds that release() store() == release_store(), and the
>>>>>>>> other
>>>>>>>> correctness issues are fixed as well.
>>>>>>>>
>>>>>>>> The new model is also closer to C++11 definitions which could
>>>>>>>> give us
>>>>>>>> more relaxed compiler reordering constraints in the future when
>>>>>>>> compiler
>>>>>>>> support for C++11 is there and ready.
>>>>>>>>
>>>>>>>> == Reliance on C++ Volatile Semantics ==
>>>>>>>>
>>>>>>>> The C++ standard section 1.9 "Program Execution" is very vague about
>>>>>>>> what the keyword volatile can actually do for us. It gives clear
>>>>>>>> constraints in terms of volatile-volatile accesses but says
>>>>>>>> little about
>>>>>>>> nonvolatile-volatile accesses. Yet the current implementation
>>>>>>>> heavily
>>>>>>>> relies upon volatile to in terms of compiler reordering. But GCC
>>>>>>>> explicitly declares that volatiles and non-volatiles may reorder
>>>>>>>> freely
>>>>>>>> ( https://gcc.gnu.org/onlinedocs/gcc/Volatiles.html ). The only
>>>>>>>> compiler
>>>>>>>> known to explicitly provide the wanted semantics with volatile is
>>>>>>>> MSVC
>>>>>>>> 2010 for windows (
>>>>>>>> https://msdn.microsoft.com/en-us/library/12a04hfd(v=vs.100).aspx ).
>>>>>>>> Compilers not giving explicit guarantees, must be considered
>>>>>>>> unsafe and
>>>>>>>> revert to conservative behaviour.
>>>>>>>>
>>>>>>>> This was brought to attention after causing bugs, but was only
>>>>>>>> fixed for
>>>>>>>> x86 linux. This is a fundamental issue inherent to all TSO platforms
>>>>>>>> except windows, and has to be fixed on all of them.
>>>>>>>>
>>>>>>>> Several barriers are unsafe to use because they lack compiler
>>>>>>>> reordering
>>>>>>>> constraints (e.g. fence and acquire on linux_SPARC). For TSO
>>>>>>>> platforms
>>>>>>>> they are typically implemented using dummy loads and stores. This
>>>>>>>> seems
>>>>>>>> to be another old volatile reliance that I fixed. These barriers
>>>>>>>> sometimes have omitted compiler barriers (which is what we really
>>>>>>>> want).
>>>>>>>> This seems to be another example on incorrect reliance on the
>>>>>>>> volatile
>>>>>>>> semantics to help us. Therefore dummy loads/stores have been
>>>>>>>> replaced
>>>>>>>> with compiler barriers on TSO platforms.
>>>>>>>>
>>>>>>>> It is also worth noting that compilers like sun studio did
>>>>>>>> previously
>>>>>>>> not support inline asm syntax. Therefore, barriers were
>>>>>>>> implemented in
>>>>>>>> .il-files. However, using them does not give explicit compiler
>>>>>>>> constraints for reordering AFAIK. Therefore, they have been
>>>>>>>> reimplemented using inline asm with explicit compiler reordering
>>>>>>>> constraints, as even sun (solaris?) studio now supports this.
>>>>>>>>
>>>>>>>> The windows variants have added a windows-style _ReadWriteBarrier()
>>>>>>>> compiler barrier similarly.
>>>>>>>>
>>>>>>>> == Strange Hardware Reorderings ==
>>>>>>>>
>>>>>>>> Fixed a weird inconsistency where acquire, loadstore and loadlaod
>>>>>>>> would
>>>>>>>> use isync instead of lwsync for PPC on linux_zero, but not in any
>>>>>>>> other
>>>>>>>> PPC platform in the repo. I assumed this is wrong and changed it to
>>>>>>>> lwsync instead.
>>>>>>>>
>>>>>>>> == Code Redundancy and Refactoring ==
>>>>>>>>
>>>>>>>> The OrderAccess code looks like it has been developed over a long
>>>>>>>> period
>>>>>>>> of time, with small incremental changes. This seems to have led
>>>>>>>> to a lot
>>>>>>>> of code duplication over time. For example, store_fence variants
>>>>>>>> are not
>>>>>>>> referenced from anywhere, yet contribute to a lot of the code
>>>>>>>> base and a
>>>>>>>> lot of awkwardness (such as being the one only exception not using
>>>>>>>> volatiles for some reason). Moreover, store_fence is not used
>>>>>>>> anywhere
>>>>>>>> in hotspot because it is not a good fit for either the
>>>>>>>> acquire/release
>>>>>>>> semantics or the Java volatile semantics, leaving a big question
>>>>>>>> mark on
>>>>>>>> when it should ever be used. I took the liberty of removing it.
>>>>>>>>
>>>>>>>> Another redundancy issue is that most of the semantics is exactly
>>>>>>>> the
>>>>>>>> same for all platforms, yet all that default boilerplate such as
>>>>>>>> how to
>>>>>>>> make atomic accesses, where acquire/release is supposed to be placed
>>>>>>>> w.r.t. the memory access, what the different barriers should do
>>>>>>>> etc. is
>>>>>>>> copied in redundantly for each os_cpu and each type of memory
>>>>>>>> access for
>>>>>>>> each os_cpu. This makes it extremely painful 1) to understand what
>>>>>>>> actually defines a certain platform compared to the defaults and
>>>>>>>> 2) to
>>>>>>>> correct bugs like those discovered here 3) prevent silly mistakes
>>>>>>>> and
>>>>>>>> bugs, by simply having a lot less code defining the behaviour of
>>>>>>>> OrderAccess that could go wrong.
>>>>>>>>
>>>>>>>> A new architecture/design for OrderAccess is proposed, using a
>>>>>>>> generalization/specialization approach.
>>>>>>>>
>>>>>>>> A general implementation in /share/ defines how things work and
>>>>>>>> splits
>>>>>>>> into different concerns: 1) how to make an atomic memory access, 2)
>>>>>>>> where to but barriers w.r.t. the memory access for things like
>>>>>>>> load_acquire, release_store and release_store_fence, 3) how these
>>>>>>>> barriers are defined.
>>>>>>>>
>>>>>>>> This allows a clear split between what is required for following the
>>>>>>>> specifications, and optimizations, which become much more
>>>>>>>> readable and
>>>>>>>> only optimizations need to be reviewed in depth as the defaults can
>>>>>>>> always be trusted given correct standalone barriers.
>>>>>>>>
>>>>>>>> The only thing a platform is required to specify, is what an
>>>>>>>> implementation of acquire(), release() and fence() should do. If
>>>>>>>> they
>>>>>>>> are implemented properly, everything in OrderAccess is guaranteed to
>>>>>>>> work according to specification using the generalized code. This
>>>>>>>> makes
>>>>>>>> it very easy to support new ports. ALL the other code in the os_cpu
>>>>>>>> files is used /only/ for optimization purposes offered for specific
>>>>>>>> configurations.
>>>>>>>>
>>>>>>>> However, it is highly customizable so that specific platform can
>>>>>>>> perform
>>>>>>>> any desired optimizations. For instance this load_acquire on PPC is
>>>>>>>> optimized:
>>>>>>>>
>>>>>>>> template<> inline jbyte
>>>>>>>> OrderAccess::specialized_load_acquire<jbyte>
>>>>>>>> (volatile jbyte*  p) { register jbyte t = load(p);
>>>>>>>> inlasm_acquire_reg(t); return t; }
>>>>>>>>
>>>>>>>> This overrides the whole load_acquire implementation to do something
>>>>>>>> custom. Platforms like x86 extensively use this for joined fencing
>>>>>>>> variants to optimize.
>>>>>>>>
>>>>>>>> The default implementation of load_acquire() will make an atomic
>>>>>>>> load()
>>>>>>>> followed by acquire() since the semantics is generalized. The
>>>>>>>> generalized semantics are defined using inlined postfix/prefix calls
>>>>>>>> after/before the atomic access, as defined here:
>>>>>>>>
>>>>>>>> template<> inline void
>>>>>>>> ScopedFenceGeneral<X_ACQUIRE>::postfix()       {
>>>>>>>> OrderAccess::acquire(); }
>>>>>>>> template<> inline void
>>>>>>>> ScopedFenceGeneral<RELEASE_X>::prefix()        {
>>>>>>>> OrderAccess::release(); }
>>>>>>>> template<> inline void
>>>>>>>> ScopedFenceGeneral<RELEASE_X_FENCE>::prefix()  {
>>>>>>>> OrderAccess::release(); }
>>>>>>>> template<> inline void
>>>>>>>> ScopedFenceGeneral<RELEASE_X_FENCE>::postfix() {
>>>>>>>> OrderAccess::fence();   }
>>>>>>>>
>>>>>>>> For platforms that simply wish to override what e.g. acquire
>>>>>>>> means for a
>>>>>>>> joined ordered memory access in general, as different to calling
>>>>>>>> stand
>>>>>>>> alone acquire(), the semantics can be easily overridden for a
>>>>>>>> platform
>>>>>>>> such as windows like on windows:
>>>>>>>>
>>>>>>>> template<> inline void ScopedFence<X_ACQUIRE>::postfix()       { }
>>>>>>>> template<> inline void ScopedFence<RELEASE_X>::prefix()        { }
>>>>>>>> template<> inline void ScopedFence<RELEASE_X_FENCE>::prefix()  { }
>>>>>>>> template<> inline void ScopedFence<RELEASE_X_FENCE>::postfix() {
>>>>>>>> OrderAccess::fence(); }
>>>>>>>>
>>>>>>>> In this example, since Windows (now) has a compiler barrier for
>>>>>>>> acquire,
>>>>>>>> but does not need it for joined accesses since volatile has stronger
>>>>>>>> guarantees on windows, this is enough to specialize that for joined
>>>>>>>> memory accesses, no extra protection is needed.
>>>>>>>>
>>>>>>>> == Backward Compatibility and Transitioning ==
>>>>>>>>
>>>>>>>> Since the newly proposed code is structured differently to before, a
>>>>>>>> #define was added for backward compatibility so that external
>>>>>>>> repositories not adhering to this new architecture do not break.
>>>>>>>> Furthermore, store_release was declared private and marked as
>>>>>>>> deprecated. This allows for a smooth transition into the new
>>>>>>>> style of
>>>>>>>> OrderAccess. When the full transition is made in all known repos,
>>>>>>>> the
>>>>>>>> #define and store_fence may be safely removed, eventually.
>>>>>>>>
>>>>>>>> == Documentation ==
>>>>>>>>
>>>>>>>> The documentation seems old and outdated, describing how it works on
>>>>>>>> SPARC RMO and IA64, which are nowhere to be found in the
>>>>>>>> repository. It
>>>>>>>> also describes things about C++ volatiles which cannot be relied
>>>>>>>> upon.
>>>>>>>> The documentation has been cleaned up to match the current state
>>>>>>>> of the
>>>>>>>> implementation better, with architectures actually found in the
>>>>>>>> repository.
>>>>>>>>
>>>>>>>> == Testing ==
>>>>>>>>
>>>>>>>> JPRT. Big thanks to Jesper Wilhelmsson for helping me test these
>>>>>>>> changes.
>>>>>>>> Ran some DaCapo benchmarks (I know okay :p) for performance
>>>>>>>> regression
>>>>>>>> and there was no perceivable difference.
>>>>>>>>
>>>>>>>> Looking forward to feedback on this, and hope to get some
>>>>>>>> reviews. :)
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Erik
>>>>
>>>>
>