RFR: 7143664: Clean up OrderAccess implementations and usage

Wed Jan 28 06:09:26 UTC 2015

Hi Erik,

Sorry for the delay, it was a long weekend here and I'm playing catch up. :)

On 24/01/2015 9:03 AM, Erik Österlund wrote:
> Hi Dean,
>
> On 23/01/15 22:50, Dean Long wrote:
>> On 1/23/2015 7:16 AM, Erik Österlund wrote:
>>> Hi Dean and David,
>>>
>>> Thanks for reviewing this change.
>>>
>>> On 23/01/15 08:12, David Holmes wrote:
>>>> On 23/01/2015 4:36 PM, Dean Long wrote:
>>>>>       147 // Ordering a load relative to preceding stores requires a fence,
>>>>>       148 // which implies a membar #StoreLoad between the store and load under
>>>>>       149 // sparc-TSO.  A fence is required by x86.  On x86, we use explicitly
>>>>>       150 // locked add.
>>>>>       151 //
>>>>>
>>>>> It sounds like the above is saying that fence is the same as StoreLoad ...
>>>> That may be a s/store_fence/fence/ typo. The original text was:
>>>>
>>>> // Ordering a load relative to preceding stores requires a store_fence,
>>>> // which implies a membar #StoreLoad between the store and load under
>>>> // sparc-TSO.  A fence is required by ia64.  On x86, we use locked xchg.
>>>>
>>>> Actually seems like a couple of typos there: ia64 became x86, then we
>>>> have x86 again.
>>> I tried to remove ia64 from the comments as it is not in the repository,
>>> which in this case became a bit redundant. Are we okay if I remove the
>>> "A fence is required by x86" sentence?
>>>
>>> Oh and Dean about membar #StoreLoad on SPARC and locked add on x86...
>>> are you saying they are not equivalent?
>>
>> I wasn't trying to say anything abou SPARC or x86, only that "fence" !=
>> "StoreLoad".
>
> Agreed.
>
>>> New version of that paragraph with that sentence removed:
>>>
>>> 147 // Ordering a load relative to preceding stores requires a fence,
>>> 148 // which implies a membar #StoreLoad between the store and load under
>>> 149 // sparc-TSO. On x86, we use explicitly locked add.
>>
>> I would replace "fence" with "StoreLoad" above.
>
> Again, I just changed any store_fence descriptions to use fence instead.
> But you are certainly right, it is more accurate to talk about StoreLoad
> here and not mention fence in the first place, I'll change this, thanks
> for the feedback. :)
>
> Changed to:
>
> 147 // Ordering a load relative to preceding stores requires a StoreLoad,
> 148 // which implies a membar #StoreLoad between the store and load under
> 149 // sparc-TSO. On x86, we use explicitly locked add.

The confusion here is what "store-fence" was intended to mean. In this 
part we're implying it is a storeLoad, but in the discussion below it 
indicates store-fence orders subsequent loads and stores - which makes 
it a storeStore|storeLoad barrier. Of course storeLoad subsumes 
storeStore so the above is accurate - though I think the original text 
would have been more clearly stated as "Ordering a load relative to 
preceding stores _can be achieved using a_ store-fence, ..."

So in summary: update above is okay.

>>>>>       152 // 4. store, load  <= is constrained by => store, fence, load
>>>>>       153 //
>>>>>       154 // Use store, fence to make sure all stores done in an 'interesting'
>>>>>       155 // region are made visible prior to both subsequent loads and stores.
>>>>>
>>>>> ... and this is like saying to use fence because StoreStore | StoreLoad
>>>>> isn't available.
>>>> Again this seems an issue with store_fence being changed to "store,
>>>> fence"  which doesn't really make sense.
>>> The thing is that I removed store_fence (not store_release as it says in
>>> the bug ID, sorry about that - there obviously is no store_release). So
>>> I wanted to remove references in the comments to store_fence as well.
>>> Therefore I changed the example to use store fence load instead of
>>> store_fence load. Perhaps it's better if I remove that sample all
>>> together instead?
>>
>> That works for me.
>
> Fixed.

Removal is an option but again replacing store-fence with what it really 
means "store; membar<storeStore|StoreLoad>" would also suffice. But if 
we don't need the combined  membar<storeStore|StoreLoad> such that 
abstracting it into something like store_fence() makes sense, then 
getting rid of everything pertaining to store-fence makes sense. I see 
there are no uses of store_fence() other than a commented out reference 
in ./cpu/x86/vm/c1_LIRAssembler_x86.cpp, which also contains a similar 
commented out reference to the non-existent load_fence() - both of which 
should be removed.

If you'd like to send me an updated patch I will generate a new webrev. 
Then I can get on with the functional part of the review.

Thanks,
David

> Thanks,
> /Erik
>
>> dl
>>
>>> Thanks for reviewing my changes!
>>>
>>> /Erik
>>>
>>>> David
>>>>
>>>>>> Furthermore, store_release was declared private and marked as
>>>>> deprecated.
>>>>>
>>>>> I can't find where this was done.
>>>>>
>>>>> dl
>>>>>
>>>>> On 1/22/2015 5:19 PM, Erik Österlund wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> == Summary of Changes ==
>>>>>>
>>>>>> This changeset fixes issues in OrderAccess on multiple levels from the
>>>>>> memory model semantics to compiler reorderings, to addressing
>>>>>> maintainability/portability issues which (almost) had to be fixed in
>>>>>> order to fix the correctness issues. It is the result of discussions
>>>>>> found in the previous "OrderAccess Refactoring" thread:
>>>>>> http://openjdk.5641.n7.nabble.com/OrderAccess-Refactoring-td212050.html
>>>>>>
>>>>>> Bug ID:
>>>>>> https://bugs.openjdk.java.net/browse/JDK-7143664
>>>>>> (updated to reflect these related changes)
>>>>>>
>>>>>> Webrev:
>>>>>> http://cr.openjdk.java.net/~dholmes/7143664/webrev/
>>>>>>
>>>>>> Before I describe more I would like to give special thanks to David
>>>>>> Holmes for long discussions leading up to the currently proposed
>>>>>> changes. I would also like to thank Jesper Wilhelmsson for helping me
>>>>>> run my changes through JPRT.
>>>>>>
>>>>>> == Motivation ==
>>>>>>
>>>>>> This change directly fixes a reported OrderAccess bug due to compiler
>>>>>> reorderings which is still a vulnerability on almost all TSO platforms:
>>>>>> https://bugs.openjdk.java.net/browse/JDK-806196
>>>>>>
>>>>>> And directly fixes confusions like release_store() != release() store()
>>>>>> due to memory model issues previously described in this bug ID.
>>>>>>
>>>>>> At the same time it provides clearer design with separation of concerns
>>>>>> and generalization/specialization, removing a whole bunch of platform
>>>>>> specific code which could be generalized. The platform specific files
>>>>>> now only have a few LoC requirements (~7) to conform to the memory model
>>>>>> by specifying what the stand alone barriers do. Then optionally
>>>>>> optimizations to the general approach are possible if platforms support
>>>>>> it. This also makes it much easier to port to new platforms.
>>>>>>
>>>>>> == Memory Model ==
>>>>>>
>>>>>> The current definitions of acquire/release semantics are a bit fishy
>>>>>> leading to problems previously described in the bug ID (release_store()
>>>>>> != release() store()) and some other correctness issues. It has
>>>>>> therefore been replaced with a new model. I would like to thank David
>>>>>> Holmes for the long discussions leading up to the newly proposed model.
>>>>>>
>>>>>> The new model is formally defined like this:
>>>>>>
>>>>>> // T1: access_shared_data
>>>>>> // T1: ]release
>>>>>> // T1: (...)
>>>>>> // T1: store(X)
>>>>>> //
>>>>>> // T2: load(X)
>>>>>> // T2: (...)
>>>>>> // T2: acquire[
>>>>>> // T2: access_shared_data
>>>>>> //
>>>>>> // It is guaranteed that if T2: load(X) synchronizes with (observes the
>>>>>> // value written by) T1: store(X), then the memory accesses before the
>>>>>> // T1: ]release happen before the memory accesses after the T2: acquire[.
>>>>>>
>>>>>> The orderAccess.hpp file and bug ID also has a few additional
>>>>>> explanations making it more intuitive to the user when to use
>>>>>> acquire/release and the resemblance to TSO abstract machines. Big thanks
>>>>>> goes to David Holmes for discussing the memory model with me, which
>>>>>> helped a lot in deriving it.
>>>>>>
>>>>>> Now it holds that release() store() == release_store(), and the other
>>>>>> correctness issues are fixed as well.
>>>>>>
>>>>>> The new model is also closer to C++11 definitions which could give us
>>>>>> more relaxed compiler reordering constraints in the future when compiler
>>>>>> support for C++11 is there and ready.
>>>>>>
>>>>>> == Reliance on C++ Volatile Semantics ==
>>>>>>
>>>>>> The C++ standard section 1.9 "Program Execution" is very vague about
>>>>>> what the keyword volatile can actually do for us. It gives clear
>>>>>> constraints in terms of volatile-volatile accesses but says little about
>>>>>> nonvolatile-volatile accesses. Yet the current implementation heavily
>>>>>> relies upon volatile to in terms of compiler reordering. But GCC
>>>>>> explicitly declares that volatiles and non-volatiles may reorder freely
>>>>>> ( https://gcc.gnu.org/onlinedocs/gcc/Volatiles.html ). The only compiler
>>>>>> known to explicitly provide the wanted semantics with volatile is MSVC
>>>>>> 2010 for windows (
>>>>>> https://msdn.microsoft.com/en-us/library/12a04hfd(v=vs.100).aspx ).
>>>>>> Compilers not giving explicit guarantees, must be considered unsafe and
>>>>>> revert to conservative behaviour.
>>>>>>
>>>>>> This was brought to attention after causing bugs, but was only fixed for
>>>>>> x86 linux. This is a fundamental issue inherent to all TSO platforms
>>>>>> except windows, and has to be fixed on all of them.
>>>>>>
>>>>>> Several barriers are unsafe to use because they lack compiler reordering
>>>>>> constraints (e.g. fence and acquire on linux_SPARC). For TSO platforms
>>>>>> they are typically implemented using dummy loads and stores. This seems
>>>>>> to be another old volatile reliance that I fixed. These barriers
>>>>>> sometimes have omitted compiler barriers (which is what we really want).
>>>>>> This seems to be another example on incorrect reliance on the volatile
>>>>>> semantics to help us. Therefore dummy loads/stores have been replaced
>>>>>> with compiler barriers on TSO platforms.
>>>>>>
>>>>>> It is also worth noting that compilers like sun studio did previously
>>>>>> not support inline asm syntax. Therefore, barriers were implemented in
>>>>>> .il-files. However, using them does not give explicit compiler
>>>>>> constraints for reordering AFAIK. Therefore, they have been
>>>>>> reimplemented using inline asm with explicit compiler reordering
>>>>>> constraints, as even sun (solaris?) studio now supports this.
>>>>>>
>>>>>> The windows variants have added a windows-style _ReadWriteBarrier()
>>>>>> compiler barrier similarly.
>>>>>>
>>>>>> == Strange Hardware Reorderings ==
>>>>>>
>>>>>> Fixed a weird inconsistency where acquire, loadstore and loadlaod would
>>>>>> use isync instead of lwsync for PPC on linux_zero, but not in any other
>>>>>> PPC platform in the repo. I assumed this is wrong and changed it to
>>>>>> lwsync instead.
>>>>>>
>>>>>> == Code Redundancy and Refactoring ==
>>>>>>
>>>>>> The OrderAccess code looks like it has been developed over a long period
>>>>>> of time, with small incremental changes. This seems to have led to a lot
>>>>>> of code duplication over time. For example, store_fence variants are not
>>>>>> referenced from anywhere, yet contribute to a lot of the code base and a
>>>>>> lot of awkwardness (such as being the one only exception not using
>>>>>> volatiles for some reason). Moreover, store_fence is not used anywhere
>>>>>> in hotspot because it is not a good fit for either the acquire/release
>>>>>> semantics or the Java volatile semantics, leaving a big question mark on
>>>>>> when it should ever be used. I took the liberty of removing it.
>>>>>>
>>>>>> Another redundancy issue is that most of the semantics is exactly the
>>>>>> same for all platforms, yet all that default boilerplate such as how to
>>>>>> make atomic accesses, where acquire/release is supposed to be placed
>>>>>> w.r.t. the memory access, what the different barriers should do etc. is
>>>>>> copied in redundantly for each os_cpu and each type of memory access for
>>>>>> each os_cpu. This makes it extremely painful 1) to understand what
>>>>>> actually defines a certain platform compared to the defaults and 2) to
>>>>>> correct bugs like those discovered here 3) prevent silly mistakes and
>>>>>> bugs, by simply having a lot less code defining the behaviour of
>>>>>> OrderAccess that could go wrong.
>>>>>>
>>>>>> A new architecture/design for OrderAccess is proposed, using a
>>>>>> generalization/specialization approach.
>>>>>>
>>>>>> A general implementation in /share/ defines how things work and splits
>>>>>> into different concerns: 1) how to make an atomic memory access, 2)
>>>>>> where to but barriers w.r.t. the memory access for things like
>>>>>> load_acquire, release_store and release_store_fence, 3) how these
>>>>>> barriers are defined.
>>>>>>
>>>>>> This allows a clear split between what is required for following the
>>>>>> specifications, and optimizations, which become much more readable and
>>>>>> only optimizations need to be reviewed in depth as the defaults can
>>>>>> always be trusted given correct standalone barriers.
>>>>>>
>>>>>> The only thing a platform is required to specify, is what an
>>>>>> implementation of acquire(), release() and fence() should do. If they
>>>>>> are implemented properly, everything in OrderAccess is guaranteed to
>>>>>> work according to specification using the generalized code. This makes
>>>>>> it very easy to support new ports. ALL the other code in the os_cpu
>>>>>> files is used /only/ for optimization purposes offered for specific
>>>>>> configurations.
>>>>>>
>>>>>> However, it is highly customizable so that specific platform can perform
>>>>>> any desired optimizations. For instance this load_acquire on PPC is
>>>>>> optimized:
>>>>>>
>>>>>> template<> inline jbyte  OrderAccess::specialized_load_acquire<jbyte>
>>>>>> (volatile jbyte*  p) { register jbyte t = load(p);
>>>>>> inlasm_acquire_reg(t); return t; }
>>>>>>
>>>>>> This overrides the whole load_acquire implementation to do something
>>>>>> custom. Platforms like x86 extensively use this for joined fencing
>>>>>> variants to optimize.
>>>>>>
>>>>>> The default implementation of load_acquire() will make an atomic load()
>>>>>> followed by acquire() since the semantics is generalized. The
>>>>>> generalized semantics are defined using inlined postfix/prefix calls
>>>>>> after/before the atomic access, as defined here:
>>>>>>
>>>>>> template<> inline void ScopedFenceGeneral<X_ACQUIRE>::postfix()       {
>>>>>> OrderAccess::acquire(); }
>>>>>> template<> inline void ScopedFenceGeneral<RELEASE_X>::prefix()        {
>>>>>> OrderAccess::release(); }
>>>>>> template<> inline void ScopedFenceGeneral<RELEASE_X_FENCE>::prefix()  {
>>>>>> OrderAccess::release(); }
>>>>>> template<> inline void ScopedFenceGeneral<RELEASE_X_FENCE>::postfix() {
>>>>>> OrderAccess::fence();   }
>>>>>>
>>>>>> For platforms that simply wish to override what e.g. acquire means for a
>>>>>> joined ordered memory access in general, as different to calling stand
>>>>>> alone acquire(), the semantics can be easily overridden for a platform
>>>>>> such as windows like on windows:
>>>>>>
>>>>>> template<> inline void ScopedFence<X_ACQUIRE>::postfix()       { }
>>>>>> template<> inline void ScopedFence<RELEASE_X>::prefix()        { }
>>>>>> template<> inline void ScopedFence<RELEASE_X_FENCE>::prefix()  { }
>>>>>> template<> inline void ScopedFence<RELEASE_X_FENCE>::postfix() {
>>>>>> OrderAccess::fence(); }
>>>>>>
>>>>>> In this example, since Windows (now) has a compiler barrier for acquire,
>>>>>> but does not need it for joined accesses since volatile has stronger
>>>>>> guarantees on windows, this is enough to specialize that for joined
>>>>>> memory accesses, no extra protection is needed.
>>>>>>
>>>>>> == Backward Compatibility and Transitioning ==
>>>>>>
>>>>>> Since the newly proposed code is structured differently to before, a
>>>>>> #define was added for backward compatibility so that external
>>>>>> repositories not adhering to this new architecture do not break.
>>>>>> Furthermore, store_release was declared private and marked as
>>>>>> deprecated. This allows for a smooth transition into the new style of
>>>>>> OrderAccess. When the full transition is made in all known repos, the
>>>>>> #define and store_fence may be safely removed, eventually.
>>>>>>
>>>>>> == Documentation ==
>>>>>>
>>>>>> The documentation seems old and outdated, describing how it works on
>>>>>> SPARC RMO and IA64, which are nowhere to be found in the repository. It
>>>>>> also describes things about C++ volatiles which cannot be relied upon.
>>>>>> The documentation has been cleaned up to match the current state of the
>>>>>> implementation better, with architectures actually found in the
>>>>>> repository.
>>>>>>
>>>>>> == Testing ==
>>>>>>
>>>>>> JPRT. Big thanks to Jesper Wilhelmsson for helping me test these changes.
>>>>>> Ran some DaCapo benchmarks (I know okay :p) for performance regression
>>>>>> and there was no perceivable difference.
>>>>>>
>>>>>> Looking forward to feedback on this, and hope to get some reviews. :)
>>>>>>
>>>>>> Thanks,
>>>>>> Erik
>>
>>