RFR: 7143664: Clean up OrderAccess implementations and usage

Fri Feb 20 16:04:35 UTC 2015

Erik,

I am so grateful you tackled this problem  I very much appreciate the additional potential
stability with the fixes.

Thank you to you and David for walking through the compiler and platform details, and
for checking that the generated code here makes sense.

Comments/questions on webrev.v3 open
0. orderAccess.hpp
Thank you for the updated comments and table - very helpful!

1. orderAccess.hpp
Were you going to add a comment about MSVC assumptions under C++ Volatile Semantics?

Is it worth documenting in source somewhere what minimal versions of compilers are assumed?
Or which versions of compilers were tested? I think that would help us in future please.
Maybe in the platform-specific files?

2. orderAccess_windows_x86.inline.hpp
Could you possibly add to the comment about MSVC, that this assumes all of the Scoped templates explicitly cast the addresses to volatiles?
(In case someone changes that some day?)

3. orderAccess_windows_x86.inline.hpp
So I looked up _ReadWriteBarrier and VS2012 and VS2013 say this is deprecated. 
JDK8 builds on VS2010, so ok. JDK9 appears to be planning to move to VS2013, so we may need to change this.
It appears that the recommendation is volatile (with default on x86 of /volatile:ms). Can we use __asm__volatile("" : : : "memory";
compiler _barrier for VS2013 or is there an equivalent you know about?

4. orderAccess_windows_x86.inline.hpp, OrderAccess_bsd_x86.inline.hpp
specialized_release_store_fence<jfloat> gets turned into a jint*?
I have avoided FP all my life, so I had assumed that the old *p = v when declared jfloat*/jfloat would use floating point
and the new one would cast to an int? Or does this work in the debugger? If so, a comment would be helpful.

So - do you have plans to do additional changes to the OrderAccess logic? Beyond getting rid of the obsolete entries 
when all platforms are ok with it?

Looks like we have a couple of follow-up exercises:
1) check our is_MP usage to ensure we get the compiler_barriers in both paths if we are not using volatile declarations for
all compiler ordering
2) try applying these templates to any other platforms to see how well they work for us as maintainers 
3) figure out why we use StubRoutines::fence on amd64 Windows but on linux we just use lock; addl 0(sp), MSVC limitation?

I did not review ppc, zero, arm.

thanks!
Karen
On Jan 22, 2015, at 8:19 PM, Erik Österlund wrote:

> Hi all,
> 
> == Summary of Changes ==
> 
> This changeset fixes issues in OrderAccess on multiple levels from the 
> memory model semantics to compiler reorderings, to addressing 
> maintainability/portability issues which (almost) had to be fixed in 
> order to fix the correctness issues. It is the result of discussions 
> found in the previous "OrderAccess Refactoring" thread: 
> http://openjdk.5641.n7.nabble.com/OrderAccess-Refactoring-td212050.html
> 
> Bug ID:
> https://bugs.openjdk.java.net/browse/JDK-7143664
> (updated to reflect these related changes)
> 
> Webrev:
> http://cr.openjdk.java.net/~dholmes/7143664/webrev/
> 
> Before I describe more I would like to give special thanks to David 
> Holmes for long discussions leading up to the currently proposed 
> changes. I would also like to thank Jesper Wilhelmsson for helping me 
> run my changes through JPRT.
> 
> == Motivation ==
> 
> This change directly fixes a reported OrderAccess bug due to compiler 
> reorderings which is still a vulnerability on almost all TSO platforms:
> https://bugs.openjdk.java.net/browse/JDK-806196
> 
> And directly fixes confusions like release_store() != release() store() 
> due to memory model issues previously described in this bug ID.
> 
> At the same time it provides clearer design with separation of concerns 
> and generalization/specialization, removing a whole bunch of platform 
> specific code which could be generalized. The platform specific files 
> now only have a few LoC requirements (~7) to conform to the memory model 
> by specifying what the stand alone barriers do. Then optionally 
> optimizations to the general approach are possible if platforms support 
> it. This also makes it much easier to port to new platforms.
> 
> == Memory Model ==
> 
> The current definitions of acquire/release semantics are a bit fishy 
> leading to problems previously described in the bug ID (release_store() 
> != release() store()) and some other correctness issues. It has 
> therefore been replaced with a new model. I would like to thank David 
> Holmes for the long discussions leading up to the newly proposed model.
> 
> The new model is formally defined like this:
> 
> // T1: access_shared_data
> // T1: ]release
> // T1: (...)
> // T1: store(X)
> //
> // T2: load(X)
> // T2: (...)
> // T2: acquire[
> // T2: access_shared_data
> //
> // It is guaranteed that if T2: load(X) synchronizes with (observes the
> // value written by) T1: store(X), then the memory accesses before the
> // T1: ]release happen before the memory accesses after the T2: acquire[.
> 
> The orderAccess.hpp file and bug ID also has a few additional 
> explanations making it more intuitive to the user when to use 
> acquire/release and the resemblance to TSO abstract machines. Big thanks 
> goes to David Holmes for discussing the memory model with me, which 
> helped a lot in deriving it.
> 
> Now it holds that release() store() == release_store(), and the other 
> correctness issues are fixed as well.
> 
> The new model is also closer to C++11 definitions which could give us 
> more relaxed compiler reordering constraints in the future when compiler 
> support for C++11 is there and ready.
> 
> == Reliance on C++ Volatile Semantics ==
> 
> The C++ standard section 1.9 "Program Execution" is very vague about 
> what the keyword volatile can actually do for us. It gives clear 
> constraints in terms of volatile-volatile accesses but says little about 
> nonvolatile-volatile accesses. Yet the current implementation heavily 
> relies upon volatile to in terms of compiler reordering. But GCC 
> explicitly declares that volatiles and non-volatiles may reorder freely 
> ( https://gcc.gnu.org/onlinedocs/gcc/Volatiles.html ). The only compiler 
> known to explicitly provide the wanted semantics with volatile is MSVC 
> 2010 for windows ( 
> https://msdn.microsoft.com/en-us/library/12a04hfd(v=vs.100).aspx ). 
> Compilers not giving explicit guarantees, must be considered unsafe and 
> revert to conservative behaviour.
> 
> This was brought to attention after causing bugs, but was only fixed for 
> x86 linux. This is a fundamental issue inherent to all TSO platforms 
> except windows, and has to be fixed on all of them.
> 
> Several barriers are unsafe to use because they lack compiler reordering 
> constraints (e.g. fence and acquire on linux_SPARC). For TSO platforms 
> they are typically implemented using dummy loads and stores. This seems 
> to be another old volatile reliance that I fixed. These barriers 
> sometimes have omitted compiler barriers (which is what we really want). 
> This seems to be another example on incorrect reliance on the volatile 
> semantics to help us. Therefore dummy loads/stores have been replaced 
> with compiler barriers on TSO platforms.
> 
> It is also worth noting that compilers like sun studio did previously 
> not support inline asm syntax. Therefore, barriers were implemented in 
> .il-files. However, using them does not give explicit compiler 
> constraints for reordering AFAIK. Therefore, they have been 
> reimplemented using inline asm with explicit compiler reordering 
> constraints, as even sun (solaris?) studio now supports this.
> 
> The windows variants have added a windows-style _ReadWriteBarrier() 
> compiler barrier similarly.
> 
> == Strange Hardware Reorderings ==
> 
> Fixed a weird inconsistency where acquire, loadstore and loadlaod would 
> use isync instead of lwsync for PPC on linux_zero, but not in any other 
> PPC platform in the repo. I assumed this is wrong and changed it to 
> lwsync instead.
> 
> == Code Redundancy and Refactoring ==
> 
> The OrderAccess code looks like it has been developed over a long period 
> of time, with small incremental changes. This seems to have led to a lot 
> of code duplication over time. For example, store_fence variants are not 
> referenced from anywhere, yet contribute to a lot of the code base and a 
> lot of awkwardness (such as being the one only exception not using 
> volatiles for some reason). Moreover, store_fence is not used anywhere 
> in hotspot because it is not a good fit for either the acquire/release 
> semantics or the Java volatile semantics, leaving a big question mark on 
> when it should ever be used. I took the liberty of removing it.
> 
> Another redundancy issue is that most of the semantics is exactly the 
> same for all platforms, yet all that default boilerplate such as how to 
> make atomic accesses, where acquire/release is supposed to be placed 
> w.r.t. the memory access, what the different barriers should do etc. is 
> copied in redundantly for each os_cpu and each type of memory access for 
> each os_cpu. This makes it extremely painful 1) to understand what 
> actually defines a certain platform compared to the defaults and 2) to 
> correct bugs like those discovered here 3) prevent silly mistakes and 
> bugs, by simply having a lot less code defining the behaviour of 
> OrderAccess that could go wrong.
> 
> A new architecture/design for OrderAccess is proposed, using a 
> generalization/specialization approach.
> 
> A general implementation in /share/ defines how things work and splits 
> into different concerns: 1) how to make an atomic memory access, 2) 
> where to but barriers w.r.t. the memory access for things like 
> load_acquire, release_store and release_store_fence, 3) how these 
> barriers are defined.
> 
> This allows a clear split between what is required for following the 
> specifications, and optimizations, which become much more readable and 
> only optimizations need to be reviewed in depth as the defaults can 
> always be trusted given correct standalone barriers.
> 
> The only thing a platform is required to specify, is what an 
> implementation of acquire(), release() and fence() should do. If they 
> are implemented properly, everything in OrderAccess is guaranteed to 
> work according to specification using the generalized code. This makes 
> it very easy to support new ports. ALL the other code in the os_cpu 
> files is used /only/ for optimization purposes offered for specific 
> configurations.
> 
> However, it is highly customizable so that specific platform can perform 
> any desired optimizations. For instance this load_acquire on PPC is 
> optimized:
> 
> template<> inline jbyte  OrderAccess::specialized_load_acquire<jbyte> 
> (volatile jbyte*  p) { register jbyte t = load(p); 
> inlasm_acquire_reg(t); return t; }
> 
> This overrides the whole load_acquire implementation to do something 
> custom. Platforms like x86 extensively use this for joined fencing 
> variants to optimize.
> 
> The default implementation of load_acquire() will make an atomic load() 
> followed by acquire() since the semantics is generalized. The 
> generalized semantics are defined using inlined postfix/prefix calls 
> after/before the atomic access, as defined here:
> 
> template<> inline void ScopedFenceGeneral<X_ACQUIRE>::postfix()       { 
> OrderAccess::acquire(); }
> template<> inline void ScopedFenceGeneral<RELEASE_X>::prefix()        { 
> OrderAccess::release(); }
> template<> inline void ScopedFenceGeneral<RELEASE_X_FENCE>::prefix()  { 
> OrderAccess::release(); }
> template<> inline void ScopedFenceGeneral<RELEASE_X_FENCE>::postfix() { 
> OrderAccess::fence();   }
> 
> For platforms that simply wish to override what e.g. acquire means for a 
> joined ordered memory access in general, as different to calling stand 
> alone acquire(), the semantics can be easily overridden for a platform 
> such as windows like on windows:
> 
> template<> inline void ScopedFence<X_ACQUIRE>::postfix()       { }
> template<> inline void ScopedFence<RELEASE_X>::prefix()        { }
> template<> inline void ScopedFence<RELEASE_X_FENCE>::prefix()  { }
> template<> inline void ScopedFence<RELEASE_X_FENCE>::postfix() { 
> OrderAccess::fence(); }
> 
> In this example, since Windows (now) has a compiler barrier for acquire, 
> but does not need it for joined accesses since volatile has stronger 
> guarantees on windows, this is enough to specialize that for joined 
> memory accesses, no extra protection is needed.
> 
> == Backward Compatibility and Transitioning ==
> 
> Since the newly proposed code is structured differently to before, a 
> #define was added for backward compatibility so that external 
> repositories not adhering to this new architecture do not break. 
> Furthermore, store_release was declared private and marked as 
> deprecated. This allows for a smooth transition into the new style of 
> OrderAccess. When the full transition is made in all known repos, the 
> #define and store_fence may be safely removed, eventually.
> 
> == Documentation ==
> 
> The documentation seems old and outdated, describing how it works on 
> SPARC RMO and IA64, which are nowhere to be found in the repository. It 
> also describes things about C++ volatiles which cannot be relied upon. 
> The documentation has been cleaned up to match the current state of the 
> implementation better, with architectures actually found in the repository.
> 
> == Testing ==
> 
> JPRT. Big thanks to Jesper Wilhelmsson for helping me test these changes.
> Ran some DaCapo benchmarks (I know okay :p) for performance regression 
> and there was no perceivable difference.
> 
> Looking forward to feedback on this, and hope to get some reviews. :)
> 
> Thanks,
> Erik