RFR: 7143664: Clean up OrderAccess implementations and usage

Fri Jan 23 01:19:12 UTC 2015

Hi all,

== Summary of Changes ==

This changeset fixes issues in OrderAccess on multiple levels from the 
memory model semantics to compiler reorderings, to addressing 
maintainability/portability issues which (almost) had to be fixed in 
order to fix the correctness issues. It is the result of discussions 
found in the previous "OrderAccess Refactoring" thread: 
http://openjdk.5641.n7.nabble.com/OrderAccess-Refactoring-td212050.html

Bug ID:
https://bugs.openjdk.java.net/browse/JDK-7143664
(updated to reflect these related changes)

Webrev:
http://cr.openjdk.java.net/~dholmes/7143664/webrev/

Before I describe more I would like to give special thanks to David 
Holmes for long discussions leading up to the currently proposed 
changes. I would also like to thank Jesper Wilhelmsson for helping me 
run my changes through JPRT.

== Motivation ==

This change directly fixes a reported OrderAccess bug due to compiler 
reorderings which is still a vulnerability on almost all TSO platforms:
https://bugs.openjdk.java.net/browse/JDK-806196

And directly fixes confusions like release_store() != release() store() 
due to memory model issues previously described in this bug ID.

At the same time it provides clearer design with separation of concerns 
and generalization/specialization, removing a whole bunch of platform 
specific code which could be generalized. The platform specific files 
now only have a few LoC requirements (~7) to conform to the memory model 
by specifying what the stand alone barriers do. Then optionally 
optimizations to the general approach are possible if platforms support 
it. This also makes it much easier to port to new platforms.

== Memory Model ==

The current definitions of acquire/release semantics are a bit fishy 
leading to problems previously described in the bug ID (release_store() 
!= release() store()) and some other correctness issues. It has 
therefore been replaced with a new model. I would like to thank David 
Holmes for the long discussions leading up to the newly proposed model.

The new model is formally defined like this:

// T1: access_shared_data
// T1: ]release
// T1: (...)
// T1: store(X)
//
// T2: load(X)
// T2: (...)
// T2: acquire[
// T2: access_shared_data
//
// It is guaranteed that if T2: load(X) synchronizes with (observes the
// value written by) T1: store(X), then the memory accesses before the
// T1: ]release happen before the memory accesses after the T2: acquire[.

The orderAccess.hpp file and bug ID also has a few additional 
explanations making it more intuitive to the user when to use 
acquire/release and the resemblance to TSO abstract machines. Big thanks 
goes to David Holmes for discussing the memory model with me, which 
helped a lot in deriving it.

Now it holds that release() store() == release_store(), and the other 
correctness issues are fixed as well.

The new model is also closer to C++11 definitions which could give us 
more relaxed compiler reordering constraints in the future when compiler 
support for C++11 is there and ready.

== Reliance on C++ Volatile Semantics ==

The C++ standard section 1.9 "Program Execution" is very vague about 
what the keyword volatile can actually do for us. It gives clear 
constraints in terms of volatile-volatile accesses but says little about 
nonvolatile-volatile accesses. Yet the current implementation heavily 
relies upon volatile to in terms of compiler reordering. But GCC 
explicitly declares that volatiles and non-volatiles may reorder freely 
( https://gcc.gnu.org/onlinedocs/gcc/Volatiles.html ). The only compiler 
known to explicitly provide the wanted semantics with volatile is MSVC 
2010 for windows ( 
https://msdn.microsoft.com/en-us/library/12a04hfd(v=vs.100).aspx ). 
Compilers not giving explicit guarantees, must be considered unsafe and 
revert to conservative behaviour.

This was brought to attention after causing bugs, but was only fixed for 
x86 linux. This is a fundamental issue inherent to all TSO platforms 
except windows, and has to be fixed on all of them.

Several barriers are unsafe to use because they lack compiler reordering 
constraints (e.g. fence and acquire on linux_SPARC). For TSO platforms 
they are typically implemented using dummy loads and stores. This seems 
to be another old volatile reliance that I fixed. These barriers 
sometimes have omitted compiler barriers (which is what we really want). 
This seems to be another example on incorrect reliance on the volatile 
semantics to help us. Therefore dummy loads/stores have been replaced 
with compiler barriers on TSO platforms.

It is also worth noting that compilers like sun studio did previously 
not support inline asm syntax. Therefore, barriers were implemented in 
.il-files. However, using them does not give explicit compiler 
constraints for reordering AFAIK. Therefore, they have been 
reimplemented using inline asm with explicit compiler reordering 
constraints, as even sun (solaris?) studio now supports this.

The windows variants have added a windows-style _ReadWriteBarrier() 
compiler barrier similarly.

== Strange Hardware Reorderings ==

Fixed a weird inconsistency where acquire, loadstore and loadlaod would 
use isync instead of lwsync for PPC on linux_zero, but not in any other 
PPC platform in the repo. I assumed this is wrong and changed it to 
lwsync instead.

== Code Redundancy and Refactoring ==

The OrderAccess code looks like it has been developed over a long period 
of time, with small incremental changes. This seems to have led to a lot 
of code duplication over time. For example, store_fence variants are not 
referenced from anywhere, yet contribute to a lot of the code base and a 
lot of awkwardness (such as being the one only exception not using 
volatiles for some reason). Moreover, store_fence is not used anywhere 
in hotspot because it is not a good fit for either the acquire/release 
semantics or the Java volatile semantics, leaving a big question mark on 
when it should ever be used. I took the liberty of removing it.

Another redundancy issue is that most of the semantics is exactly the 
same for all platforms, yet all that default boilerplate such as how to 
make atomic accesses, where acquire/release is supposed to be placed 
w.r.t. the memory access, what the different barriers should do etc. is 
copied in redundantly for each os_cpu and each type of memory access for 
each os_cpu. This makes it extremely painful 1) to understand what 
actually defines a certain platform compared to the defaults and 2) to 
correct bugs like those discovered here 3) prevent silly mistakes and 
bugs, by simply having a lot less code defining the behaviour of 
OrderAccess that could go wrong.

A new architecture/design for OrderAccess is proposed, using a 
generalization/specialization approach.

A general implementation in /share/ defines how things work and splits 
into different concerns: 1) how to make an atomic memory access, 2) 
where to but barriers w.r.t. the memory access for things like 
load_acquire, release_store and release_store_fence, 3) how these 
barriers are defined.

This allows a clear split between what is required for following the 
specifications, and optimizations, which become much more readable and 
only optimizations need to be reviewed in depth as the defaults can 
always be trusted given correct standalone barriers.

The only thing a platform is required to specify, is what an 
implementation of acquire(), release() and fence() should do. If they 
are implemented properly, everything in OrderAccess is guaranteed to 
work according to specification using the generalized code. This makes 
it very easy to support new ports. ALL the other code in the os_cpu 
files is used /only/ for optimization purposes offered for specific 
configurations.

However, it is highly customizable so that specific platform can perform 
any desired optimizations. For instance this load_acquire on PPC is 
optimized:

template<> inline jbyte  OrderAccess::specialized_load_acquire<jbyte> 
(volatile jbyte*  p) { register jbyte t = load(p); 
inlasm_acquire_reg(t); return t; }

This overrides the whole load_acquire implementation to do something 
custom. Platforms like x86 extensively use this for joined fencing 
variants to optimize.

The default implementation of load_acquire() will make an atomic load() 
followed by acquire() since the semantics is generalized. The 
generalized semantics are defined using inlined postfix/prefix calls 
after/before the atomic access, as defined here:

template<> inline void ScopedFenceGeneral<X_ACQUIRE>::postfix()       { 
OrderAccess::acquire(); }
template<> inline void ScopedFenceGeneral<RELEASE_X>::prefix()        { 
OrderAccess::release(); }
template<> inline void ScopedFenceGeneral<RELEASE_X_FENCE>::prefix()  { 
OrderAccess::release(); }
template<> inline void ScopedFenceGeneral<RELEASE_X_FENCE>::postfix() { 
OrderAccess::fence();   }

For platforms that simply wish to override what e.g. acquire means for a 
joined ordered memory access in general, as different to calling stand 
alone acquire(), the semantics can be easily overridden for a platform 
such as windows like on windows:

template<> inline void ScopedFence<X_ACQUIRE>::postfix()       { }
template<> inline void ScopedFence<RELEASE_X>::prefix()        { }
template<> inline void ScopedFence<RELEASE_X_FENCE>::prefix()  { }
template<> inline void ScopedFence<RELEASE_X_FENCE>::postfix() { 
OrderAccess::fence(); }

In this example, since Windows (now) has a compiler barrier for acquire, 
but does not need it for joined accesses since volatile has stronger 
guarantees on windows, this is enough to specialize that for joined 
memory accesses, no extra protection is needed.

== Backward Compatibility and Transitioning ==

Since the newly proposed code is structured differently to before, a 
#define was added for backward compatibility so that external 
repositories not adhering to this new architecture do not break. 
Furthermore, store_release was declared private and marked as 
deprecated. This allows for a smooth transition into the new style of 
OrderAccess. When the full transition is made in all known repos, the 
#define and store_fence may be safely removed, eventually.

== Documentation ==

The documentation seems old and outdated, describing how it works on 
SPARC RMO and IA64, which are nowhere to be found in the repository. It 
also describes things about C++ volatiles which cannot be relied upon. 
The documentation has been cleaned up to match the current state of the 
implementation better, with architectures actually found in the repository.

== Testing ==

JPRT. Big thanks to Jesper Wilhelmsson for helping me test these changes.
Ran some DaCapo benchmarks (I know okay :p) for performance regression 
and there was no perceivable difference.

Looking forward to feedback on this, and hope to get some reviews. :)

Thanks,
Erik