RFR: 7143664: Clean up OrderAccess implementations and usage
Erik Österlund
erik.osterlund at lnu.se
Fri Jan 23 01:19:12 UTC 2015
Hi all,
== Summary of Changes ==
This changeset fixes issues in OrderAccess on multiple levels from the
memory model semantics to compiler reorderings, to addressing
maintainability/portability issues which (almost) had to be fixed in
order to fix the correctness issues. It is the result of discussions
found in the previous "OrderAccess Refactoring" thread:
http://openjdk.5641.n7.nabble.com/OrderAccess-Refactoring-td212050.html
Bug ID:
https://bugs.openjdk.java.net/browse/JDK-7143664
(updated to reflect these related changes)
Webrev:
http://cr.openjdk.java.net/~dholmes/7143664/webrev/
Before I describe more I would like to give special thanks to David
Holmes for long discussions leading up to the currently proposed
changes. I would also like to thank Jesper Wilhelmsson for helping me
run my changes through JPRT.
== Motivation ==
This change directly fixes a reported OrderAccess bug due to compiler
reorderings which is still a vulnerability on almost all TSO platforms:
https://bugs.openjdk.java.net/browse/JDK-806196
And directly fixes confusions like release_store() != release() store()
due to memory model issues previously described in this bug ID.
At the same time it provides clearer design with separation of concerns
and generalization/specialization, removing a whole bunch of platform
specific code which could be generalized. The platform specific files
now only have a few LoC requirements (~7) to conform to the memory model
by specifying what the stand alone barriers do. Then optionally
optimizations to the general approach are possible if platforms support
it. This also makes it much easier to port to new platforms.
== Memory Model ==
The current definitions of acquire/release semantics are a bit fishy
leading to problems previously described in the bug ID (release_store()
!= release() store()) and some other correctness issues. It has
therefore been replaced with a new model. I would like to thank David
Holmes for the long discussions leading up to the newly proposed model.
The new model is formally defined like this:
// T1: access_shared_data
// T1: ]release
// T1: (...)
// T1: store(X)
//
// T2: load(X)
// T2: (...)
// T2: acquire[
// T2: access_shared_data
//
// It is guaranteed that if T2: load(X) synchronizes with (observes the
// value written by) T1: store(X), then the memory accesses before the
// T1: ]release happen before the memory accesses after the T2: acquire[.
The orderAccess.hpp file and bug ID also has a few additional
explanations making it more intuitive to the user when to use
acquire/release and the resemblance to TSO abstract machines. Big thanks
goes to David Holmes for discussing the memory model with me, which
helped a lot in deriving it.
Now it holds that release() store() == release_store(), and the other
correctness issues are fixed as well.
The new model is also closer to C++11 definitions which could give us
more relaxed compiler reordering constraints in the future when compiler
support for C++11 is there and ready.
== Reliance on C++ Volatile Semantics ==
The C++ standard section 1.9 "Program Execution" is very vague about
what the keyword volatile can actually do for us. It gives clear
constraints in terms of volatile-volatile accesses but says little about
nonvolatile-volatile accesses. Yet the current implementation heavily
relies upon volatile to in terms of compiler reordering. But GCC
explicitly declares that volatiles and non-volatiles may reorder freely
( https://gcc.gnu.org/onlinedocs/gcc/Volatiles.html ). The only compiler
known to explicitly provide the wanted semantics with volatile is MSVC
2010 for windows (
https://msdn.microsoft.com/en-us/library/12a04hfd(v=vs.100).aspx ).
Compilers not giving explicit guarantees, must be considered unsafe and
revert to conservative behaviour.
This was brought to attention after causing bugs, but was only fixed for
x86 linux. This is a fundamental issue inherent to all TSO platforms
except windows, and has to be fixed on all of them.
Several barriers are unsafe to use because they lack compiler reordering
constraints (e.g. fence and acquire on linux_SPARC). For TSO platforms
they are typically implemented using dummy loads and stores. This seems
to be another old volatile reliance that I fixed. These barriers
sometimes have omitted compiler barriers (which is what we really want).
This seems to be another example on incorrect reliance on the volatile
semantics to help us. Therefore dummy loads/stores have been replaced
with compiler barriers on TSO platforms.
It is also worth noting that compilers like sun studio did previously
not support inline asm syntax. Therefore, barriers were implemented in
.il-files. However, using them does not give explicit compiler
constraints for reordering AFAIK. Therefore, they have been
reimplemented using inline asm with explicit compiler reordering
constraints, as even sun (solaris?) studio now supports this.
The windows variants have added a windows-style _ReadWriteBarrier()
compiler barrier similarly.
== Strange Hardware Reorderings ==
Fixed a weird inconsistency where acquire, loadstore and loadlaod would
use isync instead of lwsync for PPC on linux_zero, but not in any other
PPC platform in the repo. I assumed this is wrong and changed it to
lwsync instead.
== Code Redundancy and Refactoring ==
The OrderAccess code looks like it has been developed over a long period
of time, with small incremental changes. This seems to have led to a lot
of code duplication over time. For example, store_fence variants are not
referenced from anywhere, yet contribute to a lot of the code base and a
lot of awkwardness (such as being the one only exception not using
volatiles for some reason). Moreover, store_fence is not used anywhere
in hotspot because it is not a good fit for either the acquire/release
semantics or the Java volatile semantics, leaving a big question mark on
when it should ever be used. I took the liberty of removing it.
Another redundancy issue is that most of the semantics is exactly the
same for all platforms, yet all that default boilerplate such as how to
make atomic accesses, where acquire/release is supposed to be placed
w.r.t. the memory access, what the different barriers should do etc. is
copied in redundantly for each os_cpu and each type of memory access for
each os_cpu. This makes it extremely painful 1) to understand what
actually defines a certain platform compared to the defaults and 2) to
correct bugs like those discovered here 3) prevent silly mistakes and
bugs, by simply having a lot less code defining the behaviour of
OrderAccess that could go wrong.
A new architecture/design for OrderAccess is proposed, using a
generalization/specialization approach.
A general implementation in /share/ defines how things work and splits
into different concerns: 1) how to make an atomic memory access, 2)
where to but barriers w.r.t. the memory access for things like
load_acquire, release_store and release_store_fence, 3) how these
barriers are defined.
This allows a clear split between what is required for following the
specifications, and optimizations, which become much more readable and
only optimizations need to be reviewed in depth as the defaults can
always be trusted given correct standalone barriers.
The only thing a platform is required to specify, is what an
implementation of acquire(), release() and fence() should do. If they
are implemented properly, everything in OrderAccess is guaranteed to
work according to specification using the generalized code. This makes
it very easy to support new ports. ALL the other code in the os_cpu
files is used /only/ for optimization purposes offered for specific
configurations.
However, it is highly customizable so that specific platform can perform
any desired optimizations. For instance this load_acquire on PPC is
optimized:
template<> inline jbyte OrderAccess::specialized_load_acquire<jbyte>
(volatile jbyte* p) { register jbyte t = load(p);
inlasm_acquire_reg(t); return t; }
This overrides the whole load_acquire implementation to do something
custom. Platforms like x86 extensively use this for joined fencing
variants to optimize.
The default implementation of load_acquire() will make an atomic load()
followed by acquire() since the semantics is generalized. The
generalized semantics are defined using inlined postfix/prefix calls
after/before the atomic access, as defined here:
template<> inline void ScopedFenceGeneral<X_ACQUIRE>::postfix() {
OrderAccess::acquire(); }
template<> inline void ScopedFenceGeneral<RELEASE_X>::prefix() {
OrderAccess::release(); }
template<> inline void ScopedFenceGeneral<RELEASE_X_FENCE>::prefix() {
OrderAccess::release(); }
template<> inline void ScopedFenceGeneral<RELEASE_X_FENCE>::postfix() {
OrderAccess::fence(); }
For platforms that simply wish to override what e.g. acquire means for a
joined ordered memory access in general, as different to calling stand
alone acquire(), the semantics can be easily overridden for a platform
such as windows like on windows:
template<> inline void ScopedFence<X_ACQUIRE>::postfix() { }
template<> inline void ScopedFence<RELEASE_X>::prefix() { }
template<> inline void ScopedFence<RELEASE_X_FENCE>::prefix() { }
template<> inline void ScopedFence<RELEASE_X_FENCE>::postfix() {
OrderAccess::fence(); }
In this example, since Windows (now) has a compiler barrier for acquire,
but does not need it for joined accesses since volatile has stronger
guarantees on windows, this is enough to specialize that for joined
memory accesses, no extra protection is needed.
== Backward Compatibility and Transitioning ==
Since the newly proposed code is structured differently to before, a
#define was added for backward compatibility so that external
repositories not adhering to this new architecture do not break.
Furthermore, store_release was declared private and marked as
deprecated. This allows for a smooth transition into the new style of
OrderAccess. When the full transition is made in all known repos, the
#define and store_fence may be safely removed, eventually.
== Documentation ==
The documentation seems old and outdated, describing how it works on
SPARC RMO and IA64, which are nowhere to be found in the repository. It
also describes things about C++ volatiles which cannot be relied upon.
The documentation has been cleaned up to match the current state of the
implementation better, with architectures actually found in the repository.
== Testing ==
JPRT. Big thanks to Jesper Wilhelmsson for helping me test these changes.
Ran some DaCapo benchmarks (I know okay :p) for performance regression
and there was no perceivable difference.
Looking forward to feedback on this, and hope to get some reviews. :)
Thanks,
Erik
More information about the hotspot-dev
mailing list