RFR: 7143664: Clean up OrderAccess implementations and usage

Fri Feb 20 18:24:00 UTC 2015

Hi Karen,

On 20/02/15 17:04, Karen Kinnear wrote:
> Erik,
>
> I am so grateful you tackled this problem  I very much appreciate the additional potential
> stability with the fixes.
>
> Thank you to you and David for walking through the compiler and platform details, and
> for checking that the generated code here makes sense.

No problem! :)

> Comments/questions on webrev.v3 open
> 0. orderAccess.hpp
> Thank you for the updated comments and table - very helpful!
>
> 1. orderAccess.hpp
> Were you going to add a comment about MSVC assumptions under C++ Volatile Semantics?
>
> Is it worth documenting in source somewhere what minimal versions of compilers are assumed?
> Or which versions of compilers were tested? I think that would help us in future please.
> Maybe in the platform-specific files?

I did not intend to add this comment to shared. I thought that it would 
be enough to say that non-volatile to volatile constraints can not be 
trusted in general. Instead there is a comment explaining this anomaly 
in orderAccess_windows_x86.inline.hpp where it is exploited, since it is 
the only documented exception I know of.

If we start picking certain examples, it feels like we should rather 
have either 1) a comprehensive table of what volatile actually means for 
different compilers with references or question marks where we don't 
know (and hence need to be conservative assuming the worst), or 2) 
putting such comments in platform specific files (like MSVC currently).

Do you agree? In that case, which one do you prefer?

As for documenting compiler versions assumed, would it be enough to 
stamp the code with the currently used compilers we used now, even 
though it might actually have worked with earlier versions? The 
reasoning is that it's unlikely we will go back to using earlier 
compilers anyway, so finding out when exactly certain compilers started 
doing what we want them to could possibly be unnecessary, while still 
having enough information to allow checking if future compilers have 
provided improved functionality from this version of OrderAccess?

> 2. orderAccess_windows_x86.inline.hpp
> Could you possibly add to the comment about MSVC, that this assumes all of the Scoped templates explicitly cast the addresses to volatiles?
> (In case someone changes that some day?)

Sure! But is this not clear given the comment in 
orderAccess_windows_x86.inline.hpp where this is exploited next to the 
scope template specialization:

// Note that in MSVC, volatile memory accesses are explicitly
// guaranteed to have acquire release semantics (w.r.t. compiler
// reordering) and therefore does not even need a compiler barrier
// for normal acquire release accesses.
template<> inline void ScopedFence<X_ACQUIRE>::postfix()       { }
template<> inline void ScopedFence<RELEASE_X>::prefix()        { }
template<> inline void ScopedFence<RELEASE_X_FENCE>::prefix()  { }
template<> inline void ScopedFence<RELEASE_X_FENCE>::postfix() { 
OrderAccess::fence(); }

If you don't think this is clear enough, I would happily improve the 
description.

> 3. orderAccess_windows_x86.inline.hpp
> So I looked up _ReadWriteBarrier and VS2012 and VS2013 say this is deprecated.
> JDK8 builds on VS2010, so ok. JDK9 appears to be planning to move to VS2013, so we may need to change this.
> It appears that the recommendation is volatile (with default on x86 of /volatile:ms). Can we use __asm__volatile("" : : : "memory";
> compiler _barrier for VS2013 or is there an equivalent you know about?

Unfortunately I currently have no access to windows platforms so I can't 
experiment around. But I did some research when I looked into it and it 
seems like _ReadWriteBarrier is all we got for now until we adopt C++11 
std::atomic that Microsoft in the deprecation message on their website 
suggests we should use instead.

It seems to have been deprecated because programmers found it confusing 
- the name does not suggest it's a compiler-only ordering constraint. 
But that's fine for our purposes.

I just assumed that for now we probably don't want to make such a 
radical move to C++11, right? Since it was deprecation or C++11 I 
intentionally used _ReadWriteBarrier anyway even though it is 
deprecated. Is this a problem?

When we need to switch, I believe we will be forced to use C++11 on 
windows whether we like it or not. It has std::atomic_signal_fence which 
should work as a drop-in replacement for _ReadWriteBarrier AFAIK. But 
then again, we might as well just map all calls to std::atomic if we 
gotta use C++11 anyway...

> 4. orderAccess_windows_x86.inline.hpp, OrderAccess_bsd_x86.inline.hpp
> specialized_release_store_fence<jfloat> gets turned into a jint*?
> I have avoided FP all my life, so I had assumed that the old *p = v when declared jfloat*/jfloat would use floating point
> and the new one would cast to an int? Or does this work in the debugger? If so, a comment would be helpful.

Sorry I'm not sure I understand what you mean here. The code casts 
jfloat to jint in the same way that it always casted jdouble to jlong. 
Is this a problem?

> So - do you have plans to do additional changes to the OrderAccess logic? Beyond getting rid of the obsolete entries
> when all platforms are ok with it?

My biggest concern was correctness and I felt it really had to be done. 
I use OrderAccess for my own GC code a lot, and needed to tame it a bit 
to work, and thought I might as well contribute the fixes too. ;)

Having looked into the code a bit I know there is still room 
improvements in places if you would like me to look into it. Some examples:

<possibleImprovements>

1) x86_64 windows - I suspect this can be improved a /lot/. It seems 
like compilers were broken when the code was written at the very 
transition point to 64 bit. I believe this can be optimized to use 
compiler support I expect is in place today.

2) x86 on solaris: solaris studio was also crippled in the past with no 
inline assembly. I believe these optimizations we see on other platforms 
could now be ported to solaris as well.

3) _volatile variants of acquire/release/fence/membars: Now we are 
conservative, taking volatile to non-volatile reordering into account 
using compiler barriers. If the programmer knows this is not needed 
since all data published is volatile, maybe it could be good to have 
variants explicitly stating it won't be necessary.

4) Joining more store(); release(); pairs to store_release() since some 
platforms like ARM can then use stlr instead of full dmb fence which 
seems like a good thing in general anyway.

All previous suggestions revolve around performance (and I don't know 
how much we need it). But could also do other things.

Testing:

5) Testing code: it is possible using some light template 
metaprogramming to test if the expected specializations of OrderAccess 
are indeed used, a bit like @Override in Java - asserting that the 
generalized behaviour is indeed specialized as expected for all types we 
expect them to be specialized for.

Further refactoring:

6) Forwarding all atomic memory accesses to Atomic:: since it's really a 
separate concern. Atomic:: should know how to make atomic accesses for 
types used by OrderAccess, and OrderAccess should in shared code just 
forward it. Currently Atomic supports all stores but only load of jlong, 
making assumptions the rest can use volatile only - an assumption I'm 
not willing to make and spread around in the JVM.

</possibleImprovements>

The main problem though is that OrderAcces changes are often inherently 
platform specific, and I have very limited access as an outsider to the 
platforms.
This time, Jesper has been very kind to me, running JPRT for me which 
made these changes possible, and I'm very happy about that. But maybe I 
should do less platform specific things when I can't run JPRT. I don't 
want to bother people with compiling code for me too much, especially 
not if experimenting with what new compiler features have been added 
over the years. :)

> Looks like we have a couple of follow-up exercises:
> 1) check our is_MP usage to ensure we get the compiler_barriers in both paths if we are not using volatile declarations for
> all compiler ordering
> 2) try applying these templates to any other platforms to see how well they work for us as maintainers
> 3) figure out why we use StubRoutines::fence on amd64 Windows but on linux we just use lock; addl 0(sp), MSVC limitation?

Yes I could volunteer to do 2 (only open variants of course) and 3 if 
anyone would be kind to run JPRT for me and/or provide me with machines 
I can try things out on (which I suspect is impossible?).

> I did not review ppc, zero, arm.

I should mention Zero changes were reviewed by Severin Gehwolf in the 
zero-dev list. He did not have any problem with my changes.

Thanks for the review Karen! :)

/Erik

>
> thanks!
> Karen
> On Jan 22, 2015, at 8:19 PM, Erik Österlund wrote:
>
>> Hi all,
>>
>> == Summary of Changes ==
>>
>> This changeset fixes issues in OrderAccess on multiple levels from the
>> memory model semantics to compiler reorderings, to addressing
>> maintainability/portability issues which (almost) had to be fixed in
>> order to fix the correctness issues. It is the result of discussions
>> found in the previous "OrderAccess Refactoring" thread:
>> http://openjdk.5641.n7.nabble.com/OrderAccess-Refactoring-td212050.html
>>
>> Bug ID:
>> https://bugs.openjdk.java.net/browse/JDK-7143664
>> (updated to reflect these related changes)
>>
>> Webrev:
>> http://cr.openjdk.java.net/~dholmes/7143664/webrev/
>>
>> Before I describe more I would like to give special thanks to David
>> Holmes for long discussions leading up to the currently proposed
>> changes. I would also like to thank Jesper Wilhelmsson for helping me
>> run my changes through JPRT.
>>
>> == Motivation ==
>>
>> This change directly fixes a reported OrderAccess bug due to compiler
>> reorderings which is still a vulnerability on almost all TSO platforms:
>> https://bugs.openjdk.java.net/browse/JDK-806196
>>
>> And directly fixes confusions like release_store() != release() store()
>> due to memory model issues previously described in this bug ID.
>>
>> At the same time it provides clearer design with separation of concerns
>> and generalization/specialization, removing a whole bunch of platform
>> specific code which could be generalized. The platform specific files
>> now only have a few LoC requirements (~7) to conform to the memory model
>> by specifying what the stand alone barriers do. Then optionally
>> optimizations to the general approach are possible if platforms support
>> it. This also makes it much easier to port to new platforms.
>>
>> == Memory Model ==
>>
>> The current definitions of acquire/release semantics are a bit fishy
>> leading to problems previously described in the bug ID (release_store()
>> != release() store()) and some other correctness issues. It has
>> therefore been replaced with a new model. I would like to thank David
>> Holmes for the long discussions leading up to the newly proposed model.
>>
>> The new model is formally defined like this:
>>
>> // T1: access_shared_data
>> // T1: ]release
>> // T1: (...)
>> // T1: store(X)
>> //
>> // T2: load(X)
>> // T2: (...)
>> // T2: acquire[
>> // T2: access_shared_data
>> //
>> // It is guaranteed that if T2: load(X) synchronizes with (observes the
>> // value written by) T1: store(X), then the memory accesses before the
>> // T1: ]release happen before the memory accesses after the T2: acquire[.
>>
>> The orderAccess.hpp file and bug ID also has a few additional
>> explanations making it more intuitive to the user when to use
>> acquire/release and the resemblance to TSO abstract machines. Big thanks
>> goes to David Holmes for discussing the memory model with me, which
>> helped a lot in deriving it.
>>
>> Now it holds that release() store() == release_store(), and the other
>> correctness issues are fixed as well.
>>
>> The new model is also closer to C++11 definitions which could give us
>> more relaxed compiler reordering constraints in the future when compiler
>> support for C++11 is there and ready.
>>
>> == Reliance on C++ Volatile Semantics ==
>>
>> The C++ standard section 1.9 "Program Execution" is very vague about
>> what the keyword volatile can actually do for us. It gives clear
>> constraints in terms of volatile-volatile accesses but says little about
>> nonvolatile-volatile accesses. Yet the current implementation heavily
>> relies upon volatile to in terms of compiler reordering. But GCC
>> explicitly declares that volatiles and non-volatiles may reorder freely
>> ( https://gcc.gnu.org/onlinedocs/gcc/Volatiles.html ). The only compiler
>> known to explicitly provide the wanted semantics with volatile is MSVC
>> 2010 for windows (
>> https://msdn.microsoft.com/en-us/library/12a04hfd(v=vs.100).aspx ).
>> Compilers not giving explicit guarantees, must be considered unsafe and
>> revert to conservative behaviour.
>>
>> This was brought to attention after causing bugs, but was only fixed for
>> x86 linux. This is a fundamental issue inherent to all TSO platforms
>> except windows, and has to be fixed on all of them.
>>
>> Several barriers are unsafe to use because they lack compiler reordering
>> constraints (e.g. fence and acquire on linux_SPARC). For TSO platforms
>> they are typically implemented using dummy loads and stores. This seems
>> to be another old volatile reliance that I fixed. These barriers
>> sometimes have omitted compiler barriers (which is what we really want).
>> This seems to be another example on incorrect reliance on the volatile
>> semantics to help us. Therefore dummy loads/stores have been replaced
>> with compiler barriers on TSO platforms.
>>
>> It is also worth noting that compilers like sun studio did previously
>> not support inline asm syntax. Therefore, barriers were implemented in
>> .il-files. However, using them does not give explicit compiler
>> constraints for reordering AFAIK. Therefore, they have been
>> reimplemented using inline asm with explicit compiler reordering
>> constraints, as even sun (solaris?) studio now supports this.
>>
>> The windows variants have added a windows-style _ReadWriteBarrier()
>> compiler barrier similarly.
>>
>> == Strange Hardware Reorderings ==
>>
>> Fixed a weird inconsistency where acquire, loadstore and loadlaod would
>> use isync instead of lwsync for PPC on linux_zero, but not in any other
>> PPC platform in the repo. I assumed this is wrong and changed it to
>> lwsync instead.
>>
>> == Code Redundancy and Refactoring ==
>>
>> The OrderAccess code looks like it has been developed over a long period
>> of time, with small incremental changes. This seems to have led to a lot
>> of code duplication over time. For example, store_fence variants are not
>> referenced from anywhere, yet contribute to a lot of the code base and a
>> lot of awkwardness (such as being the one only exception not using
>> volatiles for some reason). Moreover, store_fence is not used anywhere
>> in hotspot because it is not a good fit for either the acquire/release
>> semantics or the Java volatile semantics, leaving a big question mark on
>> when it should ever be used. I took the liberty of removing it.
>>
>> Another redundancy issue is that most of the semantics is exactly the
>> same for all platforms, yet all that default boilerplate such as how to
>> make atomic accesses, where acquire/release is supposed to be placed
>> w.r.t. the memory access, what the different barriers should do etc. is
>> copied in redundantly for each os_cpu and each type of memory access for
>> each os_cpu. This makes it extremely painful 1) to understand what
>> actually defines a certain platform compared to the defaults and 2) to
>> correct bugs like those discovered here 3) prevent silly mistakes and
>> bugs, by simply having a lot less code defining the behaviour of
>> OrderAccess that could go wrong.
>>
>> A new architecture/design for OrderAccess is proposed, using a
>> generalization/specialization approach.
>>
>> A general implementation in /share/ defines how things work and splits
>> into different concerns: 1) how to make an atomic memory access, 2)
>> where to but barriers w.r.t. the memory access for things like
>> load_acquire, release_store and release_store_fence, 3) how these
>> barriers are defined.
>>
>> This allows a clear split between what is required for following the
>> specifications, and optimizations, which become much more readable and
>> only optimizations need to be reviewed in depth as the defaults can
>> always be trusted given correct standalone barriers.
>>
>> The only thing a platform is required to specify, is what an
>> implementation of acquire(), release() and fence() should do. If they
>> are implemented properly, everything in OrderAccess is guaranteed to
>> work according to specification using the generalized code. This makes
>> it very easy to support new ports. ALL the other code in the os_cpu
>> files is used /only/ for optimization purposes offered for specific
>> configurations.
>>
>> However, it is highly customizable so that specific platform can perform
>> any desired optimizations. For instance this load_acquire on PPC is
>> optimized:
>>
>> template<> inline jbyte  OrderAccess::specialized_load_acquire<jbyte>
>> (volatile jbyte*  p) { register jbyte t = load(p);
>> inlasm_acquire_reg(t); return t; }
>>
>> This overrides the whole load_acquire implementation to do something
>> custom. Platforms like x86 extensively use this for joined fencing
>> variants to optimize.
>>
>> The default implementation of load_acquire() will make an atomic load()
>> followed by acquire() since the semantics is generalized. The
>> generalized semantics are defined using inlined postfix/prefix calls
>> after/before the atomic access, as defined here:
>>
>> template<> inline void ScopedFenceGeneral<X_ACQUIRE>::postfix()       {
>> OrderAccess::acquire(); }
>> template<> inline void ScopedFenceGeneral<RELEASE_X>::prefix()        {
>> OrderAccess::release(); }
>> template<> inline void ScopedFenceGeneral<RELEASE_X_FENCE>::prefix()  {
>> OrderAccess::release(); }
>> template<> inline void ScopedFenceGeneral<RELEASE_X_FENCE>::postfix() {
>> OrderAccess::fence();   }
>>
>> For platforms that simply wish to override what e.g. acquire means for a
>> joined ordered memory access in general, as different to calling stand
>> alone acquire(), the semantics can be easily overridden for a platform
>> such as windows like on windows:
>>
>> template<> inline void ScopedFence<X_ACQUIRE>::postfix()       { }
>> template<> inline void ScopedFence<RELEASE_X>::prefix()        { }
>> template<> inline void ScopedFence<RELEASE_X_FENCE>::prefix()  { }
>> template<> inline void ScopedFence<RELEASE_X_FENCE>::postfix() {
>> OrderAccess::fence(); }
>>
>> In this example, since Windows (now) has a compiler barrier for acquire,
>> but does not need it for joined accesses since volatile has stronger
>> guarantees on windows, this is enough to specialize that for joined
>> memory accesses, no extra protection is needed.
>>
>> == Backward Compatibility and Transitioning ==
>>
>> Since the newly proposed code is structured differently to before, a
>> #define was added for backward compatibility so that external
>> repositories not adhering to this new architecture do not break.
>> Furthermore, store_release was declared private and marked as
>> deprecated. This allows for a smooth transition into the new style of
>> OrderAccess. When the full transition is made in all known repos, the
>> #define and store_fence may be safely removed, eventually.
>>
>> == Documentation ==
>>
>> The documentation seems old and outdated, describing how it works on
>> SPARC RMO and IA64, which are nowhere to be found in the repository. It
>> also describes things about C++ volatiles which cannot be relied upon.
>> The documentation has been cleaned up to match the current state of the
>> implementation better, with architectures actually found in the repository.
>>
>> == Testing ==
>>
>> JPRT. Big thanks to Jesper Wilhelmsson for helping me test these changes.
>> Ran some DaCapo benchmarks (I know okay :p) for performance regression
>> and there was no perceivable difference.
>>
>> Looking forward to feedback on this, and hope to get some reviews. :)
>>
>> Thanks,
>> Erik
>
>