RFR: 7143664: Clean up OrderAccess implementations and usage

Tue Feb 24 20:44:20 UTC 2015

Thanks for the review! Do you happen to know the compiler versions used 
to try this so I can document the platform specific files with that 
information?

/Erik

On 24/02/15 15:40, Lindenmaier, Goetz wrote:
> Hi,
>
> we had a look at the changes and are fine with them.
> I tested the build, it works on aix and linux, and simple tests
> are fine.  I'll run the patch tonight with our nightly tests, but I don't
> expect any problems.
>
> Thanks for pinging us!
>
> Best regards,
>    Goetz.
>
> -----Original Message-----
> From: hotspot-dev [mailto:hotspot-dev-bounces at openjdk.java.net] On Behalf Of David Holmes
> Sent: Dienstag, 24. Februar 2015 03:31
> To: Karen Kinnear; Erik Österlund
> Cc: hotspot-dev developers
> Subject: Re: RFR: 7143664: Clean up OrderAccess implementations and usage
>
> On 24/02/2015 4:45 AM, Karen Kinnear wrote:
>>
>> On Feb 20, 2015, at 1:24 PM, Erik Österlund wrote:
>>
>> And truly many thanks - in my mind you have made this much more conceptually clean and safer.
>> I am good with your checking in the parts I have reviewed.
>
> Thanks Karen! So unless someone has something they really want to be
> changed now (and I think everyone has so far said there are no strOng
> objections to anything) then I will push this version:
>
> http://cr.openjdk.java.net/~dholmes/7143664/webrev.v4/
>
> in a couple of days.
>
> Oh wait - we still haven't heard anything from ppc64 folk - pinging
> Volker (cc'd :) ).
>
> Thanks,
> David
>
>
>>> Hi Karen,
>>>
>>> On 20/02/15 17:04, Karen Kinnear wrote:
>>>> Erik,
>>>>
>>>> 1. orderAccess.hpp
>>>> Were you going to add a comment about MSVC assumptions under C++ Volatile Semantics?
>>>>
>>>> Is it worth documenting in source somewhere what minimal versions of compilers are assumed?
>>>> Or which versions of compilers were tested? I think that would help us in future please.
>>>> Maybe in the platform-specific files?
>>>
>>> I did not intend to add this comment to shared. I thought that it would
>>> be enough to say that non-volatile to volatile constraints can not be
>>> trusted in general. Instead there is a comment explaining this anomaly
>>> in orderAccess_windows_x86.inline.hpp where it is exploited, since it is
>>> the only documented exception I know of.
>>>
>>> If we start picking certain examples, it feels like we should rather
>>> have either 1) a comprehensive table of what volatile actually means for
>>> different compilers with references or question marks where we don't
>>> know (and hence need to be conservative assuming the worst), or 2)
>>> putting such comments in platform specific files (like MSVC currently).
>>>
>>> Do you agree? In that case, which one do you prefer?
>>>
>>> As for documenting compiler versions assumed, would it be enough to
>>> stamp the code with the currently used compilers we used now, even
>>> though it might actually have worked with earlier versions? The
>>> reasoning is that it's unlikely we will go back to using earlier
>>> compilers anyway, so finding out when exactly certain compilers started
>>> doing what we want them to could possibly be unnecessary, while still
>>> having enough information to allow checking if future compilers have
>>> provided improved functionality from this version of OrderAccess?
>> It would make more sense to document for the individual platforms.
>> I'm trying to prevent future "what were they thinking" questions by recording
>> for each platform the specific compiler version tested. Not all versions that
>> should work, just the ones we tested it with.
>>>
>>>> 2. orderAccess_windows_x86.inline.hpp
>>>> Could you possibly add to the comment about MSVC, that this assumes all of the Scoped templates explicitly cast the addresses to volatiles?
>>>> (In case someone changes that some day?)
>>>
>>> Sure! But is this not clear given the comment in
>>> orderAccess_windows_x86.inline.hpp where this is exploited next to the
>>> scope template specialization:
>>>
>>> // Note that in MSVC, volatile memory accesses are explicitly
>>> // guaranteed to have acquire release semantics (w.r.t. compiler
>>> // reordering) and therefore does not even need a compiler barrier
>>> // for normal acquire release accesses.
>>> template<> inline void ScopedFence<X_ACQUIRE>::postfix()       { }
>>> template<> inline void ScopedFence<RELEASE_X>::prefix()        { }
>>> template<> inline void ScopedFence<RELEASE_X_FENCE>::prefix()  { }
>>> template<> inline void ScopedFence<RELEASE_X_FENCE>::postfix() {
>>> OrderAccess::fence(); }
>>>
>>> If you don't think this is clear enough, I would happily improve the
>>> description.
>> Perhaps I am being extra cautious since the very original set of calls all assumed the passed in arguments were
>> already volatile, I wanted to save us this problem in 10 years by stating that this assumes that the Scoped templates explicitly
>> cast the addresses to volatiles  (which is different than the caller passing in volatiles). So to me there were two ways to get there
>> and I thought it would be valuable to specify this is the assumption. Not a big deal.
>>>
>>>> 3. orderAccess_windows_x86.inline.hpp
>>>> So I looked up _ReadWriteBarrier and VS2012 and VS2013 say this is deprecated.
>>>> JDK8 builds on VS2010, so ok. JDK9 appears to be planning to move to VS2013, so we may need to change this.
>>>> It appears that the recommendation is volatile (with default on x86 of /volatile:ms). Can we use __asm__volatile("" : : : "memory";
>>>> compiler _barrier for VS2013 or is there an equivalent you know about?
>>>
>>> Unfortunately I currently have no access to windows platforms so I can't
>>> experiment around. But I did some research when I looked into it and it
>>> seems like _ReadWriteBarrier is all we got for now until we adopt C++11
>>> std::atomic that Microsoft in the deprecation message on their website
>>> suggests we should use instead.
>>>
>>> It seems to have been deprecated because programmers found it confusing
>>> - the name does not suggest it's a compiler-only ordering constraint.
>>> But that's fine for our purposes.
>>>
>>> I just assumed that for now we probably don't want to make such a
>>> radical move to C++11, right? Since it was deprecation or C++11 I
>>> intentionally used _ReadWriteBarrier anyway even though it is
>>> deprecated. Is this a problem?
>>>
>>> When we need to switch, I believe we will be forced to use C++11 on
>>> windows whether we like it or not. It has std::atomic_signal_fence which
>>> should work as a drop-in replacement for _ReadWriteBarrier AFAIK. But
>>> then again, we might as well just map all calls to std::atomic if we
>>> gotta use C++11 anyway...
>> Thank you. Looks like an area we will need to follow-up on.
>> The C++ documentation says atomic_signal_fence just does atomic reordering.
>> The Visual Studio 2012 documentation is not clear on whether it also adds hardware fences.
>>
>> So - go ahead and leave this as is and we will need to switch later.
>>>
>>>> So - do you have plans to do additional changes to the OrderAccess logic? Beyond getting rid of the obsolete entries
>>>> when all platforms are ok with it?
>>>
>>> My biggest concern was correctness and I felt it really had to be done.
>>> I use OrderAccess for my own GC code a lot, and needed to tame it a bit
>>> to work, and thought I might as well contribute the fixes too. ;)
>>
>> Thank you.
>>>
>>> Having looked into the code a bit I know there is still room
>>> improvements in places if you would like me to look into it. Some examples:
>>>
>>> <possibleImprovements>
>>>
>>> 1) x86_64 windows - I suspect this can be improved a /lot/. It seems
>>> like compilers were broken when the code was written at the very
>>> transition point to 64 bit. I believe this can be optimized to use
>>> compiler support I expect is in place today.
>> Yes.
>>>
>>> 2) x86 on solaris: solaris studio was also crippled in the past with no
>>> inline assembly. I believe these optimizations we see on other platforms
>>> could now be ported to solaris as well.
>> Yes. yay!
>>>
>>> 3) _volatile variants of acquire/release/fence/membars: Now we are
>>> conservative, taking volatile to non-volatile reordering into account
>>> using compiler barriers. If the programmer knows this is not needed
>>> since all data published is volatile, maybe it could be good to have
>>> variants explicitly stating it won't be necessary.
>> Yes.
>>>
>>> 4) Joining more store(); release(); pairs to store_release() since some
>>> platforms like ARM can then use stlr instead of full dmb fence which
>>> seems like a good thing in general anyway.
>>>
>>> All previous suggestions revolve around performance (and I don't know
>>> how much we need it). But could also do other things.
>>>
>>> Testing:
>>>
>>> 5) Testing code: it is possible using some light template
>>> metaprogramming to test if the expected specializations of OrderAccess
>>> are indeed used, a bit like @Override in Java - asserting that the
>>> generalized behaviour is indeed specialized as expected for all types we
>>> expect them to be specialized for.
>>>
>>> Further refactoring:
>>>
>>> 6) Forwarding all atomic memory accesses to Atomic:: since it's really a
>>> separate concern. Atomic:: should know how to make atomic accesses for
>>> types used by OrderAccess, and OrderAccess should in shared code just
>>> forward it. Currently Atomic supports all stores but only load of jlong,
>>> making assumptions the rest can use volatile only - an assumption I'm
>>> not willing to make and spread around in the JVM.
>> I share your concerns.
>>>
>>> </possibleImprovements>
>>>
>>> The main problem though is that OrderAcces changes are often inherently
>>> platform specific, and I have very limited access as an outsider to the
>>> platforms.
>>> This time, Jesper has been very kind to me, running JPRT for me which
>>> made these changes possible, and I'm very happy about that. But maybe I
>>> should do less platform specific things when I can't run JPRT. I don't
>>> want to bother people with compiling code for me too much, especially
>>> not if experimenting with what new compiler features have been added
>>> over the years. :)
>>>
>>>> Looks like we have a couple of follow-up exercises:
>>>> 1) check our is_MP usage to ensure we get the compiler_barriers in both paths if we are not using volatile declarations for
>>>> all compiler ordering
>>>> 2) try applying these templates to any other platforms to see how well they work for us as maintainers
>>>> 3) figure out why we use StubRoutines::fence on amd64 Windows but on linux we just use lock; addl 0(sp), MSVC limitation?
>>>
>>> Yes I could volunteer to do 2 (only open variants of course) and 3 if
>>> anyone would be kind to run JPRT for me and/or provide me with machines
>>> I can try things out on (which I suspect is impossible?).
>> I wasn't asking you to do this work - I was suggesting work we should do. No worries.
>>>
>>>> I did not review ppc, zero, arm.
>>>
>>> I should mention Zero changes were reviewed by Severin Gehwolf in the
>>> zero-dev list. He did not have any problem with my changes.
>> Glad to hear that. Thank you.
>>>
>>> Thanks for the review Karen! :)
>>>
>>> /Erik
>>
>>
>> thanks,
>> Karen
>>>
>>>>
>>>> thanks!
>>>> Karen
>>>> On Jan 22, 2015, at 8:19 PM, Erik Österlund wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> == Summary of Changes ==
>>>>>
>>>>> This changeset fixes issues in OrderAccess on multiple levels from the
>>>>> memory model semantics to compiler reorderings, to addressing
>>>>> maintainability/portability issues which (almost) had to be fixed in
>>>>> order to fix the correctness issues. It is the result of discussions
>>>>> found in the previous "OrderAccess Refactoring" thread:
>>>>> http://openjdk.5641.n7.nabble.com/OrderAccess-Refactoring-td212050.html
>>>>>
>>>>> Bug ID:
>>>>> https://bugs.openjdk.java.net/browse/JDK-7143664
>>>>> (updated to reflect these related changes)
>>>>>
>>>>> Webrev:
>>>>> http://cr.openjdk.java.net/~dholmes/7143664/webrev/
>>>>>
>>>>> Before I describe more I would like to give special thanks to David
>>>>> Holmes for long discussions leading up to the currently proposed
>>>>> changes. I would also like to thank Jesper Wilhelmsson for helping me
>>>>> run my changes through JPRT.
>>>>>
>>>>> == Motivation ==
>>>>>
>>>>> This change directly fixes a reported OrderAccess bug due to compiler
>>>>> reorderings which is still a vulnerability on almost all TSO platforms:
>>>>> https://bugs.openjdk.java.net/browse/JDK-806196
>>>>>
>>>>> And directly fixes confusions like release_store() != release() store()
>>>>> due to memory model issues previously described in this bug ID.
>>>>>
>>>>> At the same time it provides clearer design with separation of concerns
>>>>> and generalization/specialization, removing a whole bunch of platform
>>>>> specific code which could be generalized. The platform specific files
>>>>> now only have a few LoC requirements (~7) to conform to the memory model
>>>>> by specifying what the stand alone barriers do. Then optionally
>>>>> optimizations to the general approach are possible if platforms support
>>>>> it. This also makes it much easier to port to new platforms.
>>>>>
>>>>> == Memory Model ==
>>>>>
>>>>> The current definitions of acquire/release semantics are a bit fishy
>>>>> leading to problems previously described in the bug ID (release_store()
>>>>> != release() store()) and some other correctness issues. It has
>>>>> therefore been replaced with a new model. I would like to thank David
>>>>> Holmes for the long discussions leading up to the newly proposed model.
>>>>>
>>>>> The new model is formally defined like this:
>>>>>
>>>>> // T1: access_shared_data
>>>>> // T1: ]release
>>>>> // T1: (...)
>>>>> // T1: store(X)
>>>>> //
>>>>> // T2: load(X)
>>>>> // T2: (...)
>>>>> // T2: acquire[
>>>>> // T2: access_shared_data
>>>>> //
>>>>> // It is guaranteed that if T2: load(X) synchronizes with (observes the
>>>>> // value written by) T1: store(X), then the memory accesses before the
>>>>> // T1: ]release happen before the memory accesses after the T2: acquire[.
>>>>>
>>>>> The orderAccess.hpp file and bug ID also has a few additional
>>>>> explanations making it more intuitive to the user when to use
>>>>> acquire/release and the resemblance to TSO abstract machines. Big thanks
>>>>> goes to David Holmes for discussing the memory model with me, which
>>>>> helped a lot in deriving it.
>>>>>
>>>>> Now it holds that release() store() == release_store(), and the other
>>>>> correctness issues are fixed as well.
>>>>>
>>>>> The new model is also closer to C++11 definitions which could give us
>>>>> more relaxed compiler reordering constraints in the future when compiler
>>>>> support for C++11 is there and ready.
>>>>>
>>>>> == Reliance on C++ Volatile Semantics ==
>>>>>
>>>>> The C++ standard section 1.9 "Program Execution" is very vague about
>>>>> what the keyword volatile can actually do for us. It gives clear
>>>>> constraints in terms of volatile-volatile accesses but says little about
>>>>> nonvolatile-volatile accesses. Yet the current implementation heavily
>>>>> relies upon volatile to in terms of compiler reordering. But GCC
>>>>> explicitly declares that volatiles and non-volatiles may reorder freely
>>>>> ( https://gcc.gnu.org/onlinedocs/gcc/Volatiles.html ). The only compiler
>>>>> known to explicitly provide the wanted semantics with volatile is MSVC
>>>>> 2010 for windows (
>>>>> https://msdn.microsoft.com/en-us/library/12a04hfd(v=vs.100).aspx ).
>>>>> Compilers not giving explicit guarantees, must be considered unsafe and
>>>>> revert to conservative behaviour.
>>>>>
>>>>> This was brought to attention after causing bugs, but was only fixed for
>>>>> x86 linux. This is a fundamental issue inherent to all TSO platforms
>>>>> except windows, and has to be fixed on all of them.
>>>>>
>>>>> Several barriers are unsafe to use because they lack compiler reordering
>>>>> constraints (e.g. fence and acquire on linux_SPARC). For TSO platforms
>>>>> they are typically implemented using dummy loads and stores. This seems
>>>>> to be another old volatile reliance that I fixed. These barriers
>>>>> sometimes have omitted compiler barriers (which is what we really want).
>>>>> This seems to be another example on incorrect reliance on the volatile
>>>>> semantics to help us. Therefore dummy loads/stores have been replaced
>>>>> with compiler barriers on TSO platforms.
>>>>>
>>>>> It is also worth noting that compilers like sun studio did previously
>>>>> not support inline asm syntax. Therefore, barriers were implemented in
>>>>> .il-files. However, using them does not give explicit compiler
>>>>> constraints for reordering AFAIK. Therefore, they have been
>>>>> reimplemented using inline asm with explicit compiler reordering
>>>>> constraints, as even sun (solaris?) studio now supports this.
>>>>>
>>>>> The windows variants have added a windows-style _ReadWriteBarrier()
>>>>> compiler barrier similarly.
>>>>>
>>>>> == Strange Hardware Reorderings ==
>>>>>
>>>>> Fixed a weird inconsistency where acquire, loadstore and loadlaod would
>>>>> use isync instead of lwsync for PPC on linux_zero, but not in any other
>>>>> PPC platform in the repo. I assumed this is wrong and changed it to
>>>>> lwsync instead.
>>>>>
>>>>> == Code Redundancy and Refactoring ==
>>>>>
>>>>> The OrderAccess code looks like it has been developed over a long period
>>>>> of time, with small incremental changes. This seems to have led to a lot
>>>>> of code duplication over time. For example, store_fence variants are not
>>>>> referenced from anywhere, yet contribute to a lot of the code base and a
>>>>> lot of awkwardness (such as being the one only exception not using
>>>>> volatiles for some reason). Moreover, store_fence is not used anywhere
>>>>> in hotspot because it is not a good fit for either the acquire/release
>>>>> semantics or the Java volatile semantics, leaving a big question mark on
>>>>> when it should ever be used. I took the liberty of removing it.
>>>>>
>>>>> Another redundancy issue is that most of the semantics is exactly the
>>>>> same for all platforms, yet all that default boilerplate such as how to
>>>>> make atomic accesses, where acquire/release is supposed to be placed
>>>>> w.r.t. the memory access, what the different barriers should do etc. is
>>>>> copied in redundantly for each os_cpu and each type of memory access for
>>>>> each os_cpu. This makes it extremely painful 1) to understand what
>>>>> actually defines a certain platform compared to the defaults and 2) to
>>>>> correct bugs like those discovered here 3) prevent silly mistakes and
>>>>> bugs, by simply having a lot less code defining the behaviour of
>>>>> OrderAccess that could go wrong.
>>>>>
>>>>> A new architecture/design for OrderAccess is proposed, using a
>>>>> generalization/specialization approach.
>>>>>
>>>>> A general implementation in /share/ defines how things work and splits
>>>>> into different concerns: 1) how to make an atomic memory access, 2)
>>>>> where to but barriers w.r.t. the memory access for things like
>>>>> load_acquire, release_store and release_store_fence, 3) how these
>>>>> barriers are defined.
>>>>>
>>>>> This allows a clear split between what is required for following the
>>>>> specifications, and optimizations, which become much more readable and
>>>>> only optimizations need to be reviewed in depth as the defaults can
>>>>> always be trusted given correct standalone barriers.
>>>>>
>>>>> The only thing a platform is required to specify, is what an
>>>>> implementation of acquire(), release() and fence() should do. If they
>>>>> are implemented properly, everything in OrderAccess is guaranteed to
>>>>> work according to specification using the generalized code. This makes
>>>>> it very easy to support new ports. ALL the other code in the os_cpu
>>>>> files is used /only/ for optimization purposes offered for specific
>>>>> configurations.
>>>>>
>>>>> However, it is highly customizable so that specific platform can perform
>>>>> any desired optimizations. For instance this load_acquire on PPC is
>>>>> optimized:
>>>>>
>>>>> template<> inline jbyte  OrderAccess::specialized_load_acquire<jbyte>
>>>>> (volatile jbyte*  p) { register jbyte t = load(p);
>>>>> inlasm_acquire_reg(t); return t; }
>>>>>
>>>>> This overrides the whole load_acquire implementation to do something
>>>>> custom. Platforms like x86 extensively use this for joined fencing
>>>>> variants to optimize.
>>>>>
>>>>> The default implementation of load_acquire() will make an atomic load()
>>>>> followed by acquire() since the semantics is generalized. The
>>>>> generalized semantics are defined using inlined postfix/prefix calls
>>>>> after/before the atomic access, as defined here:
>>>>>
>>>>> template<> inline void ScopedFenceGeneral<X_ACQUIRE>::postfix()       {
>>>>> OrderAccess::acquire(); }
>>>>> template<> inline void ScopedFenceGeneral<RELEASE_X>::prefix()        {
>>>>> OrderAccess::release(); }
>>>>> template<> inline void ScopedFenceGeneral<RELEASE_X_FENCE>::prefix()  {
>>>>> OrderAccess::release(); }
>>>>> template<> inline void ScopedFenceGeneral<RELEASE_X_FENCE>::postfix() {
>>>>> OrderAccess::fence();   }
>>>>>
>>>>> For platforms that simply wish to override what e.g. acquire means for a
>>>>> joined ordered memory access in general, as different to calling stand
>>>>> alone acquire(), the semantics can be easily overridden for a platform
>>>>> such as windows like on windows:
>>>>>
>>>>> template<> inline void ScopedFence<X_ACQUIRE>::postfix()       { }
>>>>> template<> inline void ScopedFence<RELEASE_X>::prefix()        { }
>>>>> template<> inline void ScopedFence<RELEASE_X_FENCE>::prefix()  { }
>>>>> template<> inline void ScopedFence<RELEASE_X_FENCE>::postfix() {
>>>>> OrderAccess::fence(); }
>>>>>
>>>>> In this example, since Windows (now) has a compiler barrier for acquire,
>>>>> but does not need it for joined accesses since volatile has stronger
>>>>> guarantees on windows, this is enough to specialize that for joined
>>>>> memory accesses, no extra protection is needed.
>>>>>
>>>>> == Backward Compatibility and Transitioning ==
>>>>>
>>>>> Since the newly proposed code is structured differently to before, a
>>>>> #define was added for backward compatibility so that external
>>>>> repositories not adhering to this new architecture do not break.
>>>>> Furthermore, store_release was declared private and marked as
>>>>> deprecated. This allows for a smooth transition into the new style of
>>>>> OrderAccess. When the full transition is made in all known repos, the
>>>>> #define and store_fence may be safely removed, eventually.
>>>>>
>>>>> == Documentation ==
>>>>>
>>>>> The documentation seems old and outdated, describing how it works on
>>>>> SPARC RMO and IA64, which are nowhere to be found in the repository. It
>>>>> also describes things about C++ volatiles which cannot be relied upon.
>>>>> The documentation has been cleaned up to match the current state of the
>>>>> implementation better, with architectures actually found in the repository.
>>>>>
>>>>> == Testing ==
>>>>>
>>>>> JPRT. Big thanks to Jesper Wilhelmsson for helping me test these changes.
>>>>> Ran some DaCapo benchmarks (I know okay :p) for performance regression
>>>>> and there was no perceivable difference.
>>>>>
>>>>> Looking forward to feedback on this, and hope to get some reviews. :)
>>>>>
>>>>> Thanks,
>>>>> Erik
>>>>
>>>>
>>
>