RFR: 8181085: Race condition in method resolution may produce spurious NullPointerException
Erik Osterlund
erik.osterlund at oracle.com
Mon May 29 17:56:22 UTC 2017
Hi Volker,
Thank you for filling in more compiler info.
If the choice is between providing a new compiler barrier interface and defining its semantics versus using existing volatile semantics, then volatile semantics seems better to me.
Also, my new Access API allows you to do BasicAccess<MO_VOLATILE>::load_oop(addr) to perform load_heap_oop and load_decode_heap_oop with volatile semantics. It sounds like that would help here.
Thanks,
/Erik
> On 29 May 2017, at 19:02, Volker Simonis <volker.simonis at gmail.com> wrote:
>
> Hi Erik,
>
> thanks for the nice summary. Just for the sake of completeness, here's
> the corresponding documentation for the xlc compiler barrier [1]. It
> kind of implements the gcc syntax, but the wording is slightly
> different:
>
> "Add memory to the list of clobbered registers if assembler
> instructions can change a memory location in an unpredictable fashion.
> The memory clobber ensures that the data used after the completion of
> the assembly statement is valid and synchronized.
> However, the memory clobber can result in many unnecessary reloads,
> reducing the benefits of hardware prefetching. Thus, the memory
> clobber can impose a performance penalty and should be used with
> caution."
>
> We haven't used it until now, so I cannot say whether it really does
> what it is supposed to do. I'm also concerned about the performance
> warning. It seems like the "unnecessary reloads" can really hurt on
> architectures like ppc, which have many more registers than x86.
> Declaring a memory location 'volatile' seems much simpler and more
> lightweight as a way to achieve the desired effect. So I tend to
> agree with you and David that we should proceed to mark things with
> 'volatile'.
>
> Sorry for constantly "spamming" this thread with another problem (i.e.
> JDK-8129440 [2]) but I still think that it is related and important.
> In its current state, the way "load_heap_oop()" and its callers
> work is broken. And this is not because of a problem in
> OrderAccess, but because of missing compiler barriers:
>
> static inline oop load_heap_oop(oop* p) { return *p; }
> ...
> template <class T>
> inline void G1RootRegionScanClosure::do_oop_nv(T* p) {
>   // 1. load 'heap_oop' from 'p'
>   T heap_oop = oopDesc::load_heap_oop(p);
>   if (!oopDesc::is_null(heap_oop)) {
>     // 2. Compiler reloads 'heap_oop' from 'p', which may now be null!
>     oop obj = oopDesc::decode_heap_oop_not_null(heap_oop);
>     HeapRegion* hr = _g1h->heap_region_containing((HeapWord*) obj);
>     _cm->grayRoot(obj, hr);
>   }
> }
>
> Notice that we don't need memory barriers here - all we need is to
> prevent the compiler from loading the oop (i.e. 'heap_oop') a second
> time. After Andrew's explanation (thanks for that!) and Martin's
> examples from Google, I think we could fix this by rewriting
> 'load_heap_oop()' (and friends) as follows:
>
> static inline oop load_heap_oop(oop* p) {
>   oop o = *p;
>   __asm__ volatile ("" : : : "memory");
>   return o;
> }
>
> In order to make this consistent across all platforms, we would
> probably have to introduce a new, public "compiler barrier" function
> in OrderAccess (e.g. 'OrderAccess::compiler_barrier()' because we
> don't currently seem to have a cross-platform concept for
> "compiler-only barriers"). But I'm still not convinced that it would
> be better than simply writing (which is the way we've actually
> solved it internally):
>
> static inline oop load_heap_oop(oop* p) { return * (volatile oop*) p; }
>
> Declaring that single memory location to be 'volatile' seems to be a
> much more local change compared to globally "clobbering" all the
> memory. And it doesn't rely on the compiler providing a compiler
> barrier. It does however rely on the compiler doing the "right thing"
> for volatile - but after all that has been said here so far, that
> seems more likely?
>
> The problem may also depend on the specific compiler/cpu combination.
> For ppc64, both gcc (on linux) and xlc (on aix), do the right thing
> for volatile variables - they don't insert any memory barriers (i.e.
> no instructions) but just access the corresponding variables as if
> there was a compiler barrier. This is exactly what we currently want
> in HotSpot, because fine-grained memory barrier control is handled
> by the use of OrderAccess (and OrderAccess implies a "compiler
> barrier", at least after the latest fixes).
>
> Any thoughts? Should we introduce a cross-platform, "compiler-only
> barrier" or should we stick to using "volatile" for such cases?
>
> Regards,
> Volker
>
> [1] https://www.ibm.com/support/knowledgecenter/SSGH2K_13.1.0/com.ibm.xlc131.aix.doc/language_ref/asm.html
> [2] https://bugs.openjdk.java.net/browse/JDK-8129440
>
> On Mon, May 29, 2017 at 2:20 PM, Erik Österlund
> <erik.osterlund at oracle.com> wrote:
>> Hi Andrew,
>>
>> I just thought I'd put my opinions in here as I see I have been mentioned a
>> few times already.
>>
>> First of all, I find using the volatile keyword on things that are involved
>> in lock-free protocols meaningful from a readability point of view. It
>> allows the reader of the code to see that care is needed here.
>>
>> About the compiler barriers - you are right. Volatile should indeed not be
>> necessary if the compiler barriers do everything right. The compiler should
>> neither reorder accesses across a barrier nor reload values past one.
>>
>> On Windows we rely on the deprecated _ReadWriteBarrier(). According to MSDN,
>> it guarantees:
>>
>> "The _ReadWriteBarrier intrinsic limits the compiler optimizations that can
>> remove or reorder memory accesses across the point of the call."
>>
>> This should cut it.
>>
>> The GCC memory clobber is defined as:
>>
>> "The "memory" clobber tells the compiler that the assembly code performs
>> memory reads or writes to items other than those listed in the input and
>> output operands (for example, accessing the memory pointed to by one of the
>> input parameters). To ensure memory contains correct values, GCC may need to
>> flush specific register values to memory before executing the asm. Further,
>> the compiler does not assume that any values read from memory before an asm
>> remain unchanged after that asm; it reloads them as needed. Using the
>> "memory" clobber effectively forms a read/write memory barrier for the
>> compiler."
>>
>> This seems to only guarantee that values will not be re-ordered. But the
>> documentation for Extended Asm also states:
>>
>> "You will also want to add the volatile keyword if the memory affected is
>> not listed in the inputs or outputs of the asm, as the `memory' clobber does
>> not count as a side-effect of the asm."
>>
>> and
>>
>> "The volatile keyword indicates that the instruction has important
>> side-effects. GCC will not delete a volatile asm if it is reachable. (The
>> instruction can still be deleted if GCC can prove that control-flow will
>> never reach the location of the instruction.) Note that even a volatile asm
>> instruction can be moved relative to other code, including across jump
>> instructions."
>>
>> This is a bit vague, but seems to suggest that by making the asm statement
>> volatile and having a memory clobber, it definitely will not reload
>> variables. As for not re-ordering non-volatile accesses: it shouldn't,
>> but that is not clearly stated. I have never observed such a re-ordering
>> across a volatile memory clobber. But the semantics seem a bit vague.
>>
>> As for clang, the closest thing to a definition of its behaviour that I
>> have seen is:
>>
>> "A clobber constraint is indicated by a “~” prefix. A clobber does not
>> consume an input operand, nor generate an output. Clobbers cannot use any of
>> the general constraint code letters – they may use only explicit register
>> constraints, e.g. “~{eax}”. The one exception is that a clobber string of
>> “~{memory}” indicates that the assembly writes to arbitrary undeclared
>> memory locations – not only the memory pointed to by a declared indirect
>> output."
>>
>> Apart from sweeping statements saying clang inline assembly is largely
>> compatible with and works similarly to GCC's, I have not seen clear
>> guarantees. And then there are more compilers.
>>
>> As a conclusion: by using volatile in addition to OrderAccess you rely on
>> standardized compiler semantics (at least for volatile-to-volatile
>> re-orderings and re-loading - volatile-to-nonvolatile is another can of
>> worms). If instead you rely on the OrderAccess memory model doing what it
>> says it will do, then it should indeed work without volatile - but to make
>> that work, OrderAccess relies on non-standardized, compiler-specific
>> barriers. In practice it should work well on all our supported compilers
>> without volatile. And if it didn't, it would indeed be a bug in OrderAccess
>> that needs to be solved in OrderAccess.
>>
>> Personally though, I am a helmet-on-synchronization kind of person, so I
>> would take precaution anyway and use volatile whenever possible, because 1)
>> it makes the code more readable, and 2) it provides one extra layer of
>> safety that is more standardized. It seems that over the years it has
>> happened multiple times that we assumed OrderAccess was bulletproof,
>> then realized that it wasn't and observed a crash that would never have
>> happened if the code had been written in a helmet-on-synchronization
>> way. At least that's how I feel about it.
>>
>> Now one might argue that by using C++11 atomics, which are standardized,
>> all these problems would go away, as we would rely on standardized
>> primitives and then just trust the compiler. But problems could then arise
>> when the C++ compiler decides to be less conservative than we want, e.g. by
>> not emitting a fence for sequentially consistent loads in order to optimize
>> for non-multiple-copy-atomic CPUs, arguing that IRIW issues that violate
>> sequential consistency are non-issues in practice. That makes those loads
>> "almost" sequentially consistent, which might be good enough. But it feels
>> good to have the choice here to be more conservative. To have the
>> synchronization helmet on.
>>
>> Meta summary:
>> 1) Current OrderAccess without volatile:
>> - should work, but relies on compiler-specific, non-standardized, and
>> sometimes poorly documented compiler barriers.
>>
>> 2) Current OrderAccess with volatile:
>> - relies on standardized volatile semantics to guarantee compiler
>> reordering and reloading issues do not occur.
>>
>> 3) C++11 Atomic backend for OrderAccess:
>> - relies on standardized semantics to guarantee compiler and hardware
>> reordering issues do not occur
>> - nevertheless isn't always flawless, and when it isn't, it gets painful
>>
>> Hope this sheds some light on the trade-offs.
>>
>> Thanks,
>> /Erik
>>
>>
>>> On 2017-05-28 10:45, Andrew Haley wrote:
>>>
>>>> On 27/05/17 10:10, Volker Simonis wrote:
>>>>
>>>>> On Fri, May 26, 2017 at 6:09 PM, Andrew Haley <aph at redhat.com> wrote:
>>>>>
>>>>>> On 26/05/17 17:03, Volker Simonis wrote:
>>>>>>
>>>>>> Volatile not only prevents reordering by the compiler. It also
>>>>>> prevents other, otherwise legal transformations/optimizations (like
>>>>>> for example reloading a variable [1]) which have to be prevented in
>>>>>> order to write correct, lock free programs.
>>>>>
>>>>> Yes, but so do compiler barriers.
>>>>
>>>> Please correct me if I'm wrong, but I thought "compiler barriers" are
>>>> to prevent reordering by the compiler. However, this is a question of
>>>> optimization. If you have two subsequent loads from the same address,
>>>> the compiler is free to do only the first load and keep the value in a
>>>> register if the address is not pointing to a volatile value.
>>>
>>> No it isn't: that is precisely what a compiler barrier prevents. A
>>> compiler barrier (from the POV of the compiler) clobbers all of
>>> the memory state. Neither reads nor writes may move past a compiler
>>> barrier.
>>>
>>