RFR: 8181085: Race condition in method resolution may produce spurious NullPointerException
Andrew Haley
aph at redhat.com
Mon Jun 5 15:07:15 UTC 2017
On 05/06/17 12:52, Erik Österlund wrote:
> Hi Andrew,
>
> On 2017-06-03 11:00, Andrew Haley wrote:
>> On 30/05/17 11:57, Erik Österlund wrote:
>>
>>> The issue is not whether an algorithm depends on IRIW or not. The
>>> issue is that we have to explicitly reason about IRIW to prove that
>>> it works. The lack of IRIW violates seq_cst and by extension
>>> linearization points that rely on seq_cst, and by extension
>>> algorithms that rely on linearization points. By breaking the very
>>> building blocks that were used to reason about algorithms and their
>>> correctness, we rely on chance for it to work...
>>
>> I've been thinking about this some more, and one thing you said has
>> been baffling me. I don't think that I have come across anywhere in
>> HotSpot that uses seq_cst internally, and I don't think there is even
>> any support for it in OrderAccess. The only thing that I know of
>> where we actually need seq.cst is in the C++ interpreter and JNI's
>> support for volatile fields, and there we simulate seq.cst by using
>> release_fence; store; full fence.
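For concreteness, here is a minimal C++11 sketch of that simulation
pattern (the names `field`, `volatile_store` and `volatile_load` are
illustrative, not HotSpot's OrderAccess API):

```cpp
#include <atomic>

// Hypothetical stand-in for a Java volatile field slot.
std::atomic<int> field{0};

// Simulated seq-cst volatile store, following the
// "release fence; store; full fence" pattern described above,
// expressed with C++11 fences rather than OrderAccess.
void volatile_store(int v) {
  std::atomic_thread_fence(std::memory_order_release); // order prior accesses before the store
  field.store(v, std::memory_order_relaxed);           // the store itself
  std::atomic_thread_fence(std::memory_order_seq_cst); // full fence after the store
}

// The matching volatile load needs at least acquire ordering.
int volatile_load() {
  return field.load(std::memory_order_acquire);
}
```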
>
> Yes, I miss sequential consistency in hotspot too. In fact, my mouse pad
> has "sequential consistency <3" printed on it. :) For atomic operations
> such as CAS, we do use what we refer to as conservative memory ordering,
> which appears to be closely related to SC.
>
> Other than that, the main idea with SC is on an abstract level to have
> (at least) the guarantees of the weaker acquire/release as well as the
> effect of a full fence between any pair of SC accesses so that they can
> agree upon some total ordering of SC accesses (let's not dig too far
> into details and interactions between SC and acquire/release on the same
> atomic objects just yet). In hotspot, rather than having such semantics,
> we have instead elected to accomplish that with interleaved
> OrderAccess::fence() calls where required. In the cases you
> mentioned, the fence is guarded with if statements checking whether
> "support_IRIW_for_not_multiple_copy_atomic_cpu" is set to control the
> convention of where the fence is placed.
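A sketch of that guarded convention in C++11 terms (the flag name
follows HotSpot's support_IRIW_for_not_multiple_copy_atomic_cpu; the
functions themselves are hypothetical, and a constexpr flag stands in
for the real per-CPU definition):

```cpp
#include <atomic>

// Hypothetical flag: true on non-multiple-copy-atomic CPUs (e.g. PPC),
// where the full fence is placed on the load side; false where it is
// placed on the store side. Illustrative only, not HotSpot's API.
constexpr bool support_IRIW_for_not_multiple_copy_atomic_cpu = false;

std::atomic<int> slot{0};

void sc_store(int v) {
  slot.store(v, std::memory_order_release);
  if (!support_IRIW_for_not_multiple_copy_atomic_cpu)
    std::atomic_thread_fence(std::memory_order_seq_cst); // trailing fence after the store
}

int sc_load() {
  if (support_IRIW_for_not_multiple_copy_atomic_cpu)
    std::atomic_thread_fence(std::memory_order_seq_cst); // leading fence before the load
  return slot.load(std::memory_order_acquire);
}
```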
>
> I believe and hope that if we would have had an SC memory ordering, it
> would have been used more.
>
>> But we could do that with the
>> seq.cst C++11 functions instead.
> Or build our own SC bindings.
True: I'd like us to be able to define what we need on a per-CPU basis
rather than defining things in the shared code, so that we really can
use the appropriate instructions everywhere, and every target does
what we need as well as it can.
In particular, ARM listened to our pleas about sequential consistency
and gave us what we asked for in v8, and I'd like to take advantage of
the results.
> In fact, my new Access API does come with sequential consistency as
> one memory ordering constraint. Hopefully that will not be shot
> down...
I don't believe that we should use SC where acq/rel is sufficient: the
efficiency hit on some hardware is not nice.
> The benefit of rolling our own SC mappings is that we control the ABI
> and conventions. Some immediate benefits are:
>
> 1) It is compliant with our dynamically generated code interacting with
> our runtime. For example, the most popular bindings for SC on
> non-multiple copy atomic CPUs seems to be either what is commonly
> referred to as "leading sync" or "trailing sync" conventions where a
> full fence is placed either before or after every SC access to ensure
> every two SC accesses would be interleaved with a full fence. Both the
> leading sync and trailing sync conventions were proven correct [1]. But
> subsequently the proof was shown to be flawed, and it turns out that the
> trailing sync bindings are not always safe as it may violate e.g. IRIW
> when weaker acquire accesses are combined with SC accesses on the same
> locations [2].
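For reference, the IRIW (independent reads of independent writes) shape
being discussed can be sketched in C++11; with seq_cst on all four
accesses the two readers must agree on the order of the two independent
stores. A sketch, not HotSpot code:

```cpp
#include <atomic>
#include <thread>

// IRIW litmus test: two independent writers, two readers that read the
// two locations in opposite orders. Under seq_cst, all threads agree on
// a single total order of the stores, so the readers cannot disagree
// about which store happened first.
struct IriwResult { int r1, r2, r3, r4; };

IriwResult run_iriw_once() {
  std::atomic<int> x{0}, y{0};
  IriwResult r{};
  std::thread w1([&] { x.store(1, std::memory_order_seq_cst); });
  std::thread w2([&] { y.store(1, std::memory_order_seq_cst); });
  std::thread rd1([&] {
    r.r1 = x.load(std::memory_order_seq_cst);
    r.r2 = y.load(std::memory_order_seq_cst);
  });
  std::thread rd2([&] {
    r.r3 = y.load(std::memory_order_seq_cst);
    r.r4 = x.load(std::memory_order_seq_cst);
  });
  w1.join(); w2.join(); rd1.join(); rd2.join();
  return r;
}
```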
That doesn't surprise me.
> Yet the currently recommended ARMv7 bindings appear to be trailing
> sync [3], because it has been determined to be faster. Recently,
> both trailing sync and leading sync bindings have been proposed that
> correct the currently known flaws of C++11 seq_cst [4]. These
> bindings are not necessarily compliant with lock-free races between
> our runtime and our dynamically generated code.
>
> 2) The trailing sync and leading sync conventions used by C++11
> compilers are not ABI compliant.
I don't understand what you mean by this.
> And there is no real way for us to know if a leading sync or
> trailing sync convention is used by the C++ compiler without full-on
> selling our souls to uncontracted compiler internals. Any subsequent
> compiler upgrade might break the shady contract we thought we had
> with our dynamically generated code. This is a fundamental issue
> with relying on C++11 atomics.
OK. But where we do know, we should be able to override it in the
CPU-dependent code.
> 3) We can be compliant with our own JMM and not have to think about the
> interactions and differences between our JMM and the evolving C++11
> memory model when we write synchronized code.
Kinda sorta, but in the longer term we will really need to be able to
interwork with C++ code that uses atomics.
> 4) Rather than having trailing sync on some architectures and leading
> sync on others leading to observably different behaviour when combined
> with dynamically generated code, I would ideally like to have the same
> conventions on all machines so that any such behavioural difference for
> generated code looks the same on all platforms. This is not the case for
> C++11 atomics.
And I really would like this *not* to happen. A honking great full
fence after every volatile store is not my idea of a good time.
> Currently recommended bindings use leading sync for PPC and trailing
> sync for ARMv7 [3]. And they will behave slightly differently.
>
> 5) The ARM manuals talk about "visibility" of stores as parts of its
> contract - a property that is too specific to fit into the C++11
> memory model, yet might be useful for a JVM that needs to interact
> with dynamically generated code. In particular, if some ARMv8 stlr
> instruction guarantees "visibility" (which I think can be better
> thought of as not reordering with any subsequent accesses in program
> order as a memory model should preferably talk about constraints
> between memory access orderings), then this is equivalent to always
> having a trailing fence() on SC stores,
But consider this:

  volatile x, y

  Thread 1        Thread 2
  x = 1           y = 1
  r1 = y          r1 = x

On the ARMv8 memory model, we can't do this as

  mov r0, #1      mov r0, #1
  stlr r0, [x]    stlr r0, [y]
  ldr r1, [y]     ldr r1, [x]
  dmb             dmb
In this case, clearly the DMB instructions have no effect. In order
to get SC we need to match the STLR with LDAR. STLR is not equivalent
to always having a trailing fence, but we can recover SC by using a
DMB before the load.
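In C++11 terms, the example above is the classic store-buffering litmus
test: making both the stores and the loads seq_cst (the analogue of
pairing STLR with LDAR, or of a DMB before each load) forbids the
outcome where both threads read zero. A sketch, not HotSpot code:

```cpp
#include <atomic>
#include <thread>
#include <utility>

// Store-buffering litmus test matching the "volatile x, y" example:
// each thread stores to one location and loads the other. With seq_cst
// on all four accesses, r1 == 0 && r2 == 0 is forbidden.
std::pair<int, int> run_sb_once() {
  std::atomic<int> x{0}, y{0};
  int r1 = -1, r2 = -1;
  std::thread t1([&] {
    x.store(1, std::memory_order_seq_cst);
    r1 = y.load(std::memory_order_seq_cst);
  });
  std::thread t2([&] {
    y.store(1, std::memory_order_seq_cst);
    r2 = x.load(std::memory_order_seq_cst);
  });
  t1.join(); t2.join();
  return {r1, r2};
}
```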
> and hence unlike e.g. a C++11 leading sync SC store binding for
> PPC. I would argue we do want this stronger property as our current
> generated code occasionally depends on that.
>
> 6) We can decide upon what properties we think are important
> independently of which direction C++ decides to go. They can e.g. decide
> to buy weaker consistency for a failing CAS in the name of micro
> optimization, and we can elect not to buy that bag of problems.
They already did. And rightly so, IMO.
> Hope you bought my car salesman pitch.
>
>> Of course I have a motive for digging into this: I'm the lead of the
>> AArch64-port project. That processor (uniquely?) has real support for
>> seq.cst, and it'd be nice to use it without throwing a full fence at
>> every volatile store.
>
> I understand your motivation. At least I think we both want some kind of
> SC semantics in hotspot in one shape or another. :)
Indeed, but not everywhere: much of the time acq/rel will do.
--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
More information about the jdk10-dev mailing list