RFR: 8181085: Race condition in method resolution may produce spurious NullPointerException
Erik Österlund
erik.osterlund at oracle.com
Tue May 30 10:57:52 UTC 2017
On 2017-05-30 10:50, Andrew Haley wrote:
> On 29/05/17 13:20, Erik Österlund wrote:
>
>> As a conclusion, by using volatile in addition to OrderAccess you
>> rely on standardized compiler semantics (at least for
>> volatile-to-volatile re-orderings and re-loading, but not for
>> volatile-to-nonvolatile, but that's another can of worms), and
>> regrettably if you rely on OrderAccess memory model doing what it
>> says it will do, then it should indeed work without volatile, but to
>> make that work, OrderAccess relies on non-standardized
>> compiler-specific barriers. In practice it should work well on all
>> our supported compilers without volatile. And if it didn't, it would
>> indeed be a bug in OrderAccess that needs to be solved in
>> OrderAccess.
> Right. And, target by target, we can rework OrderAccess to use real
> C++11 atomics, and everything will be better. Eventually.
I do not completely disagree, but I see drawbacks with that too. I am
not convinced C++11 is a silver bullet.
Note that we lose some explicit control that might end up biting us. And
when it does, it will be even harder to detect as we have sold ourselves
to the C++11 atomic silver bullet, abstracting away the generated code.
For example, C++11 atomic accesses were designed to play nicely with
other C++11 atomic accesses. Both the load-side and store-side have to
look in very specific ways for e.g. the seq_cst semantics to hold. For
example, depending on whether you want seq_cst to have IRIW constraints or not,
some PPC compiler could choose to have the sync instruction on either
the load side or the store side. Since all seq_cst accesses are
controlled by C++11 and go through their compiler, they can make that
choice as the accesses stay inside of their "ABI". But the choice needs
to be consistent with the choice we make in the JVM and our hand crafted
assembly. That is, our hand crafted assembly code has to go by the same
"ABI". And we can no longer guarantee it does as we have lost control
over what instructions are generated.
One concrete example that comes to mind is the JNIFastGetField
optimization on ARMv7.
The memory model of ARMv7 does not respect causality between loads and
stores. Therefore, in theory (and maybe in practice), problems can arise
when three threads are involved in a synchronization dance where
consistent causality chains are assumed.
In the JNIFastGetField optimization we do the following with hand coded
assembly:
1) load safepoint counter (written by VM thread)
2) speculatively load primitive value from object (possibly clobbered by
a GC thread)
3) load safepoint counter again (written by VM thread) and check it did
not change
These loads are all normal loads in hand coded assembly.
Now for this synchronization to work, it is assumed that the store to
the safepoint counter observed at 1) happens-before the store observed
by the speculative load of the primitive value at 2). Due to the lack of
causality in the memory model, this is explicitly not guaranteed to hold
with normal loads and stores on ARMv7, and hence unless we had proper
synchronization in the runtime, we could observe clobbered values from
these optimized JNI getters and think they are okay. But since our
OrderAccess::fence translates to dmb sy specifically (which is
conservative), the store will bubble up to the top level of the
hierarchical memory model of ARMv7, and therefore we can break the
pathological causality chain issue in the JVM by issuing
OrderAccess::fence when storing the safepoint counter values. That way,
our hand crafted assembly will work with normal loads in the fast path.
If OrderAccess::fence translated to anything other than dmb sy, this
would break. So if we went with the proposed C++11 mappings from
https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html then e.g. a
seq_cst store on ARMv7 would translate into dmb ish; str; dmb ish, and
that would not suffice to break the causality chain.
Of course, it might be that the OS layer saved us from the above
pathology anyway, either intentionally or unintentionally, but that is
besides the point I am trying to make. The point is that the hand coded
assembly and other dynamically generated code has to emit accesses that
are compatible with the machine code generated by OrderAccess in the
runtime. And when we give that control away to C++11 and abstract away
the generated machine code, things might go horribly wrong in the most
unexpected and obscure ways that I think would be a nightmare to debug.
Having said that, I am not arguing that C++11 is a bad idea either. I
would just like to balance out the view that C++11 is a synchronization
silver bullet for the JVM that is simply a superior solution without any
pitfalls and that doing anything else is wrong. There are things to be
considered there as well, like the extent of possible ABI incompatibilities.
> It's important that we do so because racy accesses are undefined
> behaviour in C++11. (And, arguably, before that, but I'm not going to
> go there.)
What paragraph are we referring to here that would break OrderAccess in
C++11?
>> Personally though, I am a helmet-on-synchronization kind of person,
>> so I would take precaution anyway and use volatile whenever
>> possible, because 1) it makes the code more readable, and 2) it
>> provides one extra layer of safety that is more standardized. It
>> seems that over the years it has happened multiple times that we
>> assumed OrderAccess is bullet proof, and then realized that it
>> wasn't and observed a crash that would never have happened if the
>> code was written in a helmet-on-synchronization way. At least that's
>> how I feel about it.
> I have no problem with that. What I *do* have a problem with is the
> use of volatile to fix bugs that really need to be corrected with
> proper barriers.
I think you misunderstood me here. I did not propose to use volatile so
we don't have to fix bugs in OrderAccess. On the contrary, I said if we find
such issues, we should definitely fix them in OrderAccess. But despite
that, I personally like the pragmatic safety approach, and would use
volatile in my lock-free code anyway to make it a) more readable, and b)
provide an extra level of safety against our increasingly aggressive
compilers. It's like wearing a helmet when biking. You don't expect to
fall and should not fall, but why take risks if you don't have to and
there is an easy way of preventing disaster if that happens. At least
that's how I think about it myself.
>> Now one might argue that by using C++11 atomics that are
>> standardized, all these problems would go away as we would rely on
>> standardized primitives and then just trust the compiler.
> And I absolutely do argue that. In fact, it is the only correct way
> to go with C++11 compilers. IMO.
Not entirely convinced that statement is true as I think I mentioned
previously.
>> But then there could arise problems when the C++ compiler decides to
>> be less conservative than we want, e.g. by not doing fence in
>> sequentially consistent loads to optimize for non-multiple copy
>> atomic CPUs arguing that IRIW issues that violate sequential
>> consistency are non-issues in practice.
> A C++ compiler will not decide to do that. C++ compiler authors know
> well enough what sequential consistency means. Besides, if there is
> any idiom in the JVM that actually requires IRIW we should remove it
> as soon as possible.
With the risk of going slightly off topic... C++ compiler authors have
indeed done that in the past. And I have a huge problem with this. I
think the exposed model semantics need to be respected. If the model
says seq_cst, then the generated code should be seq_cst and not "almost"
seq_cst. You can't expect users of a memory model to have to know that
it intentionally (rather than accidentally) violates the guarantees
because it was considered close enough and that nobody should be able to
observe the difference in practice. (don't get me started on that one)
The issue is not whether an algorithm depends on IRIW or not. The issue
is that we have to explicitly reason about IRIW to prove that it works.
The lack of IRIW constraints violates seq_cst and by extension
linearization points that rely on seq_cst, and by extension algorithms
that rely on
linearization points. By breaking the very building blocks that were
used to reason about algorithms and their correctness, we rely on chance
for it to work. The algorithm may or may not work. It probably does work
without IRIW constraints in the vast majority of cases. But we have to
explicitly reason about that expanded state machine of possible races
caused by IRIW issues to actually know that it works rather than leaving
it to chance. Reasoning about this extended state machine can take a lot
of work and puts the bar unreasonably high for writing synchronized code
in my opinion. And I think the alternative of leaving it to chance
(albeit with good odds) seems like an unfortunate choice.
>> That makes those loads "almost" sequentially consistent, which might
>> be good enough. But it feels good to have a choice here to be more
>> conservative. To have the synchronization helmet on.
> I have no real problem with that. Using volatile has the problem,
> from my point of view, that it might conceal bugs that would be
> revealed on a weakly-ordered machine that you or I then have to fix,
> but I can live with it.
I do not see how using volatile has anything to do with weakly ordered
machines. We use it where it is specifically compiler reorderings that
need to be prevented.
If it is not just a compiler reordering that needed to be prevented,
then of course the use of volatile is incorrect and a bug.
Either way, relying on C++11 atomics might also conceal bugs that would
be revealed on a weakly-ordered machine due to conflicting ABIs between
the statically generated C++ code and the dynamically generated code, as
previously mentioned.
Thanks,
/Erik