(XS) RFR: 8046210: Missing memory barrier when reading init_lock

Doerr, Martin martin.doerr at sap.com
Mon Sep 1 12:20:09 UTC 2014


Hi David,

I wouldn't expect a performance problem here. So I think we're ok with the loadload() barrier.

If the case in which the barrier is needed is rare, the branch should cost less than the lwsync which loadload() executes on PPC64.
However, the branch is probably not beneficial on total-store-order platforms; it depends on whether the branch unit or the load/store units are the busier resource.
Maybe somebody else knows more about x86.
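
For concreteness, the two variants being weighed would look roughly like
this inside the accessor (a sketch only, not the actual webrev code):

    oop lock = java_lang_Class::init_lock(java_mirror());
    OrderAccess::loadload();    // unconditional: always an lwsync on PPC64

versus

    oop lock = java_lang_Class::init_lock(java_mirror());
    if (lock == NULL) {
      OrderAccess::loadload();  // conditional: only the NULL case needs
    }                           // the ordering, as Dan noted below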

Best regards,
Martin


-----Original Message-----
From: David Holmes [mailto:david.holmes at oracle.com] 
Sent: Monday, 1 September 2014 13:17
To: Doerr, Martin; hotspot-runtime-dev at openjdk.java.net
Cc: Volker Simonis
Subject: Re: (XS) RFR: 8046210: Missing memory barrier when reading init_lock

Hi Martin,

On 1/09/2014 8:03 PM, Doerr, Martin wrote:
> Hi David,
>
> thanks for posting this bug.
>
> PPC64 processors do indeed perform this kind of load reordering across predicted branches.
>
> A load of x followed by a compare-and-branch depending on x is sufficient to order that load with respect to succeeding stores,
> but succeeding loads may still be executed speculatively and reorder with the load of x.
>
> The cheapest way to order the load of x with respect to succeeding loads on PPC64 is:
> load x, compare and trap or branch depending on x, then issue an isync instruction which prevents speculative loads.
> This is what we implemented in OrderAccess::load_acquire (orderAccess_linux_ppc.inline.hpp).
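>
> In sketch form (a simplified version, assuming GCC inline assembly; the
> real code is in the file named above):
>
>    #include <stdint.h>
>
>    inline intptr_t load_acquire(const volatile intptr_t* p) {
>      intptr_t v = *p;
>      // twi 0,Rv,0 never traps, but it depends on v and must be
>      // resolved first; the isync then discards any loads that were
>      // speculatively executed ahead of it.
>      __asm__ __volatile__ ("twi 0,%0,0\n\tisync" : : "r" (v) : "memory");
>      return v;
>    }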

Yes, we know this pattern too. In this case putting a load_acquire at the 
point of the actual load is a bit more awkward, hence the separation. Do 
you have concerns about issuing a separate loadload() barrier here?

Thanks,
David

> Best regards,
> Martin
>
>
> -----Original Message-----
> From: hotspot-runtime-dev [mailto:hotspot-runtime-dev-bounces at openjdk.java.net] On Behalf Of David Holmes
> Sent: Monday, 1 September 2014 06:49
> To: hotspot-runtime-dev at openjdk.java.net
> Subject: (XS) RFR: 8046210: Missing memory barrier when reading init_lock
>
> The bug for this is marked confidential so I won't bother with the link.
>
> The init_lock is used to guard the initialization of a class. It is
> created by the thread that will do the initialization and cleared (so it
> can be gc'd) once initialization is complete. Concurrent attempts to
> initialize the class will block on the init_lock, then after
> initialization is complete that will be reflected in the class's state
> and no thread will attempt to use the init_lock. Because there is an
> inherent race between updating the state (such that the init_lock is no
> longer needed) and clearing the init_lock field, the clearing has to be
> done under a strong memory barrier - which it is, eg:
>
>     this_k->set_init_state (fully_initialized);
>     this_k->fence_and_clear_init_lock();	
>
> where we have:
>
> void InstanceKlass::fence_and_clear_init_lock() {
>      // make sure previous stores are all done, notably the init_state.
>      OrderAccess::storestore();
>      java_lang_Class::set_init_lock(java_mirror(), NULL);
>      assert(!is_not_initialized(), "class must be initialized now");
> }
>
> However we overlooked a potential problem with the code that reads the
> init_lock and the initialization state. Such code typically has this form:
>
>    oop init_lock = this_k->init_lock();
>    ObjectLocker ol(init_lock, THREAD, init_lock != NULL);
>    <read initialization state>
>    if (initialized) return;
>    else do_initialization();
>
> If this is called when initialization is already complete then init_lock
> will be NULL, and so the ObjectLocker is in fact a no-op; we then read
> the initialization state, see that it was initialized, and return as
> there is nothing to do.
>
> What this overlooked was that the compiler and/or hardware might reorder
> the loads of the init_lock and the state field, such that we read a
> state that indicates initialization is not complete, but we then read
> the init_lock as NULL because the initialization has now in fact
> completed. This can lead to different failure modes but most typically a
> crash because we will call waitUninterruptibly() on the init_lock object
> (which is in fact NULL). (Given the 'distance' between the two loads
> this was surprising, but we were seeing such crashes on
> relaxed-memory-ordering hardware.)
>
> The fix is much simpler than this explanation :) simply add a loadload()
> barrier between the reading of init_lock and the reading of state. As
> this can happen in a few places it is internalized into the init_lock()
> accessor (just as the fence is internalized into
> fence_and_clear_init_lock()).
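>
> In sketch form the accessor then becomes (the webrev below has the
> actual change):
>
>    oop InstanceKlass::init_lock() const {
>      oop lock = java_lang_Class::init_lock(java_mirror());
>      // Keep a subsequent load of the init state from floating above
>      // the load of the lock.
>      OrderAccess::loadload();
>      return lock;
>    }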
>
> webrev: http://cr.openjdk.java.net/~dholmes/8046210/webrev/
>
> Dan Daugherty observed that as it is only the NULL value of
> init_lock that can potentially lead to a problem, the barrier could be
> conditional on that, ie:
>
> if (init_lock == NULL)
>     OrderAccess::loadload();
>
> I haven't done that because my suspicion is that the presence of the
> loadload(), even in a conditional branch, will actually effect the same
> barrier. For architectures where loadload() is simply a compiler barrier
> (eg x86) I suspect the presence of the condition will prevent the
> compiler from reordering, because if the branch to the loadload() is
> taken then any optimistic load of the initialization state would have
> to be thrown away and a second load issued. Similar logic holds for the
> hardware case - will the hardware issue reordered loads based on branch
> prediction only to have to reissue the load if the prediction was wrong?
> I suspect not, but it is speculation on my part. I'd be interested to
> hear from any of our optimization experts on that count. It may also be
> that (for x86 at least) the branch logic would be more expensive than
> unconditionally doing the atomic store to a stack variable?
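>
> (For reference, a compiler-only barrier of the kind x86 needs here can
> be as little as an empty asm statement with a "memory" clobber - it
> emits no machine instruction, it just forces the compiler to discard
> any loads it has cached across that point:
>
>    __asm__ __volatile__ ("" : : : "memory");
>
> whereas on PPC64 loadload() expands to a real lwsync instruction.)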
>
> Thanks,
> David
>

