(XS) RFR: 8046210: Missing memory barrier when reading init_lock
Doerr, Martin
martin.doerr at sap.com
Mon Sep 1 10:03:58 UTC 2014
Hi David,
thanks for posting this bug.
PPC64 processors do indeed perform this kind of load reordering across predicted branches.
The sequence "load x, then compare and branch depending on x" is sufficient to order that load with respect to subsequent stores,
but subsequent loads may be executed speculatively and so reorder with the load of x.
The cheapest way to order the load of x with respect to subsequent loads on PPC64 is:
load x, compare and trap (or branch) depending on x, then issue an isync instruction, which discards speculatively executed loads.
This is what we implemented in OrderAccess::load_acquire (orderAccess_linux_ppc.inline.hpp).
Best regards,
Martin
-----Original Message-----
From: hotspot-runtime-dev [mailto:hotspot-runtime-dev-bounces at openjdk.java.net] On Behalf Of David Holmes
Sent: Monday, September 1, 2014 06:49
To: hotspot-runtime-dev at openjdk.java.net
Subject: (XS) RFR: 8046210: Missing memory barrier when reading init_lock
The bug for this is marked confidential so I won't bother with the link.
The init_lock is used to guard the initialization of a class. It is
created by the thread that will do the initialization and cleared (so it
can be gc'd) once initialization is complete. Concurrent attempts to
initialize the class will block on the init_lock; after
initialization is complete that is reflected in the class's state
and no thread will attempt to use the init_lock. Because there is an
inherent race between updating the state (after which the init_lock is no
longer needed) and clearing the init_lock field, the clearing has to be
done under a strong memory barrier - which it is, e.g.:
  this_k->set_init_state(fully_initialized);
  this_k->fence_and_clear_init_lock();
where we have:
void InstanceKlass::fence_and_clear_init_lock() {
  // make sure previous stores are all done, notably the init_state.
  OrderAccess::storestore();
  java_lang_Class::set_init_lock(java_mirror(), NULL);
  assert(!is_not_initialized(), "class must be initialized now");
}
However we overlooked a potential problem with the code that reads the
init_lock and the initialization state. Such code typically has this form:
  oop init_lock = this_k->init_lock();
  ObjectLocker ol(init_lock, THREAD, init_lock != NULL);
  <read initialization state>
  if (initialized) return;
  else do_initialization();
If this is called when initialization is already complete then init_lock
will be NULL and so the ObjectLocker in fact is a noop, and we then read
the initialization state, see it was initialized and return as there is
nothing to do.
What this overlooked was that the compiler and/or hardware might reorder
the loads of the init_lock and the state field, such that we read a
state that indicates initialization is not complete, but we then read
the init_lock as NULL because the initialization has now in fact
completed. This can lead to different failure modes but most typically a
crash because we will call waitUninterruptibly() on the init_lock object
(which is in fact NULL). (Given the 'distance' between the two loads
this was surprising, but we were seeing such crashes on
relaxed-memory-ordering hardware.)
The fix is much simpler than this explanation :) - simply add a loadload()
barrier between the reading of init_lock and the reading of state. As
this can happen in a few places it is internalized into the init_lock()
accessor (just as the fence is internalized into
fence_and_clear_init_lock()).
webrev: http://cr.openjdk.java.net/~dholmes/8046210/webrev/
Dan Daugherty observed that as it is only the NULL value of
init_lock that can potentially lead to a problem, the barrier could be
conditional on that, i.e.:
  if (init_lock == NULL)
    OrderAccess::loadload();
I haven't done that because my suspicion is that the presence of the
loadload(), even in a conditional branch, will actually effect the same
barrier. For architectures where loadload() is simply a compiler barrier
(eg x86) I suspect the presence of the condition will prevent the
compiler from doing a reorder: if the branch to the loadload() is
taken, then any optimistic load of the initialization state would have
to be thrown away and a second load issued. Similar logic holds for the
hardware case - will the hardware issue reordered loads based on branch
prediction only to have to reissue the load if the prediction was wrong?
I suspect not, but it is speculation on my part. I'd be interested to
hear from any of our optimization experts on that count. It may also be
that (for x86 at least) the branch logic would be more expensive than
unconditionally doing the atomic store to a stack variable?
Thanks,
David