[aarch64-port-dev ] DMB elimination in C2 synchronization implementation

Wei Tang wei.tang at linaro.org
Tue Sep 1 15:06:44 UTC 2015


Hi All,

  We investigated the aarch64 C2 synchronization implementation recently and
found some room for improvement; please take a look.

Attached are the patch and some supporting figures.



In the current aarch64 C2 compiler, a DMB instruction is inserted after lock
acquisition and before lock release (see the attached figure LockingPaths.jpg,
GraphKit::shared_lock and GraphKit::shared_unlock) as a barrier to prevent
loads and stores from floating out of the critical region.

One DMB post-dominates the biased lock, thin lock, and inflated lock
acquisition blocks before entering the critical region, and another DMB
dominates all the successor blocks of the critical region. On some paths the
DMB is unnecessary and hurts performance.

A proposed patch is attached that removes the unnecessary DMBs while keeping
accesses to the critical region safe.


First of all, we think a CompareAndSwap implemented with
load-acquire/store-release as follows (already implemented in
MacroAssembler::cmpxchgptr), when used to acquire or release a lock, is
sufficient to act as a barrier instead of an explicit DMB. Please help review
this point.


cmpxchgptr (oldv, newv, addr) sequence:

L_retry:
        ldaxr tmp, addr
        cmp   tmp, oldv
        bne   L_nope
        stlxr tmp, newv, addr
        cbzw  tmp, L_succeed
        b     L_retry
L_nope:

A similar code snippet can be found in the ARM® Architecture Reference Manual
(DDI0487A_g, J10.3.1 Acquiring a lock).
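
To state the same claim in higher-level terms, below is a minimal sketch (not
HotSpot code; the names are ours) of a lock built only from an acquire CAS and
a release store. With GCC on aarch64 (without LSE atomics) the CAS typically
expands to an ldaxr/stlxr loop and the release store to stlr, so accesses
inside the critical region cannot float out and no explicit DMB is needed:

#include <atomic>

struct SpinLock {
  std::atomic<int> state{0};            // 0 = free, 1 = held

  void lock() {
    int expected = 0;
    // Acquire CAS: later loads/stores cannot be hoisted above it.
    while (!state.compare_exchange_weak(expected, 1,
                                        std::memory_order_acquire,
                                        std::memory_order_relaxed)) {
      expected = 0;                     // reset after a failed CAS
    }
  }

  void unlock() {
    // Release store: earlier loads/stores cannot sink below it.
    state.store(0, std::memory_order_release);
  }
};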



The attached patch removes the DMBs surrounding the critical region. We then
ensure that each path entering/leaving a critical region is protected by
load-acquire/store-release:



*Path 1 & 2 - biased locking/unlocking*

Locking:

Path 1 - When the lock object is biased to the current thread, the DMB is
unnecessary because the current thread already holds the lock.

Path 2 - When the lock object is not biased to the current thread, rebias
takes place:

If UseOptoBiasInlining is true, rebias is implemented with StoreXConditional,
which is mapped to aarch64_enc_cmpxchg in the aarch64.ad file. The ldxr
instruction used in the CompareAndSwap sequence of aarch64_enc_cmpxchg has no
barrier effect, so we create a new aarch64_enc_cmpxchg_acq. The change is the
same as the DMB patch submitted recently
(http://cr.openjdk.java.net/~adinn/8080293/webrev.00/): replace ldxr with
ldaxr so that it serves as a barrier in the CompareAndSwap sequence.

If UseOptoBiasInlining is false, MacroAssembler::biased_locking_enter is
invoked to acquire the lock. It already uses load-acquire/store-release and is
safe without an explicit DMB.
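
To make the difference between the plain and the acquiring CAS encodings
concrete, here is a minimal sketch in C++ terms (our own wrapper names, not
the actual .ad encodings); the only difference is the memory order on the
load half, which selects ldxr versus ldaxr on aarch64:

#include <atomic>
#include <cstdint>

// Roughly what aarch64_enc_cmpxchg provides: no ordering, so GCC would
// typically emit an ldxr/stlxr loop.
bool cas_plain(std::atomic<intptr_t>* word, intptr_t oldv, intptr_t newv) {
  return word->compare_exchange_strong(oldv, newv,
                                       std::memory_order_relaxed);
}

// Roughly what the new aarch64_enc_cmpxchg_acq provides: acquire ordering,
// so GCC would typically emit an ldaxr/stlxr loop.
bool cas_acquire(std::atomic<intptr_t>* word, intptr_t oldv, intptr_t newv) {
  return word->compare_exchange_strong(oldv, newv,
                                       std::memory_order_acquire,
                                       std::memory_order_relaxed);
}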

Unlocking:

Biased unlocking is a no-op, so no special handling is needed.



*Path 3 – Thin lock/unlock*

Locking:

Thin lock acquire is implemented in aarch64_enc_fast_lock. It uses a simple
CAS sequence without generating any barrier and relies on the DMB generated
by the membar_acquire_lock inserted in GraphKit::shared_lock. As described
above, a load-acquire/store-release pair is sufficient to serve as a barrier
instead of an explicit DMB, so we suggest using an ldaxr-stlxr pair as the
following code shows:

L_retry:
        ldaxr r1, obj->markOOP     // change ldxr to ldaxr
        cmp   r1, unlocked_markword
        bne   thin_lock_fail
        stlxr tmp, lock_record_address, obj->markOOP
        cbzw  tmp, L_cont
        b     L_retry
L_cont:

Unlocking:

Thin lock release is implemented in aarch64_enc_fast_unlock, as the code
snippet below shows. We think an ldxr-stlxr pair is enough for lock release,
since the release semantics of stlxr keep accesses in the critical region from
sinking below the store that releases the lock, so no special handling is
needed after removing the DMB.

L_retry:
        ldxr  r1, obj->markOOP
        cmp   r1, lock_record_address
        bne   thin_lock_fail
        stlxr tmp, disp_hdr, obj->markOOP   // store back the displaced header
        cbnzw tmp, L_retry
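
For reference, the thin lock/unlock pair can be expressed in C++ memory-order
terms as in the minimal sketch below (hypothetical names and layout, not the
real mark word encoding): the lock CAS uses acquire ordering (ldaxr/stlxr)
while the unlock CAS needs only release ordering (ldxr/stlxr), matching the
two sequences above:

#include <atomic>
#include <cstdint>

typedef intptr_t markword_t;

bool thin_lock(std::atomic<markword_t>* mark, markword_t unlocked_markword,
               markword_t lock_record_address) {
  markword_t expected = unlocked_markword;
  // Acquire on success: critical-region accesses cannot be hoisted above it.
  return mark->compare_exchange_strong(expected, lock_record_address,
                                       std::memory_order_acquire,
                                       std::memory_order_relaxed);
}

bool thin_unlock(std::atomic<markword_t>* mark, markword_t lock_record_address,
                 markword_t displaced_header) {
  markword_t expected = lock_record_address;
  // Release on success: critical-region accesses cannot sink below it.
  return mark->compare_exchange_strong(expected, displaced_header,
                                       std::memory_order_release,
                                       std::memory_order_relaxed);
}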


*Path 4 – ObjectMonitor lock/unlock*

Locking:

ObjectMonitor lock invokes the corresponding VM function
SharedRuntime::complete_monitor_locking_C. Based on our investigation, all
lock acquire operations are achieved by Atomic::cmpxchg_ptr, which calls the
native function __sync_val_compare_and_swap and expands to the following code.
The load-acquire/store-release pair is sufficient to serve as a barrier.

a)  Atomic::cmpxchg_ptr calls __sync_val_compare_and_swap

.L3:
        ldaxr   x0, [x1]
        cmp     x0, x2
        bne     .L4
        stlxr   w4, x3, [x1]
        cbnz    w4, .L3
.L4:
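
For context, a minimal sketch (hypothetical wrapper, not the HotSpot source)
of how Atomic::cmpxchg_ptr boils down to the builtin; with GCC on aarch64 the
__sync builtin is a full acquire/release barrier and expands to the
ldaxr/stlxr loop shown above:

#include <cstdint>

intptr_t cmpxchg_ptr_sketch(intptr_t exchange_value,
                            volatile intptr_t* dest,
                            intptr_t compare_value) {
  // Returns the value previously at *dest; the swap happened iff the
  // return value equals compare_value.
  return __sync_val_compare_and_swap(dest, compare_value, exchange_value);
}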



Unlocking:

ObjectMonitor unlock invokes the corresponding VM function
SharedRuntime::complete_monitor_unlocking_C. All release operations are
achieved by Atomic::cmpxchg_ptr and by OrderAccess::release_store_ptr +
OrderAccess::storeload. Atomic::cmpxchg_ptr has been discussed above. On the
aarch64 platform, OrderAccess::release_store_ptr and OrderAccess::storeload
map to the stlr and DMB instructions respectively; these two are enough to
serve as a barrier during lock release. Please refer to the attached flow
graph.

a)  Atomic::cmpxchg_ptr calls __sync_val_compare_and_swap

       Same as above

b)  OrderAccess::release_store_ptr calls __atomic_store

       stlr    x1, [x0]

c)  OrderAccess::storeload() is the same as OrderAccess::fence(); both call
__sync_synchronize

       dmb     ish
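
A minimal sketch (hypothetical wrapper names) of those two primitives using
GCC builtins; on aarch64 the release store compiles to stlr and the full
fence to dmb ish, matching the listings above:

#include <cstdint>

void release_store_ptr_sketch(volatile intptr_t* dest, intptr_t value) {
  __atomic_store_n(dest, value, __ATOMIC_RELEASE);   // compiles to stlr
}

void storeload_sketch() {
  __sync_synchronize();                              // compiles to dmb ish
}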

