[aarch64-port-dev ] barrier issue in cmpxchgptr implementation

Tue Jun 30 14:34:12 UTC 2015

Hi All,
  I looked at the MacroAssembler::cmpxchgptr implementation in the JVM and found that it uses load-acquire/store-release followed by a full memory barrier to simulate the behavior of the x86 cmpxchg.
The code from that function is below for reference. In my opinion, cmpxchg must provide full bidirectional fence semantics. Can anyone explain why a single
membar at the end is enough for cmpxchg? Or does the commented-out memory barrier at the beginning need to be added as well?

  // membar(AnyAny)  // is this needed?
retry_load:
  ldaxr(tmp, addr);
  cmp(tmp, oldv);
  br(Assembler::NE, nope);
  stlxr(tmp, newv, addr);
  cbzw(tmp, succeed);
  b(retry_load);
nope:
  membar(AnyAny);
  mov(oldv, tmp);

I also checked the source of __cmpxchg_mb, where two memory barriers appear to be placed around the cmpxchg:

smp_mb();
ret = __cmpxchg(ptr, old, new, size);
smp_mb();
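
To make the comparison concrete, the same "barrier, cmpxchg, barrier" shape could be sketched in user-space C roughly as follows. This is only an illustration under assumptions: the function name cmpxchg_full_fence is hypothetical, it uses GCC's __atomic builtins rather than the kernel's primitives, and it is not the code used by the kernel or by HotSpot.

```c
#include <stdbool.h>

/* Hypothetical helper (not kernel or HotSpot code): a relaxed
 * compare-and-swap bracketed by two sequentially consistent fences,
 * mirroring the smp_mb(); __cmpxchg(); smp_mb(); shape. */
long cmpxchg_full_fence(long *ptr, long old, long new_val)
{
    long expected = old;

    __atomic_thread_fence(__ATOMIC_SEQ_CST);       /* leading full barrier */
    __atomic_compare_exchange_n(ptr, &expected, new_val,
                                false,             /* strong CAS, no spurious failure */
                                __ATOMIC_RELAXED,  /* ordering on success */
                                __ATOMIC_RELAXED); /* ordering on failure */
    __atomic_thread_fence(__ATOMIC_SEQ_CST);       /* trailing full barrier */

    /* On failure the builtin stores the observed value in `expected`;
     * on success `expected` still equals `old`. Either way this is the
     * value that was found in *ptr. */
    return expected;
}
```

The return convention follows __sync_val_compare_and_swap: the caller gets back the value found at *ptr and can compare it with old to see whether the swap happened.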

However, when I checked the GCC intrinsic with the following simple test case, I found no barrier around the load-acquire/store-release pair.
I am a little confused now. Which instruction sequence should be used to simulate the x86 cmpxchg?

long foo (long *ptr, long old, long new)
{
    return __sync_val_compare_and_swap (ptr, old, new);
}
Assembly:
        ldaxr   x3, [x0]        // 22   aarch64_load_exclusivedi        [length = 4]
        cmp     x3, x1  // 23   *cmpdi/1        [length = 4]
        bne     .L3     // 24   *condjump       [length = 4]
        stlxr   w4, x2, [x0]    // 25   aarch64_store_exclusivedi       [length = 4]
        cbnz    w4, .L2 // 26   *cbnesi1        [length = 4]
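
One way to explore this further might be to compile the legacy __sync builtin next to the newer __atomic builtin and compare the generated code. The sketch below assumes GCC's __atomic_compare_exchange_n with sequentially consistent ordering; the function names foo_sync and foo_atomic are hypothetical, and I make no claim here about what each one compiles to.

```c
/* Same test case as above, using the legacy __sync builtin. */
long foo_sync(long *ptr, long old, long new_val)
{
    return __sync_val_compare_and_swap(ptr, old, new_val);
}

/* Hypothetical variant for comparison: the __atomic builtin with
 * sequentially consistent ordering on both success and failure. */
long foo_atomic(long *ptr, long old, long new_val)
{
    long expected = old;

    __atomic_compare_exchange_n(ptr, &expected, new_val,
                                0,                 /* strong CAS */
                                __ATOMIC_SEQ_CST,  /* ordering on success */
                                __ATOMIC_SEQ_CST); /* ordering on failure */

    /* `expected` now holds the value that was observed in *ptr. */
    return expected;
}
```

Building this with gcc -O2 -S for an aarch64 target and reading the two function bodies side by side should show whether the builtins differ in the barriers they emit.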

Regards!
wei