Single byte Atomic::cmpxchg implementation

Erik Österlund erik.osterlund at lnu.se
Wed Sep 10 15:14:29 UTC 2014


Hi David,

Thank you for your reply. With ARM off the table, then, I had a look at the PPC/AIX implementation instead. It looks like the two nested loops there can be flattened into a single loop.
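
To illustrate what I mean by flattening, here is a minimal sketch of the single-loop structure (my own illustration, not the actual patch), using GCC's __sync_val_compare_and_swap as a stand-in for the word-sized lwarx/stwcx. sequence. Note the byte-lane arithmetic assumes little-endian numbering for brevity; big-endian PPC would flip the shift.

#include <stdint.h>

static inline int8_t byte_cas_one_loop(int8_t exchange_value,
                                       volatile int8_t* dest,
                                       int8_t compare_value) {
  // Align down to the containing 32-bit word and locate the byte lane.
  volatile uint32_t* word = (volatile uint32_t*)((uintptr_t)dest & ~(uintptr_t)3);
  const unsigned shift = 8u * (unsigned)((uintptr_t)dest & 3u); // LE lane; BE would use 3 - offset
  const uint32_t mask  = (uint32_t)0xFF << shift;

  uint32_t old_word = *word;
  for (;;) {
    int8_t old_byte = (int8_t)((old_word & mask) >> shift);
    if (old_byte != compare_value)
      return old_byte;                      // byte already differs: fail
    uint32_t new_word = (old_word & ~mask)
                      | ((uint32_t)(uint8_t)exchange_value << shift);
    uint32_t prev = __sync_val_compare_and_swap(word, old_word, new_word);
    if (prev == old_word)
      return compare_value;                 // success
    old_word = prev;                        // word changed under us: retry in the same loop
  }
}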

On a more crucial note, I was very surprised to see that two full sync instructions are emitted in the CAS. I believe these fences are over-conservative in the current implementation: I would like to replace the sync used as the write fence with lwsync, and the sync used as the read fence with isync. Is there any good reason why it was implemented in this (in my opinion) over-conservative and expensive way?

inline jbyte Atomic::cmpxchg(jbyte exchange_value, volatile jbyte* dest, jbyte compare_value) {
  unsigned int old_value;
  const uint64_t zero = 0;

  __asm__ __volatile__ (
    /* fence */
    strasm_sync					// <------------------ write fence (release) should be strasm_lwsync instead as we don't care about ordering of device memory
    /* simple guard */
    "   lwz     %[old_value], 0(%[dest])                \n"
    "   cmpw    %[compare_value], %[old_value]          \n"
    "   bne-    2f                                      \n"
    /* atomic loop */
    "1:                                                 \n"
    "   lwarx   %[old_value], %[dest], %[zero]          \n"
    "   cmpw    %[compare_value], %[old_value]          \n"
    "   bne-    2f                                      \n"
    "   stwcx.  %[exchange_value], %[dest], %[zero]     \n"
    "   bne-    1b                                      \n"
    /* acquire */
    strasm_sync					// <------------------ read fence (acquire) should be strasm_isync instead as we don't care about ordering of device memory
    /* exit */
    "2:                                                 \n"
    /* out */
    : [old_value]       "=&r"   (old_value),
                        "=m"    (*dest)
    /* in */
    : [dest]            "b"     (dest),
      [zero]            "r"     (zero),
      [compare_value]   "r"     (compare_value),
      [exchange_value]  "r"     (exchange_value),
                        "m"     (*dest)
    /* clobber */
    : "cc",
      "memory"
    );

  return (jbyte) old_value;
}
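
Concretely, the change is just the two fence macros. Here is a standalone sketch of the variant I have in mind (illustration only; in the actual code it would simply be strasm_lwsync and strasm_isync in place of the two strasm_sync occurrences, assuming those macros expand to the plain instructions of the same name):

inline int cmpxchg_lwsync_isync(int exchange_value, volatile int* dest, int compare_value) {
  int old_value;
  __asm__ __volatile__ (
    /* release: lwsync orders all prior cacheable loads/stores, unlike the full sync */
    "   lwsync                                          \n"
    "1: lwarx   %0, 0, %2                               \n"
    "   cmpw    %3, %0                                  \n"
    "   bne-    2f                                      \n"
    "   stwcx.  %1, 0, %2                               \n"
    "   bne-    1b                                      \n"
    /* acquire: isync discards speculative execution past the successful CAS */
    "   isync                                           \n"
    "2:                                                 \n"
    : "=&r" (old_value)
    : "r" (exchange_value), "b" (dest), "r" (compare_value)
    : "cc", "memory");
  return old_value;
}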

If nobody minds, I'd like to change this. :)

/Erik

On 08 Sep 2014, at 04:11, David Holmes <david.holmes at oracle.com> wrote:

> Hi Erik,
> 
> Note there is currently no ARM code in the OpenJDK itself. Of course the Aarch64 project will hopefully be changing that soon, but I would not think they need the logic you describe below.
> 
> Cheers,
> David
> 
> On 6/09/2014 12:03 AM, Erik Österlund wrote:
>> Hi Mikael,
>> 
>> Back from travelling now. I did look into other architectures a bit and made some interesting findings.
>> 
>> The architecture that stands out as the most disastrous to me is ARM. It takes three levels of nested loops to carry out a single byte CAS:
>> 1. An outermost loop that emulates byte-grained CAS using word-sized CAS.
>> 2. A middle loop that calls __kernel_cmpxchg, which is optimized for non-SMP systems using OS support but falls back to an LL/SC loop on SMP systems. Unfortunately it returns a boolean (success/failure) rather than the value at the destination, so this loop has to keep track of the actual destination value that the Atomic::cmpxchg interface requires (sketched below).
>> 3. __kernel_cmpxchg itself, which implements CAS on SMP systems using LL/SC (ldrex/strex). Since a context switch can break the reservation in the middle, a loop retries the operation in that spuriously-failing scenario.
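>> 
>> To make loop 2 concrete, here is roughly its shape (a sketch from memory, not the actual hotspot code; the fixed helper address and zero-on-success convention are how I understand the ARM Linux kuser helper page to work):
>> 
>> typedef int (cmpxchg_helper_t)(int oldval, int newval, volatile int* ptr);
>> #define KERNEL_CMPXCHG (*(cmpxchg_helper_t*)0xffff0fc0)  // kuser helper page address
>> 
>> static inline int word_cas_via_helper(int newval, volatile int* ptr, int oldval) {
>>     for (;;) {
>>         int observed = *ptr;             // track the value ourselves; the helper won't return it
>>         if (observed != oldval)
>>             return observed;             // value already differs: fail with what we saw
>>         if (KERNEL_CMPXCHG(oldval, newval, ptr) == 0)
>>             return oldval;               // helper reports success
>>         // helper failed: the value changed or the LL/SC was broken; retry
>>     }
>> }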
>> 
>> I have made a new solution that only makes sense on ARMv6 and above with SMP. The proposed solution has only one loop instead of three; it would be great if somebody could review it:
>> 
>> inline intptr_t __casb_internal(volatile intptr_t *ptr, intptr_t compare, intptr_t new_val) {
>>    intptr_t result, old_tmp;
>> 
>>    // prefetch for writing and barrier
>>    __asm__ __volatile__ ("pld [%0]\n\t"
>>                          "        dmb     sy\n\t" /* maybe we can get away with dsb st here instead for speed? anyone? playing it safe now */
>>                          :
>>                          : "r" (ptr)
>>                          : "memory");
>> 
>>    do {
>>        // spuriously failing CAS loop keeping track of value
>>        __asm__ __volatile__("@ __cmpxchgb\n\t"
>>                     "        ldrexb  %1, [%2]\n\t"
>>                     "        mov     %0, #0\n\t"
>>                     "        teq     %1, %3\n\t"
>>                     "        it      eq\n\t"
>>                     "        strexbeq %0, %4, [%2]\n\t"
>>                     : "=&r" (result), "=&r" (old_tmp)
>>                     : "r" (ptr), "Ir" (compare), "r" (new_val)
>>                     : "memory", "cc");
>>    } while (result);
>> 
>>    // barrier
>>    __asm__ __volatile__ ("dmb sy"
>>                          ::: "memory");
>> 
>>    return old_tmp;
>> }
>> 
>> inline jbyte Atomic::cmpxchg(jbyte exchange_value, volatile jbyte* dest, jbyte compare_value) {
>>    return (jbyte)__casb_internal((volatile intptr_t*)dest, (intptr_t)compare_value, (intptr_t)exchange_value);
>> }
>> 
>> What I'm a bit uncertain about here is which barriers are needed and which are optimal, as this seems to differ between ARM versions; maybe somebody can enlighten me? Also, I'm not sure how hotspot checks the ARM version to make the appropriate decision.
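>> 
>> One option for the version check (an assumption on my part, I don't know which convention hotspot prefers here) would be a compile-time guard on GCC's predefined architecture macros, keeping the __kernel_cmpxchg path as the fallback:
>> 
>> // Sketch: if I read the manuals right, ldrexb/strexb exist from ARMv6K/ARMv7 on
>> // (plain ARMv6 only has the word-sized ldrex/strex), so guard the new path:
>> #if defined(__ARM_ARCH_7A__) || defined(__ARM_ARCH_7__) || defined(__ARM_ARCH_6K__)
>>   #define HAS_BYTE_LDREX 1
>> #else
>>   #define HAS_BYTE_LDREX 0   // fall back to the __kernel_cmpxchg-based emulation
>> #endif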
>> 
>> The proposed x86 implementation is much more straightforward (bsd, linux):
>> 
>> inline jbyte Atomic::cmpxchg(jbyte exchange_value, volatile jbyte* dest, jbyte compare_value) {
>>  int mp = os::is_MP();
>>  jbyte result;
>>  __asm__ volatile (LOCK_IF_MP(%4) "cmpxchgb %1,(%3)"
>>                    : "=a" (result)
>>                    : "q" (exchange_value), "a" (compare_value), "r" (dest), "r" (mp)
>>                    : "cc", "memory");
>>  return result;
>> }
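>> 
>> For reference, so the snippet reads on its own: LOCK_IF_MP is the existing macro in the x86 atomics headers that emits the lock prefix only when mp is nonzero; if I remember it right, it is defined along these lines:
>> 
>> // From memory -- the actual definition lives in atomic_linux_x86.inline.hpp:
>> #define LOCK_IF_MP(mp) "cmp $0, " #mp "; je 1f; lock; 1: "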
>> 
>> Unfortunately the code is spread out over a billion files because of different ABIs and compiler support for the different OS variants: some use generated stubs, some use ASM files, some use inline assembly. I think I fixed all of them, but I need your help to build and verify it, if you don't mind, as I don't have access to those platforms. How do we best do this?
>> 
>> As for SPARC, I unfortunately decided to keep the old implementation, as SPARC does not seem to support byte-wide CAS. I only found the cas and casx instructions (word- and doubleword-sized), which are not sufficient as far as I can tell; corrections welcome if I'm wrong. If I'm right, byte-wide CAS on SPARC goes on my Christmas wish list.
>> 
>> Is there any other platform/architecture of interest on your wish list I should investigate which is important to you? PPC?
>> 
>> /Erik
>> 
>> On 04 Sep 2014, at 11:20, Mikael Gerdin <mikael.gerdin at oracle.com> wrote:
>> 
>>> Hi Erik,
>>> 
>>> On Thursday 04 September 2014 09.05.13 Erik Österlund wrote:
>>>> Hi,
>>>> 
>>>> The implementation of single byte Atomic::cmpxchg on x86 (and all other
>>>> platforms) emulates the single-byte cmpxchgb instruction using a loop of
>>>> jint-sized loads and cmpxchgl, plus code to dynamically align the
>>>> destination address.
>>>> 
>>>> This code is used for GC-code related to remembered sets currently.
>>>> 
>>>> I have the changes on my platform (amd64, bsd) to simply use the native
>>>> cmpxchgb instead, and could provide a patch fixing this unnecessary
>>>> performance glitch for all supported x86 if anybody wants it.
>>> 
>>> I think that sounds good.
>>> Would you mind looking at other cpu arches to see if they provide something
>>> similar? It's ok if you can't build the code for the other arches, I can help
>>> you with that.
>>> 
>>> /Mikael
>>> 
>>>> 
>>>> /Erik
>>> 
>> 


