Single byte Atomic::cmpxchg implementation

Fri Sep 5 14:03:31 UTC 2014

Hi Mikael,

Back from travelling now. I did look into other architectures a bit and made some interesting findings.

The architecture that stands out as the worst to me is ARM. It uses three levels of nested loops to carry out a single-byte CAS:
1. The outermost loop emulates byte-granularity CAS using word-sized CAS.
2. The middle loop calls __kernel_cmpxchg, which is optimized for non-SMP systems using OS support but stays backward compatible by using an LL/SC loop on SMP systems. Unfortunately it returns a boolean (success/failure) rather than the destination value, so the loop has to keep track of the actual value at the destination, as required by the Atomic::cmpxchg interface.
3. __kernel_cmpxchg itself implements CAS on SMP systems using LL/SC (ldrex/strex). Since a context switch can occur between the load and the store, a loop retries the operation whenever it fails spuriously.
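
To make the outermost level concrete, here is a rough sketch of how a byte CAS can be emulated with a word-sized CAS. It is not the actual HotSpot code: the name emulated_byte_cas is made up for illustration, int8_t stands in for jbyte, and the GCC/Clang __atomic builtins stand in for whatever word-sized CAS the platform provides.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not HotSpot code): byte CAS built on a 32-bit CAS.
// Returns the byte found at *dest, as the Atomic::cmpxchg contract requires.
inline int8_t emulated_byte_cas(int8_t exchange_value, volatile int8_t* dest,
                                int8_t compare_value) {
  assert(dest != nullptr);

  // Align the address down to the 32-bit word containing the byte.
  const uintptr_t addr = reinterpret_cast<uintptr_t>(dest);
  volatile uint32_t* word =
      reinterpret_cast<volatile uint32_t*>(addr & ~uintptr_t(3));

  // Locate the byte inside the word; the position depends on endianness.
  const unsigned offset = static_cast<unsigned>(addr & 3);
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
  const unsigned shift = offset * 8;
#else
  const unsigned shift = (3 - offset) * 8;
#endif
  const uint32_t mask = uint32_t(0xFF) << shift;

  uint32_t old_word = __atomic_load_n(word, __ATOMIC_RELAXED);
  for (;;) {
    const uint8_t old_byte = static_cast<uint8_t>((old_word & mask) >> shift);
    if (old_byte != static_cast<uint8_t>(compare_value)) {
      return static_cast<int8_t>(old_byte);  // comparison failed, no store
    }
    const uint32_t new_word =
        (old_word & ~mask) |
        (uint32_t(static_cast<uint8_t>(exchange_value)) << shift);
    // On failure the builtin refreshes old_word with the current contents,
    // so a change to a neighbouring byte simply causes another iteration.
    if (__atomic_compare_exchange_n(word, &old_word, new_word, false,
                                    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)) {
      return static_cast<int8_t>(old_byte);
    }
  }
}
```

Note that the loop can be forced to retry by writes to the other three bytes of the word, which is one reason a native byte-wide CAS is attractive.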

I have made a new solution that only makes sense on ARMv6 and above with SMP. It has only one loop instead of three. It would be great if somebody could review it:

inline intptr_t __casb_internal(volatile intptr_t *ptr, intptr_t compare, intptr_t new_val) {
   intptr_t result, old_tmp;

   // prefetch for writing and barrier
   __asm__ __volatile__ ("pld [%0]\n\t"
                         "        dmb     sy\n\t" /* maybe we can get away with dsb st here instead for speed? anyone? playing it safe now */
                         :
                         : "r" (ptr)
                         : "memory");

   do {
       // spuriously failing CAS loop keeping track of value
       __asm__ __volatile__("@ __cmpxchgb\n\t"
                    "        ldrexb  %1, [%2]\n\t"
                    "        mov     %0, #0\n\t"
                    "        teq     %1, %3\n\t"
                    "        it      eq\n\t"
                    "        strexbeq %0, %4, [%2]\n\t"
                    : "=&r" (result), "=&r" (old_tmp)
                    : "r" (ptr), "Ir" (compare), "r" (new_val)
                    : "memory", "cc");
   } while (result);

   // barrier
   __asm__ __volatile__ ("dmb sy"
                         ::: "memory");

   return old_tmp;
}

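inline jbyte Atomic::cmpxchg(jbyte exchange_value, volatile jbyte* dest, jbyte compare_value) {
   // ldrexb zero-extends the loaded byte, so zero-extend the compare value as well
   return (jbyte)__casb_internal((volatile intptr_t*)dest, (intptr_t)(unsigned char)compare_value, (intptr_t)exchange_value);
}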

What I'm a bit uncertain about here is which barriers we need and which are optimal, as this seems to differ between ARM versions. Maybe somebody can enlighten me? Also, I'm not sure how HotSpot checks the ARM version to make the appropriate decision.

The proposed x86 implementation is much more straightforward (bsd, linux):

inline jbyte Atomic::cmpxchg(jbyte exchange_value, volatile jbyte* dest, jbyte compare_value) {
 int mp = os::is_MP();
 jbyte result;
 __asm__ volatile (LOCK_IF_MP(%4) "cmpxchgb %1,(%3)"
                   : "=a" (result)
                   : "q" (exchange_value), "a" (compare_value), "r" (dest), "r" (mp)
                   : "cc", "memory");
 return result;
}
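
For comparison, the same contract can be sketched without inline assembly using the GCC/Clang __atomic builtins. This is only an illustration under my own assumptions: builtin_byte_cas is a made-up name, int8_t stands in for jbyte, and I would expect (but have not verified on every toolchain) that the compiler emits a locked cmpxchgb for it on x86.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of the patch): byte-wide CAS that returns
// the value observed at *dest, matching the Atomic::cmpxchg convention.
inline int8_t builtin_byte_cas(int8_t exchange_value, volatile int8_t* dest,
                               int8_t compare_value) {
  assert(dest != nullptr);
  int8_t expected = compare_value;
  // On success 'expected' still equals compare_value; on failure the builtin
  // overwrites it with the value actually found at *dest.
  __atomic_compare_exchange_n(dest, &expected, exchange_value,
                              false /* strong CAS */,
                              __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
  return expected;
}
```

Either way the caller detects success by comparing the return value with compare_value.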

Unfortunately the code is spread out across a large number of files because of the different ABIs and compiler support on the different OS variants. Some use generated stubs, some use ASM files, some use inline assembly. I think I fixed all of them, but since I don't have access to those platforms I need your help to build and verify the changes, if you don't mind. How do we best do this?

As for SPARC, I unfortunately decided to keep the old implementation, as SPARC does not seem to support byte-wide CAS. I only found the cas and casx instructions, which are not sufficient as far as I can tell. Please correct me if I'm wrong. Otherwise, add byte-wide CAS on SPARC to my wish list for Christmas.

Is there any other platform or architecture important to you that I should investigate? PPC?

/Erik

On 04 Sep 2014, at 11:20, Mikael Gerdin <mikael.gerdin at oracle.com> wrote:

> Hi Erik,
> 
> On Thursday 04 September 2014 09.05.13 Erik Österlund wrote:
>> Hi,
>> 
>> The implementation of single byte Atomic::cmpxchg on x86 (and all other
>> platforms) emulates the single byte cmpxchgb instruction using a loop of
>> jint-sized load and cmpxchgl and code to dynamically align the destination
>> address.
>> 
>> This code is used for GC-code related to remembered sets currently.
>> 
>> I have the changes on my platform (amd64, bsd) to simply use the native
>> cmpxchgb instead but could provide a patch fixing this unnecessary
>> performance glitch for all supported x86 if anybody wants this?
> 
> I think that sounds good.
> Would you mind looking at other cpu arches to see if they provide something 
> similar? It's ok if you can't build the code for the other arches, I can help 
> you with that.
> 
> /Mikael
> 
>> 
>> /Erik
>