MFENCE vs. LOCK addl

Wed Feb 25 14:36:12 PST 2009

Maybe at one point it was an artifact, but there's actually a real
mfence instruction.  Assembler::mfence() seems like it needs to be
renamed, particularly if it ever stops producing a real mfence.  Things
in Assembler are really just supposed to emit exactly what they say
they are, with no translation or optimization.  MacroAssembler is where
the logic for choosing more optimal patterns should live.
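
To make that split concrete, here is a minimal sketch using a toy byte-buffer assembler.  The class and method names (ToyAssembler, ToyMacroAssembler, full_fence, prefer_locked) are hypothetical and are not HotSpot's actual API.  The low-level class emits exactly the named instruction; the macro layer decides which pattern to use.

```cpp
#include <cstdint>
#include <initializer_list>
#include <vector>

// Hypothetical toy classes for illustration only; not HotSpot's Assembler.
class ToyAssembler {
 public:
  // Emits exactly MFENCE (0F AE F0) and nothing else.
  void mfence() { emit({0x0F, 0xAE, 0xF0}); }

  // Emits exactly "lock addl $0, (%rsp)" (F0 83 04 24 00).
  // Note: this clobbers the condition flags.
  void lock_addl_zero_rsp() { emit({0xF0, 0x83, 0x04, 0x24, 0x00}); }

  const std::vector<std::uint8_t>& code() const { return code_; }

 protected:
  void emit(std::initializer_list<std::uint8_t> bytes) {
    code_.insert(code_.end(), bytes.begin(), bytes.end());
  }

 private:
  std::vector<std::uint8_t> code_;
};

// The macro layer owns the policy: whether a barrier is needed at all,
// and which instruction sequence implements it.
class ToyMacroAssembler : public ToyAssembler {
 public:
  ToyMacroAssembler(bool is_mp, bool prefer_locked)
      : is_mp_(is_mp), prefer_locked_(prefer_locked) {}

  // Full barrier; callers must not depend on flags surviving it.
  void full_fence() {
    if (!is_mp_) return;  // no barrier needed on a uniprocessor
    if (prefer_locked_) {
      lock_addl_zero_rsp();
    } else {
      mfence();
    }
  }

 private:
  bool is_mp_;
  bool prefer_locked_;
};
```

With this shape, callers that want "whatever full barrier is cheapest" call full_fence(), and mfence() keeps meaning the literal instruction.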

tom

On Feb 25, 2009, at 12:53 PM, Paul Hohensee wrote:

> The name 'mfence()' is an artifact.  It means to do the equivalent  
> of OrderAccess::fence().
>
> Paul
>
> Tom Rodriguez wrote:
>> Assembler::mfence is used in places where optimizing it wouldn't  
>> seem to matter to me.  As far as killing the condition flags go I  
>> don't think any piece of code which calls mfence cares.  There are  
>> only about 5 calls which seems easy to audit.  Avoiding push/pop at  
>> all seems much better to me.
>>
>> So if mfence is equivalent in power to any old locked instruction  
>> why is it so much more expensive?  It seems like it either must be  
>> doing something more or it's a really crappy implementation.  Are  
>> we sure the callers don't need the something more part?
>>
>> As an aside, it seems odd that something in the Assembler that's  
>> named after a real instruction would never actually emit that  
>> instruction.  Shouldn't mfence emit mfence and all current callers  
>> call something else which might sometimes emit mfence?
>>
>> tom
>>
>> On Feb 25, 2009, at 12:31 PM, Jiva, Azeem wrote:
>>
>>> John, Paul --
>>> Yeah I had tried that and was in the process of writing that up.   
>>> It is better than an MFENCE and has the added benefit of not  
>>> needing a system with SSE2+.  I still don't have a good JVM case  
>>> but the assembler run shows that it's faster than MFENCE.  My  
>>> naïve change to assembler_x86.cpp:
>>>
>>> void Assembler::mfence() {
>>>   // Memory barriers are only needed on multiprocessors
>>>   if (os::is_MP()) {
>>>     // All usable chips support "locked" instructions which suffice
>>>     // as barriers, and are much faster than the alternative of
>>>     // using cpuid instruction. We use here a xchg which is
>>>     // implicitly locked.
>>>     // This is conveniently otherwise a no-op except for blowing
>>>     // rax (which we save and restore.)
>>>     push(rax);    // Store RAX register
>>>     xchgl(rax, Address(rsp, 0));
>>>     pop(rax);    // Restore RAX register
>>>   }
>>> }
>>>
>>> -- 
>>> Azeem Jiva
>>> AMD Java Labs
>>> T 512.602.0907
>>>
>>>> -----Original Message-----
>>>> From: Paul.Hohensee at Sun.COM [mailto:Paul.Hohensee at Sun.COM]
>>>> Sent: Wednesday, February 25, 2009 2:20 PM
>>>> To: John Rose
>>>> Cc: Jiva, Azeem; hotspot compiler
>>>> Subject: Re: MFENCE vs. LOCK addl
>>>>
>>>> Good idea.  Can you try it, Azeem?
>>>>
>>>> Paul
>>>>
>>>> John Rose wrote:
>>>>> What about XCHG?  It doesn't set flags, and (as a bonus) it implies
>>>>> the effect of a LOCK prefix:
>>>>>   push rax
>>>>>   xchg rax, [rsp]
>>>>>   pop rax
>>>>>
>>>>> -- John
>>>>>
>>>>> On Feb 25, 2009, at 7:05 AM, Jiva, Azeem wrote:
>>>>>
>>>>>> Paul,
>>>>>> Ahh right, I did some experiments with running MFENCE vs. LOCK ADDL
>>>>>> and MFENCE vs. PUSHF/LOCK ADDL/POPF and found that MFENCE is faster
>>>>>> than PUSHF/LOCK ADDL/POPF but not faster than just using the LOCK
>>>>>> ADDL instruction by itself.  A nice optimization would be if the JVM
>>>>>> could detect whether the condition codes need to be saved instead
>>>>>> of always saving them.  This is on AMD hardware, and other systems
>>>>>> might have different performance issues.
>>>>>
>>>
>>>
>>
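
For anyone who wants to repeat the kind of timing comparison described at the bottom of the thread outside the JVM, here is a minimal sketch.  It assumes std::atomic as a stand-in for the raw instructions: on x86 a seq_cst fence is typically lowered to mfence or a locked instruction, and a seq_cst exchange to xchg, but that depends on the compiler, so check the disassembly before trusting any numbers.  The helper names (time_fences, time_exchanges) are hypothetical.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>

// Hypothetical helpers for illustration; not part of HotSpot.

// Issues n sequentially consistent fences; returns elapsed nanoseconds.
inline std::int64_t time_fences(int n) {
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < n; ++i) {
    std::atomic_thread_fence(std::memory_order_seq_cst);
  }
  auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)
      .count();
}

// Issues n sequentially consistent exchanges on slot; returns elapsed
// nanoseconds.  After the call, slot holds n - 1 (for n > 0).
inline std::int64_t time_exchanges(std::atomic<int>& slot, int n) {
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < n; ++i) {
    slot.exchange(i, std::memory_order_seq_cst);
  }
  auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)
      .count();
}
```

Calling each helper with a few million iterations and dividing by the count gives a rough per-operation cost.  As noted in the thread, results are hardware specific, and a single-threaded loop like this says nothing about behavior under contention.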