RFR: 8078743: AARCH64: Extend use of stlr to cater for volatile object stores

Fri Jul 31 13:32:45 UTC 2015

Hi,

On 07/31/2015 11:33 AM, Andrew Dinn wrote:

> On 30/07/15 20:19, Vladimir Kozlov wrote:
>> First, thank you for extensive comments - they help.
> 
> They were a necessity for me as much as anyone else :-)
> 
>> Second, does it really help? I don't see any numbers.
> 
> Hmm, running on prejudice, maybe try science? good idea!
> 
> I will obtain numbers.

That's not easy because AArch64 is a specification, not an
implementation.  Going the route of load acquire/store release may not
help much on some chips, but conversations I've had with ARM
architects tell me that they should be preferred.  In particular,
store release for volatiles means that we can avoid a full fence.
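
To illustrate the point about avoiding a full fence, here is a minimal
C11 sketch. It is not HotSpot code: the names payload, ready, publish,
try_consume and demo are invented for this example, and the sequentially
consistent atomics only stand in for a Java volatile field. On AArch64,
GCC and Clang normally lower the store below to a single stlr and the
load to ldar, with no separate dmb, but that mapping should be checked
against the compiler output rather than taken on trust.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static int payload;        /* ordinary data, written before publication */
static atomic_bool ready;  /* stands in for a Java volatile field (assumption) */

/* Writer: the plain store to payload is ordered before the flag store. */
void publish(int value) {
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_seq_cst);
}

/* Reader: if the flag is seen as set, the payload write is visible too. */
bool try_consume(int *out) {
    if (!atomic_load_explicit(&ready, memory_order_seq_cst)) {
        return false;
    }
    *out = payload;
    return true;
}

/* Single-threaded smoke test; real use would have two threads. */
void demo(void) {
    int v = 0;
    assert(!try_consume(&v));  /* nothing published yet */
    publish(42);
    assert(try_consume(&v));
    assert(v == 42);
}
```

The ordering guarantee comes from the flag store itself, so no
standalone barrier instruction is needed around it.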

Current status: on one out-of-order implementation of AArch64 I see no
difference between "stlr" and "dmb st; str ; dmb ish".  On another,
this time an in-order processor, "stlr" is 40% faster.  This is just
the execution time of a short loop, like this:

.L3:
        ldr     w2, [x1]
        add     w2, w2, 1
        str     w2, [x1]
        stlr    x4, [x3]
        subs    x0, x0, #1
        bne     .L3

versus this:

.L3:
        ldr     w2, [x1]
        add     w2, w2, 1
        str     w2, [x1]
        dmb     st
        str     x4, [x3]
        dmb     ish
        subs    x0, x0, #1
        bne     .L3
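
For anyone who wants to try a similar comparison, the two loops can be
approximated in portable C11. This is only a sketch under assumptions:
counter, flag, loop_stlr, loop_fenced and time_loop are hypothetical
names, and the generated code depends on the compiler. In particular, a
compiler will typically emit "dmb ish" for both fences in the second
variant rather than "dmb st" for the first one, so it only approximates
the hand-written sequence. Inspect the disassembly before drawing
conclusions from any timings.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

static volatile int counter;   /* plain location, like [x1]; volatile keeps the load/add/store */
static _Atomic uint64_t flag;  /* the "volatile field", like [x3] */

/* Variant 1: one ordered store per iteration (typically stlr on AArch64). */
void loop_stlr(uint64_t iters, uint64_t value) {
    assert(iters > 0);  /* do-while mirrors the subs/bne loop shape */
    do {
        counter = counter + 1;
        atomic_store_explicit(&flag, value, memory_order_seq_cst);
    } while (--iters != 0);
}

/* Variant 2: plain store bracketed by two fences. */
void loop_fenced(uint64_t iters, uint64_t value) {
    assert(iters > 0);
    do {
        counter = counter + 1;
        atomic_thread_fence(memory_order_release);
        atomic_store_explicit(&flag, value, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);
    } while (--iters != 0);
}

/* CPU seconds consumed by one run of fn, measured with clock(). */
double time_loop(void (*fn)(uint64_t, uint64_t), uint64_t iters) {
    clock_t t0 = clock();
    fn(iters, 1);
    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Calling time_loop(loop_stlr, n) and time_loop(loop_fenced, n) with the
same large n, several times each, gives a rough per-chip comparison.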

The guidelines from ARM are that we should optimize for the simpler
in-order processors; it won't help the out-of-order parts very much,
but it won't hurt either.

Andrew.