RFR: 8253843: AArch64: Use ishst for storestore barrier

Fri Oct 2 14:24:54 UTC 2020

On 02/10/2020 09:59, Alan Hayward wrote:
>

>>
>> The trouble with using asms for this is that the compiler does not
>> know what is going on inside an asm. If you use an asm with a memory
>> clobber for a StoreStore barrier, the compiler has to treat the asm as
>> a full memory barrier, so it cannot move any loads and stores across
>> the StoreStore. So, paradoxically, you might be making things worse.
>>
>
> Ok, I agree with that. (maybe a comment in the code would be useful here)
>
> At the risk of asking something that's probably been asked before... Why
> then do we have inline asm for other targets? Wouldn't it be better to
> have a common (or OS?) level file for these functions?

There's history here. A lot of history.

Back in the day, there was very poor support for multi-threaded
programs in C++; there was even a famous paper, "Threads Cannot Be
Implemented as a Library", by Hans Boehm. At that time, compiler
authors had to invent nonstandard ways to support memory fences, and
in the very early days (when Java was written) there was nothing but
asms.

Fast forward to recent times, and there was a rewrite of HotSpot's
support for atomic memory ops. At that point the issue of using
compiler builtins rather than asms came up. The objection to builtins
was:

    I agree that on x86, there isn't a whole lot of other things the
    compiler could do with the intrinsics than what we want it to do
    due to the relatively strong memory model of the machine. So this
    might be a possible simplification on x86 gcc/clang targets (but
    still not all x86 targets).

    As for PPC and ARMv7 though, that is not true any longer. For
    example, our conservative memory model is more conservative than
    seq_cst semantics. E.g. it also has "leading sync" semantics
    always guaranteed, which is exploited in our code base and would
    be broken if translated simply as seq_cst. Also, since the fencing
    from the C++ compiler must be compliant with what our code
    generation does, they could end up being incompatible due to
    choice of different fencing conventions. Intrinsic provided
    operations may or may not have leading sync semantics. We can hope
    for it, but we should never rely on it.

    https://mail.openjdk.java.net/pipermail/hotspot-dev/2017-September/028238.html

I didn't agree with this reasoning for AArch64 because it's a new
architecture and the mappings from the ISA to the C++ standard
intrinsics are well defined, if somewhat informally.

Using builtins turned out to have been a good decision for
AArch64. When Neoverse N1 came out with a poorly-performing
implementation of ldx/stx but fast LSE compare-and-swap instructions,
GCC knew how to do the right thing, so we didn't need to change
anything.
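
To illustrate the point, here is a minimal sketch of a compare-and-swap
written with the GCC/Clang __atomic builtin rather than an asm. The
helper name cas_word is hypothetical and is not HotSpot's actual code;
the choice of seq_cst ordering is also just an assumption for the
example. Because the compiler sees the operation, it is free to pick
an exclusive load/store loop or an LSE instruction depending on the
target options.

```cpp
#include <cstdint>

// Hypothetical helper for illustration only, not the HotSpot implementation.
// Atomically replaces *dest with `desired` if *dest equals `expected`.
// Returns true if the swap happened.
inline bool cas_word(int64_t* dest, int64_t expected, int64_t desired) {
  // The builtin writes the observed value back into `expected` on failure;
  // we take `expected` by value, so the caller's copy is left untouched.
  return __atomic_compare_exchange_n(dest, &expected, desired,
                                     /*weak=*/false,
                                     __ATOMIC_SEQ_CST,   // ordering on success
                                     __ATOMIC_SEQ_CST);  // ordering on failure
}
```

The source stays the same whichever instruction sequence the compiler
selects, which is the property described above.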

> Also, it looks like using the C++ atomic functions there seems to be
> no way to generate a dmb ishst.

Seems like it, probably because its semantics in high-level-language
terms are horrid. There's another gem from Hans Boehm:

https://www.hboehm.info/c++mm/no_write_fences.html

It's something of a shame that AArch64 can't give us a release fence
instruction.
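
For reference, the closest thing standard C++ offers is a release
fence. The sketch below is only an illustration: the Message type and
the publish and try_consume names are made up for this example. As
far as I know, GCC and Clang lower the release fence on AArch64 to a
full dmb ish rather than dmb ishst, because a release fence also has
to order earlier loads against later stores.

```cpp
#include <atomic>

// Hypothetical message-passing types and functions, for illustration only.
struct Message {
  int payload = 0;
  std::atomic<bool> ready{false};
};

inline void publish(Message& m, int value) {
  m.payload = value;                                    // plain store
  std::atomic_thread_fence(std::memory_order_release);  // orders prior loads and stores
  m.ready.store(true, std::memory_order_relaxed);       // flag store after the fence
}

inline bool try_consume(Message& m, int& out) {
  if (!m.ready.load(std::memory_order_relaxed)) {
    return false;                                       // nothing published yet
  }
  std::atomic_thread_fence(std::memory_order_acquire);  // pairs with the release fence
  out = m.payload;                                      // safe to read the payload now
  return true;
}
```

A pure StoreStore barrier would be enough for the two stores in
publish, but there is no way to say that with the standard fences.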

> Happy to drop that. But, if it is more complicated than that, then
> is it wrong for other architectures too? Should that whole table be
> removed?

I have no idea. I think it makes more sense to look at the code.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671