From boehm at acm.org  Fri Aug  7 18:03:02 2020
From: boehm at acm.org (Hans Boehm)
Date: Fri, 7 Aug 2020 11:03:02 -0700
Subject: [jmm-dev] Heads up about ARM SVE memory model
Message-ID: <CAPUmR1a5OPBMvmj=RzuBn2x_VJt5zZYsts--jS-HQJjhwdMxXw@mail.gmail.com>

ARM's Scalable Vector Extension has some interesting properties that may be
relevant if and when this group becomes more active again. I just wanted to
make sure that everyone here is aware of its memory model relaxations,
since they might impact future discussions.

The most interesting rules for this group, from the documentation
(which I think you can get to from
https://developer.arm.com/documentation/ddi0584/ah/), are:

1) "An address dependency between two reads generated by SVE vector load
instructions does not contribute to the Dependency-ordered-before relation."
2) "For a given observer, a pair of reads from the same location is not
required to satisfy the internal visibility requirement if at least one of
the reads was generated by an SVE load instruction."

By (1), if we were to use only SVE vector instructions to read final field
f from a reference field x, as in r1 = x; r2 = r1.f, we would not guarantee
sufficient ordering for final field semantics. The r1.f read could be
advanced past the read of x, so you could see an incompletely initialized
final field. (Some of you may remember this relaxation from DEC Alpha,
which had it for ordinary load instructions, not just vector ones.)
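To make the r1 = x; r2 = r1.f pattern concrete, here is a minimal Java
sketch of the source-level idiom that final field semantics are meant to
protect. The names FinalFieldDemo, Holder and readerNeverSeesDefault are
hypothetical and chosen only for illustration. The sketch says nothing
about which instructions a JIT actually emits; it only shows the racy
publication whose dependent load would lose its ordering if both loads
were compiled to SVE vector loads.

```java
class FinalFieldDemo {
    // Hypothetical holder class; f plays the role of the final field.
    static final class Holder {
        final int f;
        Holder(int f) { this.f = f; }
    }

    // Plain (non-volatile) reference field: the object is published by a data race.
    static Holder x;

    // Returns true if every read of r1.f through a non-null r1 saw the
    // constructor-initialized value. The JMM final field rules require this.
    static boolean readerNeverSeesDefault(int expected, int attempts)
            throws InterruptedException {
        x = null;
        Thread publisher = new Thread(() -> { x = new Holder(expected); });
        publisher.start();
        boolean ok = true;
        // Bounded loop, so termination does not depend on ever seeing the store.
        for (int i = 0; i < attempts; i++) {
            Holder r1 = x;          // r1 = x
            if (r1 != null) {
                int r2 = r1.f;      // r2 = r1.f, address-dependent on r1
                if (r2 != expected) ok = false;
            }
        }
        publisher.join();
        return ok;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(readerNeverSeesDefault(42, 1_000_000));
    }
}
```

On hardware without the address dependency guarantee, an implementation
would need some other means (for example a fence, or keeping at least one
of the two loads scalar) to keep this method returning true.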

(1) doesn't seem to be a real problem for current implementations. It's
sufficient for implementers to ensure that the load of a reference and a
dependent final field are not BOTH performed by vector instructions. I
think this constraint is generally satisfied due to insufficiently
aggressive vectorization. But it is something for implementers to keep in
mind. The constraint would become more restrictive if the ordering
guarantees were extended to non-final fields (which I also still dislike
for other, more fundamental, reasons).

(2) basically relaxes the "cache coherence" property for vector loads:
Accesses to a single location no longer need to appear to occur in a single
total order. We don't guarantee that in Java anyway, except for opaque or
stronger accesses. We normally fail to guarantee that to accommodate
compiler optimizations like CSE, and to accommodate some more obscure
hardware; SVE adds a justification that applies to common hardware. Opaque
and stronger accesses should either not be vectorized, or they may need
added fencing. (The corresponding implication for C++ memory_order_relaxed
probably has more impact.)
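As an illustration of the per-location coherence that opaque accesses are
expected to keep, here is a small sketch using the standard
java.lang.invoke.VarHandle API (getOpaque/setOpaque). The class name
OpaqueCoherenceDemo and the method monotonicUnderOpaque are hypothetical.
A writer stores increasing values while a reader performs pairs of opaque
reads of the same location; with coherence, a later read never returns an
older value than an earlier one.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class OpaqueCoherenceDemo {
    private int value;   // accessed only through the VarHandle below

    private static final VarHandle VALUE;
    static {
        try {
            VALUE = MethodHandles.lookup()
                    .findVarHandle(OpaqueCoherenceDemo.class, "value", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    void setOpaque(int v) { VALUE.setOpaque(this, v); }
    int getOpaque() { return (int) VALUE.getOpaque(this); }

    // Returns true if successive opaque reads of one location never go backwards.
    static boolean monotonicUnderOpaque(int writes) throws InterruptedException {
        OpaqueCoherenceDemo d = new OpaqueCoherenceDemo();
        Thread writer = new Thread(() -> {
            for (int i = 1; i <= writes; i++) d.setOpaque(i);
        });
        writer.start();
        boolean ok = true;
        int last = 0;
        while (last < writes) {
            int a = d.getOpaque();   // first read of the location
            int b = d.getOpaque();   // second read of the same location
            if (a < last || b < a) ok = false;   // would be a coherence violation
            last = b;
        }
        writer.join();
        return ok;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(monotonicUnderOpaque(1_000_000));
    }
}
```

If the two reads were plain field accesses instead, Java would not promise
this outcome, which is the case rule (2) lines up with. For the opaque
version, an implementation targeting SVE has to keep the property, either
by not vectorizing these loads or by adding fencing.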

Artem Serov and Jade Alglave helped with clarifying this. But any mistakes
here are no doubt mine.