From boehm at acm.org Fri Aug 7 18:03:02 2020
From: boehm at acm.org (Hans Boehm)
Date: Fri, 7 Aug 2020 11:03:02 -0700
Subject: [jmm-dev] Heads up about ARM SVE memory model
Message-ID:

ARM's Scalable Vector Extension has some interesting properties that may be relevant if and when this group becomes more active again. I just wanted to make sure that everyone here is aware of its memory model relaxations, since they might impact future discussions.

The most interesting rules for this group from the documentation (which I think you can get to from https://developer.arm.com/documentation/ddi0584/ah/) are:

1) "An address dependency between two reads generated by SVE vector load instructions does not contribute to the Dependency-ordered-before relation."

2) "For a given observer, a pair of reads from the same location is not required to satisfy the internal visibility requirement if at least one of the reads was generated by an SVE load instruction."

By (1), if we were to use only SVE vector instructions to read final field f from a reference field x, as in r1 = x; r2 = r1.f, we would not guarantee sufficient ordering for final field semantics. The read of r1.f could be performed ahead of the read of x, so you could see an incompletely initialized final field. (Some of you may remember this relaxation from DEC Alpha, which had it for ordinary load instructions, not just vector ones.)

(1) doesn't seem to be a real problem for current implementations. It's sufficient for implementers to ensure that the load of a reference and the load of a dependent final field are not BOTH performed by vector instructions. I think this constraint is generally satisfied today because vectorization is not aggressive enough to violate it. But it is something for implementers to keep in mind. The constraint would become stronger if the ordering guarantees were extended to non-final fields (which I also still dislike for other, more fundamental, reasons).

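To make the r1 = x; r2 = r1.f pattern from (1) concrete, here is a minimal Java sketch of a final field published through a data race. The names FinalFieldSketch, Box, and raceOnce are hypothetical and only for illustration. The reader's load of r1.f is address-dependent on its load of x, which is the ordering an implementation must not lose by performing both loads with SVE vector instructions.

```java
// Hypothetical names throughout; a sketch of the access pattern, not JVM code.
class FinalFieldSketch {
    static final class Box {
        final int f;             // final field, covered by final field semantics
        Box(int v) { f = v; }
    }

    static Box x;                // plain reference field, published through a data race

    /** Returns -1 if the reader ran before publication, otherwise the value read from f. */
    static int raceOnce() {
        x = null;
        int[] seen = { -1 };
        Thread writer = new Thread(() -> x = new Box(42));
        Thread reader = new Thread(() -> {
            Box r1 = x;          // r1 = x
            if (r1 != null) {
                seen[0] = r1.f;  // r2 = r1.f, address-dependent on the load of x
            }
        });
        writer.start();
        reader.start();
        try {
            writer.join();
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        // Final field semantics allow -1 or 42 here, never the default value 0.
        System.out.println(raceOnce());
    }
}
```

Running this will not demonstrate the relaxation on any particular machine; it only shows which two loads the constraint is about.
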
(2) basically relaxes the "cache coherence" property for vector loads: accesses to a single location no longer need to appear to occur in a single total order. We don't guarantee that in Java anyway, except for opaque or stronger accesses. We normally fail to guarantee it in order to accommodate compiler optimizations like CSE, and to accommodate some more obscure hardware; SVE adds a justification based on common hardware. Opaque and stronger accesses should either not be vectorized, or they may need added fencing. (The corresponding implication for C++ memory_order_relaxed probably has more impact.)

Artem Serov and Jade Alglave helped with clarifying this. But any mistakes here are no doubt mine.
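
As an illustration of the per-location coherence that opaque accesses are expected to keep, here is a minimal Java sketch using AtomicInteger.getOpaque and setOpaque (available since Java 9). The class OpaqueCoherenceSketch and its method are hypothetical names. With a single writer storing strictly increasing values, two back-to-back opaque reads should never appear to go backwards; for plain reads of an ordinary field, Java makes no such promise.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical names; a sketch of the coherence property, not a hardware test.
class OpaqueCoherenceSketch {
    static final AtomicInteger counter = new AtomicInteger();

    /** Returns true if no pair of consecutive opaque reads observed a decrease. */
    static boolean readsNeverGoBackwards(int writes) {
        counter.setOpaque(0);
        boolean[] ok = { true };
        Thread writer = new Thread(() -> {
            for (int i = 1; i <= writes; i++) {
                counter.setOpaque(i);        // single writer, strictly increasing values
            }
        });
        Thread reader = new Thread(() -> {
            int first;
            int second;
            do {
                first = counter.getOpaque();
                second = counter.getOpaque(); // same location, later in program order
                if (second < first) {
                    ok[0] = false;            // would be a coherence violation
                }
            } while (second < writes);        // stop once the final value is seen
        });
        writer.start();
        reader.start();
        try {
            writer.join();
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return ok[0];
    }

    public static void main(String[] args) {
        System.out.println(readsNeverGoBackwards(1_000_000));
    }
}
```

A JIT that vectorized the two getOpaque calls into SVE loads without extra fencing would be the kind of implementation that (2) warns about.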