[aarch64-port-dev ] Proposed patch for C2 volatile generation (resent)
Andrew Dinn
adinn at redhat.com
Fri Mar 27 16:25:21 UTC 2015
Resending this because I think the original had a problem:
Attached, for review, is a patch which -- I believe -- fixes generation
of code for volatile reads and writes while at the same time retaining
the existing code generation strategy for all other barrier output.
I say 'I believe' because there is one corner case which I think might
break things but I am not really sure whether this corner case can
actually arise. Details are provided as comments in the code but I'll
provide a summary here too.
The status quo (volatile reads)
-------------------------------
The current volatile generation code successfully translates a
LoadX(mo_acquire)
MemBarAcquire
sequence either to
ldr<x>
dmb ishld
or to
ldar<x>
My reading of the C2 shared code indicates that both these translations,
in particular the second one, are always valid. We will only ever see
the indicated node sequence as a result of a volatile read. There is
never a case where a MemBarAcquire can directly follow an (acquire)
ordered LoadX without it being generated from a volatile read. That's
true even in the presence of optimizations.
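For concreteness, here is a minimal Java sketch of the kind of source
that gives rise to this node sequence (the class and field names are
mine, purely for illustration):

class VolatileReadExample {
    volatile int flag;

    int readFlag() {
        // C2 plants a LoadI(mo_acquire) feeding a MemBarAcquire for this
        // read; on AArch64 that becomes either ldr + dmb ishld or a
        // single ldar, depending on UseBarriersForVolatile.
        return flag;
    }
}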
So, the current code essentially has two sets of rules selected
according to whether UseBarriersForVolatile == true/false.
The first set (UseBarriersForVolatile == true) translates all LoadX
operations to a normal load and all MemBarAcquire operations to dmb.
The second set (UseBarriersForVolatile == false) translates
LoadX(mo_acquire) operations to an acquiring load when they are followed
by a MemBarAcquire and elides generation of the dmb. A
LoadX(mo_unordered) is translated to a normal load.
Actually, it goes further than that because the second set of rules also
copes with the following sequence
DecodeNarrowPtr (LoadX(mo_acquire))
MemBarAcquire
which is generated in the case where a compressed oop is loaded.
n.b. I'm glossing over some of the details here -- there is a specific
order of input/output linkage required for these patterns (similarly in
later cases but the patch is clear about what is needed).
n.b.b. for reasons of consistency with other cases I have renamed the
predicate used for the 2nd set of rules to unnecessary_membar_acquire.
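To illustrate the compressed oop variant, a similar sketch (again with
invented names) using a volatile reference field; with
-XX:+UseCompressedOops the loaded narrow oop has to be widened, which is
where the DecodeNarrowPtr node comes from:

class OopReadExample {
    volatile Object ref;

    Object readRef() {
        // With compressed oops C2 loads a narrow oop and decodes it, so
        // the subgraph is DecodeNarrowPtr (LoadN(mo_acquire)) feeding a
        // MemBarAcquire; the second rule set still matches and emits a
        // single ldar.
        return ref;
    }
}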
The status quo (volatile writes)
--------------------------------
Currently, handling of volatile writes is not so healthy. A volatile
write is /normally/ translated to a node sequence like the following
MemBarRelease
StoreX(mo_release)
MemBarVolatile
This is the node graph which results both when the store occurs because
of a direct volatile write and when it is the result of an inlined
unsafe access to a volatile field.
The current rules translate that sequence to
dmb ish
str<x>
dmb ish
i.e. they add a redundant dmb after the store. This is achieved by
translating all MemBarRelease nodes to a dmb ish, all MemBarVolatile
nodes to a dmb ish and all StoreX nodes to a normal (non-releasing)
store. Clearly, for a volatile store the trailing dmb is redundant.
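For illustration, the basic case corresponds to Java source like the
following (class and method names are mine):

class VolatileWriteExample {
    volatile int flag;

    void publish() {
        // C2 plants MemBarRelease / StoreI(mo_release) / MemBarVolatile
        // for this write; the current rules emit dmb ish + str + dmb ish,
        // where the trailing dmb ish is the redundant barrier discussed
        // above.
        flag = 1;
    }
}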
n.b. I'll get back to that /normally/ later once we have dealt with the
basic case -- it's a wrinkle but not an intractable one.
The reason for this appears to be a worry that volatile stores are not
the only source of MemBarRelease and MemBarVolatile nodes i.e. that this
sequence may not be guaranteed to be present only because of a volatile
store. For example, LibraryCallKit::inline_unsafe_access plants
MemBarRelease and MemBarVolatile nodes into the node graph as do some
other routines (e.g. exit node processing in parse1.cpp). Another
potential worry is that the compiler might do optimizations which
involve moving or eliding MemBar nodes.
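As an aside, an unsafe volatile put of the sort that reaches
LibraryCallKit::inline_unsafe_access looks something like the following
hedged sketch (the reflective lookup of Unsafe is the usual boilerplate;
names are mine):

import java.lang.reflect.Field;
import sun.misc.Unsafe;

class UnsafePutExample {
    private static final Unsafe U;
    private static final long COUNT_OFFSET;

    volatile int count;

    static {
        try {
            // Grab the Unsafe singleton reflectively.
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            U = (Unsafe) f.get(null);
            COUNT_OFFSET = U.objectFieldOffset(
                UnsafePutExample.class.getDeclaredField("count"));
        } catch (Exception e) {
            throw new Error(e);
        }
    }

    void update(int v) {
        // When C2 inlines this intrinsic it plants the same
        // MemBarRelease / StoreI(mo_release) / MemBarVolatile subgraph
        // as a direct volatile write.
        U.putIntVolatile(this, COUNT_OFFSET, v);
    }
}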
The proposed fix (pt 1)
-----------------------
Even if the above concern is justified, when the volatile write node
pattern is seen there can be no harm in translating the first two nodes
into a stlr<x>. If the above translation is valid then the following
will be semantically equivalent and hence also valid:
stlr<x>
dmb ish
So, just as with the volatile read case, it is quite straightforward to
make the current rules apply only when UseBarriersForVolatile == true
and to provide alternative rules, enabled when UseBarriersForVolatile ==
false, which, when they detect this pattern of nodes, plant an stlr<x>
and elide the dmb. Predicate unnecessary_membar_release provided in the
patch supports this transformation when UseBarriersForVolatile == false.
The proposed fix (pt 2)
-----------------------
This still leaves the tricky question as to whether and when it is
possible to elide the trailing dmb. n.b. notice that whatever the
setting for UseBarriersForVolatile we would always like to do this
elision (assuming it is legitimate). We really only want
stlr<x>
or
dmb ish
str<x>
I believe it is always possible to elide the dmb when the above pattern
is seen apart from one corner case explained below. From my reading of
the C2 code we will never see this pattern in any of the other cases
where MemBar nodes are introduced/optimized. The predicate
unnecessary_membar_volatile provided in the patch is used to elide dmbs
arising from MemBarVolatile nodes /irrespective/ of the setting of
UseBarriersForVolatile.
The corner case
---------------
The corner case occurs in method LibraryCallKit::inline_unsafe_fence when
inlining a call to Unsafe.fullFence. The inline subgraph for this call
is a bare MemBarVolatile. This in itself is not a challenge -- a
MemBarVolatile does not constitute the pattern we want to optimize. Nor
indeed can it ever be appended to a pair of MemBarRelease
StoreX(mo_release) nodes to complete the target pattern, since we never
see a bare pair of nodes in that form. However, it might
cause a problem if the inlined Unsafe.fullFence directly follows a
volatile store (whether directly planted or inlined).
Imagine we have a method which has a volatile field write followed by a
call to Unsafe.fullFence. We might expect to see an inlined node
sequence generated like this:
MemBarRelease
StoreX(mo_release)
MemBarVolatile
MemBarVolatile
The compiler will optimize this to
MemBarRelease
StoreX(mo_release)
MemBarVolatile
If we omit the trailing dmb we still retain the correct semantics for
the volatile store -- it cannot be seen before any prior stores.
However, with no trailing dmb we do not get the behaviour required by
the Unsafe.fullFence; we cannot be sure that a subsequent store will not
become visible before the releasing store.
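In source terms the worrying case would look something like this
(hypothetical; names are mine and I have seen no such code in the wild):

import java.lang.reflect.Field;
import sun.misc.Unsafe;

class FenceAfterVolatileWrite {
    private static final Unsafe U;

    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            U = (Unsafe) f.get(null);
        } catch (Exception e) {
            throw new Error(e);
        }
    }

    volatile int flag;
    int data;

    void publishThenFence() {
        flag = 1;      // MemBarRelease / StoreI(mo_release) / MemBarVolatile
        U.fullFence(); // inlines to a bare MemBarVolatile, which the
                       // optimizer can merge with the preceding one
        data = 2;      // with no trailing dmb this store might become
                       // visible before the releasing store
    }
}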
The wrinkle
-----------
The volatile write pattern has a minor variant which sometimes arises
with an inlined unsafe putX when the value being written is an object.
If the heap_base input to the store is potentially NULL (its type is
lower in the type lattice than TypePtr::NULL_PTR) then the store
normally bracketed by the MemBarRelease and MemBarVolatile is replaced
by an if-then-else node sequence including a StoreP(mo_release) in each
branch albeit with (slightly) differing inputs. The current predicates
do not detect this pattern which means we still generate correct code
but pay the cost of two dmbs when this case arises. We can make the
predicate smarter if we want (if it really hurts performance) but I have
simply ignored this for now.
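The sort of code which triggers the wrinkle is an unsafe put of an
object through a base reference that cannot be proven non-null,
something like this (hypothetical names again):

import java.lang.reflect.Field;
import sun.misc.Unsafe;

class NullableBasePut {
    private static final Unsafe U;

    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            U = (Unsafe) f.get(null);
        } catch (Exception e) {
            throw new Error(e);
        }
    }

    // Because 'base' arrives as an argument C2 cannot prove it non-null,
    // so the store between the MemBarRelease and MemBarVolatile is split
    // into the if-then-else sequence described above, with a
    // StoreP(mo_release) in each branch.
    static void put(Object base, long offset, Object value) {
        U.putObjectVolatile(base, offset, value);
    }
}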
The (in)significance of the corner case?
----------------------------------------
I am not sure whether the corner case inlining pattern will ever
actually happen. Firstly, I don't know if a bare inline of just the
MemBarVolatile will ever occur for a fullFence without also inlining
other code from the Unsafe call (i.e. a checkAccess). Maybe that code
could also be optimized away? I am not sure.
Secondly, even if it is possible for this to happen will it ever
actually happen? Is there any code which relies on using an unsafe
volatile put followed by a call to Unsafe.fullFence? I could find no
actual uses of Unsafe.fullFence anywhere in the hotspot or jdk trees. Of
course, I suppose someone could add a call to Unsafe.fullFence into some
of their own code and then build that code into their own custom
java.base image to provide it with access to Unsafe (or even make Unsafe
visible again) and then run up against this problem. Do we have to worry
about that?
Testing
-------
I have not been able to construct a test to check this specific corner
case. If anyone else can provide pointers as to how to do that I'd be
very grateful. I have run basic tests on the patched code and also run
up NetBeans successfully. I intend next to try it with the jcstress suite.
Your input welcome :-)
----------------------
Meanwhile, I would be interested in all feedback on the proposed patch.
regards,
Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in UK and Wales under Company Registration No. 3798903
Directors: Michael Cunningham (USA), Matt Parson (USA), Charlie Peters
(USA), Michael O'Neill (Ireland)
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 7991.patch
Type: text/x-patch
Size: 14401 bytes
Desc: not available
URL: <http://mail.openjdk.java.net/pipermail/aarch64-port-dev/attachments/20150327/534641d1/7991-0001.patch>