[aarch64-port-dev ] RFR: 8204331: AArch64: fix CAS not embedded in normal graph error

Thu Jun 21 17:49:26 UTC 2018

Hi Andrew,

thanks so much for fixing this!

Also, very nice ASCII drawings and explanations! :-)

As far as I can tell, the patch is good.

I'm wondering why it's worth to go through that effort to verify the
correct shape of CAS and similar nodes, and keep maintaining this code
in light of changes (Shenandoah will be the next headache to fix this).
Do the instruction only match those particular shapes, and if it
changes, it'd throw off the matcher badly?

Thanks,
Roman

> The following patch fixes a problem in the AArch64 port which was caused
> by the introduction of the GC API which delegates responsibility for
> barrier generation to GC code (JDK-8202377)
> 
>   webrev:  http://cr.openjdk.java.net/~adinn/8204331/webrev.00/
> 
>   JIRA: https://bugs.openjdk.java.net/browse/JDK-8204331
> 
> The new G1 barrier assembler in combination with the C2 fence generator
> introduced a few small changes to the the memory flow layout in ideal
> subgraphs arising from (normal and Unsafe) volatile stores and (Unsafe)
> volatile CASes. This affects the results returned by the AARch64
> predicates employed by ad file instructions to suppress generation of
> certain hw memory barriers instructions and, instead, generate acquiring
> loads + releasing stores. The change was caught by asserts in the CAS
> code which detected the new, unexpected graph shape.
> 
> Reviews would be very welcome!
> 
> The Patch:
> 
> The fix involves several small changes to reflect each change introduced
> by the GC code.
> 
> 1) Re-ordering of trailing volatile/cpu order barriers
> The C2 fence generator class reversed the order of the volatile and cpu
> order pair generated at the end of an Unsafe volatile store subgraph
> (from Volatile=>CPUOrder to CPUOrder=>Volatile). This is now uniform
> with the order of the trailing acquire/cpu order barriers
> (CPUOrder=>Acquire).
> 
> The relevant predicates have been updated to expect and check for
> barriers in this order.
> 
> 2) Extra memory flows in CAS graphs. Unsafe CASObject graphs for GCs
> which employ a conditional card mark (G1GC or CMS+USeCondCardMark) may
> now include an extra If=>IfTrue/IfFalse ===> Region+Phi(BotIdx) between
> the card mark memory barrier and the trailing cpu order/acquire
> barriers. The new IfNode models a test of the boolean result returned by
> the CAS operation. It gets wrapped up in the barrier flow because of the
> different order of generation of the GC post barrier and the (SCMemProj)
> memory projection from the CAS itself.
> 
> Previously the CAS and SCMemProj simply fed direct into the trailing
> barriers so there was no need to check the GC post-barrier subgraph. Now
> a CASObject looks more like a volatile StoreN/P, both of them feeding
> their memory flow into the card mark membar. In consequence, testing for
> a CASObject now requires two stages as per a StoreN/P -- matching the
> pair of memory subgraphs from leading to card mark membar and from card
> mark to trailing membar.
> 
> The predicates which traverse the memory flow between leading and
> trailing/card mark membars have been updated so they now look for the
> same pattern in either case, modulo the presence of a Store or a
> CAS+SCMemProj pair. In the case where a card mark member is present this
> new test is now combined with a check on the trailing graph in the CAS
> case, just as with a StoreN/P.
> 
> The predicates which traverse the memory flow between card mark and
> trailing membars  have been updated to search through (potentially) one
> more Phi(BotIdx) link. So, the maximum number of Phis to indirect trough
> has been increased to 4 for G1GC and 1 for ConcMarkSweep+UseCondCardMark.
> 
> Testing:
> 
> This has been tested by running tier1 tests. All the tests which were
> previously failing are now working. There were 5 failures due to
> timeouts which do not appear to relate to this issue.
> 
> The code has also been tested by printing generated code for some simple
> usages of volatile load/store and CAS, checking the generated code by
> eyeball. This was repeated for SerialGC, ParallelGC,
> ConcMarkSweep-UseCondCardMark, ConcMarkSweep+UseCondCardMark and G1GC.
> 
> It would be good to also run further tests (esp jcstress). However,
> since this breakage is stopping testing of other changes and since it
> seems to remedy the problem in common cases at least I suggest
> committing this fix now and running these tests afterwards.
> 
> Additional test specific to this issue?:
> 
> There are currently no proper tests for this code as it was difficult to
> know how to identify what code gets generated. Detection of failures is
> left to asserts in the predicates. That approach has worked in the
> present case but only after a breaking change has been pushed and only
> by detecting one of several breaking changes. So the current regime is
> rather fragile. It would be good to have some jtreg tests to help detect
> breakages like this more quickly and effectively.
> 
> A slowdebug build with an available hsdis-aarch64.so could be used to
> print out generated assembly. Usefully, in debug mode the C2 back end
> will print out block comments indicating both where membars have been
> placed and where they have been elided. If a jtreg test was run in an
> otherjvm with CompileCommand options to compile and print target methods
> then the output could be scanned to look for the relevant membar/membar
> elided comments and for presence or absence of ldar/stlr and ldaxr/stlxr
> instructions. This would be enough to test that basic transformations
> are sound.
> 
> Is it possible to rely on an hsdis library being available in jtreg
> tests? Alternatively, is it possible to make execution of the test
> conditional on the library being present?
> 
> regards,
> 
> 
> Andrew Dinn
> -----------
> Senior Principal Software Engineer
> Red Hat UK Ltd
> Registered in England and Wales under Company Registration No. 03798903
> Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander
>