[aarch64-port-dev ] RFR: 8078743: AARCH64: Extend use of stlr to cater for volatile object stores

Fri Aug 14 09:01:02 UTC 2015

Any chance of getting my updated patch reviewed by 2 people and
sponsored to go into hs-comp?

I now have the follow-on patch for the following issue ready for review:

  8080293: AARCH64: Remove unnecessary dmbs from generated CAS code

So it would be very helpful if this one could be checked. Thanks!

regards,

Andrew Dinn
-----------

On 12/08/15 13:45, Andrew Dinn wrote:
> Hi Vladimir,
> 
> Apologies for the delay in responding to your feedback -- I was
> traveling for a team meeting all of last week.
> 
> Here is a revised webrev which includes all the code changes you suggested
> 
>   http://cr.openjdk.java.net/~adinn/8078743/webrev.04
> 
> Also, as requested I did some testing on the two AArch64 machines to
> which I have access. Does it help? Short answer: yes it is well worth
> doing as it causes no harm on the sort of architecture where you would
> expect no benefit and helps a lot on the sort of architecture where you
> would expect it to help. More details below.
> 
> regards,
> 
> 
> Andrew Dinn
> -----------
> 
> The Tests
> ---------
> 
> I ran some simple tests using the jmh micro-benchmark harness, first
> using the old style dmb based implementation (i.e. passing
> -XX:+UseBarriersForVolatile) and then using the new style stlr-based
> implementation (using -XX:-UseBarriersForVolatile). Each test was run in
> each of the 5 relevant GC configs:
> 
>   +G1GC
>   +CMS +UseCondCardMark
>   +CMS -UseCondCardMark
>   +Par +UseCondCardMark
>   +Par -UseCondCardMark
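
For anyone less familiar with the two code shapes being compared, here is a rough Java-level analogy. It is only a sketch under stated assumptions: the class, field and method names are made up, and the VarHandle fences (Java 9+) merely approximate where a dmb would sit in the barrier-based scheme. It is not code from the webrev.

```java
import java.lang.invoke.VarHandle;

// Illustrative model only; all names here are hypothetical.
class VolatileStoreShapes {
    Object plainField;          // written with explicit fences around the store
    volatile Object volField;   // written with an ordinary volatile store

    // Barrier style, roughly the shape used with -XX:+UseBarriersForVolatile:
    // a plain store bracketed by fences (each fence would be a dmb on AArch64).
    void storeWithBarriers(Object value) {
        VarHandle.releaseFence();   // earlier loads/stores may not move below the store
        plainField = value;
        VarHandle.fullFence();      // the store may not be reordered with later loads
    }

    // Release-store style, roughly the shape used with -XX:-UseBarriersForVolatile:
    // the JIT is free to emit a single stlr for this store on AArch64.
    void storeVolatile(Object value) {
        volField = value;
    }
}
```

Both methods publish the value with volatile-store ordering; the difference being measured is only in how the ordering is enforced.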
> 
> The tests were derived from Aleksey Shipilev's recently posted CAS test,
> tweaked to do volatile stores instead of CASes. Each test employs a
> single thread which repeatedly writes a volatile field
> (AtomicReference<Object>.set). A delay call accompanies each write
> (Blackhole.consumeCPU) with the delay argument varying from 0 to 64. A
> single AtomicReference instance is employed throughout the test.
> 
> Test one always writes null; test two always writes a fixed object; test
> three writes an object newly allocated at each write (example source for
> the null write test is included below). This range of tests allows
> various elements of the write barrier to be omitted at code generation
> time or at run time, depending upon the GC config.
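
The three store variants described above can be sketched without the JMH harness as follows. This is a hedged illustration, not the actual test source: the class and method names are hypothetical, and the real tests wrap each store in a @Benchmark method with a backoff call, as in the null-write example further down.

```java
import java.util.concurrent.atomic.AtomicReference;

// Harness-free sketch of the three volatile store variants; names are hypothetical.
class VolSetVariants {
    // A single AtomicReference instance is reused for every write.
    final AtomicReference<Object> ref = new AtomicReference<>(new Object());

    // Allocated once up front, so the same object is stored on every call.
    final Object fixed = new Object();

    // Test one: always store null.
    void writeNull() {
        ref.set(null);
    }

    // Test two: always store the same pre-allocated object.
    void writeFixed() {
        ref.set(fixed);
    }

    // Test three: store an object newly allocated at each write.
    void writeFresh() {
        ref.set(new Object());
    }
}
```

In the results below, the fixed-object variant presumably corresponds to the "old value write" and the fresh-allocation variant to the "young value write".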
> 
> In each case the result was recorded as the average number of
> nanoseconds per write operation (ns/op). I am afraid I am not in a
> position to give the actual timings on any specific architecture or,
> indeed, name what hardware was used. However, I can give a qualitative
> account of what I found and it pretty much accords with Andrew Haley's
> expectations.
> 
> Main Results
> ------------
> 
>   With the first (O-O-O CPU) implementation of AArch64 there was no
> statistically significant variation in the ns/op.
> 
>   With the other (simple pipeline CPU) implementation for most of the
> test space there was a very significant improvement (decrease) in ns/op
> for the stlr version when compared against the equivalent barrier
> implementation.
> 
> Detailed Results
> ----------------
> 
> The second machine showed some interesting variations in performance
> improvement which are worth mentioning:
> 
>   - in the best case ns/op was cut by 50% (CMS - UseCondCardMark,
> backoff 0, old value write)
> 
>   - at backoff 0 in most cases ns/op was cut by ~30-35% for null/old
> value write and ~15-20% for young value write
> 
>   - at backoff 64 in most cases ns/op was cut by ~5-10% (n.b. this is
> mostly to do with the addition of wasted backoff time -- there was only
> a small decrease in the absolute times)
> 
>   - with most GC configs greatest improvement was with old value write,
> least improvement with young value write
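
To make the backoff 0 versus backoff 64 point concrete, here is the arithmetic with invented numbers. The values below are purely hypothetical inputs chosen for the illustration; they are NOT the measured timings, which I cannot share. The sketch only shows how a similar absolute saving turns into a much smaller percentage once backoff time is added to both sides.

```java
// Arithmetic illustration only. Every number in main() is HYPOTHETICAL.
class BackoffDilution {
    // Percentage reduction in ns/op when going from `before` to `after`.
    static double percentCut(double before, double after) {
        return 100.0 * (before - after) / before;
    }

    public static void main(String[] args) {
        double storeCost = 20.0;    // hypothetical ns/op for the dmb version, backoff 0
        double saving = 7.0;        // hypothetical ns saved by using stlr
        double backoffCost = 70.0;  // hypothetical ns added by consumeCPU(64)

        double cutAtZero = percentCut(storeCost, storeCost - saving);
        double cutAtMax = percentCut(storeCost + backoffCost,
                                     storeCost + backoffCost - saving);

        System.out.println("cut at backoff 0:  " + cutAtZero + " %");
        System.out.println("cut at backoff 64: " + cutAtMax + " %");
    }
}
```

With these made-up inputs the same 7 ns saving is a 35% cut at backoff 0 but under 8% at backoff 64.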
> 
> The above general results did not apply to 2 specific data points:
> 
>   - with CMS + UseCondCardMark no significant percentage change was seen
> for old value writes
> 
>   - with Par + UseCondCardMark no significant percentage change was seen
> for young value writes
> 
> These last 2 results are a bit odd.
> 
>   For both old and young puts CMS + UseCondCardMark requires a dmb ish
> after the stlr to ensure the card read does not float above the volatile
> store. For null puts the dmb gets elided (type info tells the compiler
> no card mark needed). So, the difference here between old and young
> writes is unexpected but must be down to the effect of conditional card
> marking rather than the barriers vs stlr.
> 
>   Par + UseCondCardMark employs no synchronization for the card mark.
> Once again the null write case will not need a card mark but the other
> two cases will. So, once again the disparity in the improvement between
> these two cases is unexpected but must be down to the effect of
> conditional card marking rather than the barriers vs stlr.
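
For reference, the difference between unconditional and conditional card marking discussed above can be modelled in plain Java. This is a simplified model and not HotSpot code: the class, the int "addresses", and the constants (512-byte cards, dirty = 0, clean = -1) are assumptions made for the illustration, and the null check is done at run time here whereas the JIT elides it from type info.

```java
import java.util.Arrays;

// Simplified model of a card table; all names and constants are illustrative assumptions.
class CardTableModel {
    static final int CARD_SHIFT = 9;   // assumed 512-byte cards
    static final byte DIRTY = 0;
    static final byte CLEAN = -1;

    final byte[] cards;
    int cardWrites;                    // number of stores actually made to the card table

    CardTableModel(int heapBytes) {
        cards = new byte[(heapBytes >>> CARD_SHIFT) + 1];
        Arrays.fill(cards, CLEAN);
    }

    // Unconditional mark (-UseCondCardMark): always store to the card.
    void mark(int address) {
        cards[address >>> CARD_SHIFT] = DIRTY;
        cardWrites++;
    }

    // Conditional mark (+UseCondCardMark): read the card, store only if it is not dirty.
    void markConditionally(int address) {
        int index = address >>> CARD_SHIFT;
        if (cards[index] != DIRTY) {
            cards[index] = DIRTY;
            cardWrites++;
        }
    }

    // Post-store barrier for a reference field: a null store needs no card mark.
    void afterReferenceStore(int fieldAddress, Object value, boolean conditional) {
        if (value == null) {
            return;
        }
        if (conditional) {
            markConditionally(fieldAddress);
        } else {
            mark(fieldAddress);
        }
    }
}
```

Since the tests write the same field over and over, the conditional version mostly just reads an already dirty card. It is that card read which, under CMS, has to be kept below the volatile store by the dmb ish.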
> 
> Example Test Class
> -------------------
> 
> package org.openjdk;
> 
> import org.openjdk.jmh.annotations.*;
> import org.openjdk.jmh.infra.Blackhole;
> 
> import java.util.concurrent.TimeUnit;
> import java.util.concurrent.atomic.AtomicReference;
> 
> @Warmup(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
> @Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
> @Fork(3)
> @BenchmarkMode(Mode.AverageTime)
> @OutputTimeUnit(TimeUnit.NANOSECONDS)
> @State(Scope.Benchmark)
> public class VolSetNull {
> 
>     AtomicReference<Object> ref;
> 
>     @Param({"0", "1", "2", "4", "8", "16", "32", "64"})
>     int backoff;
> 
>     @Setup
>     public void setup() {
>         ref = new AtomicReference<>();
>         ref.set(new Object());
>     }
> 
>     @Benchmark
>     public boolean test() {
>         Blackhole.consumeCPU(backoff);
>         ref.set(null);
>         return true;
>     }
> }
> 

-- 
regards,

Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in UK and Wales under Company Registration No. 3798903
Directors: Michael Cunningham (USA), Matt Parson (USA), Charlie Peters
(USA), Michael O'Neill (Ireland)