[aarch64-port-dev ] RFR: 8078743: AARCH64: Extend use of stlr to cater for volatile object stores
Andrew Dinn
adinn at redhat.com
Fri Aug 14 09:01:02 UTC 2015
Any chance of getting my updated patch reviewed by 2 people and
sponsored to go into hs-comp?
I now have the follow-on patch ready for review, for issue
8080293: AARCH64: Remove unnecessary dmbs from generated CAS code
So it would be very helpful if this one could be checked. Thanks!
regards,
Andrew Dinn
-----------
On 12/08/15 13:45, Andrew Dinn wrote:
> Hi Vladimir,
>
> Apologies for the delay in responding to your feedback -- I was
> traveling for a team meeting all of last week.
>
> Here is a revised webrev which includes all the code changes you suggested
>
> http://cr.openjdk.java.net/~adinn/8078743/webrev.04
>
> Also, as requested I did some testing on the two AArch64 machines to
> which I have access. Does it help? Short answer: yes it is well worth
> doing as it causes no harm on the sort of architecture where you would
> expect no benefit and helps a lot on the sort of architecture where you
> would expect it to help. More details below.
>
> regards,
>
>
> Andrew Dinn
> -----------
>
> The Tests
> ---------
>
> I ran some simple tests using the jmh micro-benchmark harness, first
> using the old style dmb based implementation (i.e. passing
> -XX:+UseBarriersForVolatile) and then using the new style stlr-based
> implementation (using -XX:-UseBarriersForVolatile). Each test was run in
> each of the 5 relevant GC configs:
>
> +G1GC
> +CMS +UseCondCardMark
> +CMS -UseCondCardMark
> +Par +UseCondCardMark
> +Par -UseCondCardMark
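
For anyone wanting to repeat the runs, the 2 x 5 matrix above can be written out as JVM argument lines. This is only a sketch: the email gives the GC configs in shorthand, so the full flag spellings here are assumptions (in particular, "Par" is taken to mean -XX:+UseParallelGC), and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class FlagMatrix {
    // Assumed expansions of the shorthand GC configs listed above.
    // "Par" is read as the parallel collector; that is a guess.
    static final String[] GC_CONFIGS = {
        "-XX:+UseG1GC",
        "-XX:+UseConcMarkSweepGC -XX:+UseCondCardMark",
        "-XX:+UseConcMarkSweepGC -XX:-UseCondCardMark",
        "-XX:+UseParallelGC -XX:+UseCondCardMark",
        "-XX:+UseParallelGC -XX:-UseCondCardMark",
    };

    // Old style (dmb-based) first, then new style (stlr-based).
    static final String[] VOLATILE_MODES = {
        "-XX:+UseBarriersForVolatile",
        "-XX:-UseBarriersForVolatile",
    };

    // One argument line per (volatile mode, GC config) pair.
    static List<String> jvmArgLines() {
        List<String> lines = new ArrayList<>();
        for (String mode : VOLATILE_MODES) {
            for (String gc : GC_CONFIGS) {
                lines.add(mode + " " + gc);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : jvmArgLines()) {
            System.out.println(line);
        }
    }
}
```

Each printed line could then be passed to the forked benchmark JVM, for example through the harness's JVM-arguments option.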
>
> The tests were derived from Aleksey Shipilev's recently posted CAS test,
> tweaked to do volatile stores instead of CASes. Each test employs a
> single thread which repeatedly writes a volatile field
> (AtomicReference<Object>.set). A delay call follows each write
> (Blackhole.consumeCPU) with the delay argument varying from 0 to 64. A
> single AtomicReference instance is employed throughout the test.
>
> Test one always writes null; test two always writes a fixed object; test
> three writes an object newly allocated at each write (example source for
> the null write test is included below). This range of tests allows
> various elements of the write barrier to be omitted at generate time or
> run time, depending upon the GC config.
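
The three store shapes described above can be sketched in plain Java as follows. This is not the benchmark source (only the null-write class is reproduced in this mail); the class and method names are hypothetical, and the harness annotations and backoff call are left out so that the snippet stands alone.

```java
import java.util.concurrent.atomic.AtomicReference;

public class StoreVariants {
    // Allocated once and reused; presumably the "old value" case once
    // the object has been promoted out of the young generation.
    static final Object FIXED = new Object();

    // Test one: the stored value is statically known to be null, so the
    // compiler can drop the card mark at code generation time.
    static void storeNull(AtomicReference<Object> ref) {
        ref.set(null);
    }

    // Test two: the same long-lived object is stored every time.
    static void storeFixed(AtomicReference<Object> ref) {
        ref.set(FIXED);
    }

    // Test three: a newly allocated (young) object is stored every time.
    static void storeFresh(AtomicReference<Object> ref) {
        ref.set(new Object());
    }
}
```

AtomicReference.set is a volatile store of a reference, so each method goes through both the volatile store sequence and the GC write barrier that are being measured.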
>
> In each case the result was recorded as the average number of
> nanoseconds per write operation (ns/op). I am afraid I am not in a
> position to give the actual timings on any specific architecture or,
> indeed, name what hardware was used. However, I can give a qualitative
> account of what I found and it pretty much accords with Andrew Haley's
> expectations.
>
> Main Results
> ------------
>
> With the first (O-O-O CPU) implementation of AArch64 there was no
> statistically significant variation in the ns/op.
>
> With the other (simple pipeline CPU) implementation for most of the
> test space there was a very significant improvement (decrease) in ns/op
> for the stlr version when compared against the equivalent barrier
> implementation.
>
> Detailed Results
> ----------------
>
> The second machine showed some interesting variations in performance
> improvement which are worth mentioning:
>
> - in the best case ns/op was cut by 50% (+CMS -UseCondCardMark,
> backoff 0, old value write)
>
> - at backoff 0 in most cases ns/op was cut by ~30-35% for null/old
> value write and ~15-20% for young value write
>
> - at backoff 64 in most cases ns/op was cut by ~5-10% (n.b. this is
> mostly to do with the addition of wasted backoff time -- there was only
> a small decrease in the absolute times)
>
> - with most GC configs greatest improvement was with old value write,
> least improvement with young value write
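
The way a fixed saving per store shrinks as a percentage once backoff time is added can be checked with a little arithmetic. The figures in this sketch are hypothetical and are NOT measurements from these tests; it also assumes the absolute saving stays constant, which is a simplification.

```java
public class PercentCut {
    // Percentage reduction in ns/op going from the dmb-based time
    // to the stlr-based time.
    static double percentCut(double dmbNs, double stlrNs) {
        return 100.0 * (dmbNs - stlrNs) / dmbNs;
    }

    public static void main(String[] args) {
        // Hypothetical store costs in ns, chosen only for illustration.
        double storeDmb = 10.0;
        double storeStlr = 7.0;
        // Hypothetical time spent in the backoff call, in ns.
        double[] delays = {0.0, 50.0};
        for (double delay : delays) {
            double cut = percentCut(storeDmb + delay, storeStlr + delay);
            System.out.printf("delay %.0f ns: %.1f%% cut%n", delay, cut);
        }
    }
}
```

With these made-up numbers the same 3 ns saving is a 30% cut with no delay and a 5% cut once 50 ns of delay is added to both sides.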
>
> The above general results did not apply to 2 specific data points:
>
> - with CMS + UseCondCardMark no significant %ge change was seen for
> old value writes
>
> - with Par + UseCondCardMark no significant %ge change was seen for
> young value writes
>
> These last 2 results are a bit odd.
>
> For both old and young puts CMS + UseCondCardMark requires a dmb ish
> after the stlr to ensure the card read does not float above the volatile
> store. For null puts the dmb gets elided (type info tells the compiler
> no card mark needed). So, the difference here between old and young
> writes is unexpected but must be down to the effect of conditional card
> marking rather than the barriers vs stlr.
>
> Par + UseCondCardMark employs no synchronization for the card mark.
> Once again the null write case will not need a card mark but the other
> two cases will. So, once again the disparity in the improvement between
> these two cases is unexpected but must be down to the effect of
> conditional card marking rather than the barriers vs stlr.
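
To make the difference between the two card mark styles concrete, here is a toy model in Java. It is not HotSpot code: the class, its names and the write counter are invented for illustration, and the 512-byte card size and the clean/dirty byte values are assumed to follow HotSpot's usual conventions.

```java
import java.util.Arrays;

public class CardMarkModel {
    static final int CARD_SHIFT = 9;      // assumed 512-byte cards
    static final byte CLEAN = (byte) -1;  // assumed clean card value
    static final byte DIRTY = 0;          // assumed dirty card value

    final byte[] cards;
    int cardWrites;                       // stores made to the card table

    CardMarkModel(int heapBytes) {
        cards = new byte[(heapBytes >>> CARD_SHIFT) + 1];
        Arrays.fill(cards, CLEAN);
    }

    // -UseCondCardMark: always store to the card, no read needed.
    void markUnconditional(int address) {
        cards[address >>> CARD_SHIFT] = DIRTY;
        cardWrites++;
    }

    // +UseCondCardMark: read the card first and store only when it is
    // not already dirty. The extra read is what must not float above
    // the volatile store in the CMS case discussed above.
    void markConditional(int address) {
        int index = address >>> CARD_SHIFT;
        if (cards[index] != DIRTY) {
            cards[index] = DIRTY;
            cardWrites++;
        }
    }
}
```

Since the test writes the same AtomicReference instance throughout, a conditional mark in this model stores to the card once and afterwards only reads it, whereas the unconditional mark stores on every write.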
>
> Example Test Class
> -------------------
>
> package org.openjdk;
>
> import org.openjdk.jmh.annotations.*;
> import org.openjdk.jmh.infra.Blackhole;
>
> import java.util.concurrent.TimeUnit;
> import java.util.concurrent.atomic.AtomicReference;
>
> @Warmup(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
> @Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
> @Fork(3)
> @BenchmarkMode(Mode.AverageTime)
> @OutputTimeUnit(TimeUnit.NANOSECONDS)
> @State(Scope.Benchmark)
> public class VolSetNull {
>
>     AtomicReference<Object> ref;
>
>     @Param({"0", "1", "2", "4", "8", "16", "32", "64"})
>     int backoff;
>
>     @Setup
>     public void setup() {
>         ref = new AtomicReference<>();
>         ref.set(new Object());
>     }
>
>     @Benchmark
>     public boolean test() {
>         Blackhole.consumeCPU(backoff);
>         ref.set(null);
>         return true;
>     }
> }
>
--
regards,
Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in UK and Wales under Company Registration No. 3798903
Directors: Michael Cunningham (USA), Matt Parson (USA), Charlie Peters
(USA), Michael O'Neill (Ireland)