[aarch64-port-dev] Fwd: Re: RFR: 8078743: AARCH64: Extend use of stlr to cater for volatile object stores
Andrew Dinn
adinn at redhat.com
Thu Aug 13 08:58:02 UTC 2015
Apologies, I replied only to compiler-dev when I should also have posted
to aarch64-port-dev. Could someone other than Vladimir please review
this patch? Also, I will need a sponsor to commit it to hs-comp if/when
it passes review. Thanks!
regards,
Andrew Dinn
-----------
-------- Forwarded Message --------
Subject: Re: RFR: 8078743: AARCH64: Extend use of stlr to cater for
volatile object stores
Date: Wed, 12 Aug 2015 13:45:32 +0100
From: Andrew Dinn <adinn at redhat.com>
To: Vladimir Kozlov <vladimir.kozlov at oracle.com>,
hotspot-compiler-dev at openjdk.java.net
Hi Vladimir,
Apologies for the delay in responding to your feedback -- I was
traveling for a team meeting all of last week.
Here is a revised webrev which includes all the code changes you
suggested:
http://cr.openjdk.java.net/~adinn/8078743/webrev.04
Also, as requested, I did some testing on the two AArch64 machines to
which I have access. Does it help? Short answer: yes, it is well worth
doing, as it causes no harm on the sort of architecture where you would
expect no benefit and helps a lot on the sort of architecture where you
would expect it to help. More details below.
regards,
Andrew Dinn
-----------
The Tests
---------
I ran some simple tests using the JMH micro-benchmark harness, first
using the old-style dmb-based implementation (i.e. passing
-XX:+UseBarriersForVolatile) and then using the new-style stlr-based
implementation (passing -XX:-UseBarriersForVolatile). Each test was run
in each of the 5 relevant GC configs (illustrative command lines follow
the list):
+G1GC
+CMS +UseCondCardMark
+CMS -UseCondCardMark
+Par +UseCondCardMark
+Par -UseCondCardMark
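For concreteness, the runs were driven along these lines. This is
illustrative only: the JMH uber-jar name and benchmark selection shown
here are assumptions, and the CMS/Parallel configs substitute
-XX:+UseConcMarkSweepGC or -XX:+UseParallelGC (plus the
+/-UseCondCardMark setting) for the G1 flag:

  # dmb-based volatile stores
  java -XX:+UseG1GC -XX:+UseBarriersForVolatile -jar benchmarks.jar VolSetNull
  # stlr-based volatile stores
  java -XX:+UseG1GC -XX:-UseBarriersForVolatile -jar benchmarks.jar VolSetNull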
The tests were derived from Aleksey Shipilev's recently posted CAS test,
tweaked to do volatile stores instead of CASes. Each test employs a
single thread which repeatedly writes a volatile field
(AtomicReference<Object>.set). A delay call (Blackhole.consumeCPU)
follows each write, with the delay argument varying from 0 to 64. A
single AtomicReference instance is employed throughout the test.
Test one always writes null; test two always writes a fixed object; test
three writes an object newly allocated at each write (example source for
the null write test is included below, with sketches of the other two
variants after it). This range of tests allows various elements of the
write barrier to be omitted, either at code generation time or at run
time, depending upon the GC config.
In each case the result was recorded as the average number of
nanoseconds per write operation (ns/op). I am afraid I am not in a
position to give the actual timings on any specific architecture or,
indeed, to name what hardware was used. However, I can give a
qualitative account of what I found, and it pretty much accords with
Andrew Haley's expectations.
Main Results
------------
With the first (out-of-order CPU) implementation of AArch64 there was
no statistically significant variation in ns/op.

With the other (simple-pipeline CPU) implementation, for most of the
test space there was a very significant improvement (decrease) in ns/op
for the stlr version when compared against the equivalent barrier
implementation.
Detailed Results
----------------
The second machine showed some interesting variations in the
performance improvement which are worth mentioning:

- in the best case ns/op was cut by 50% (CMS with -UseCondCardMark,
backoff 0, old-value write)
- at backoff 0 in most cases ns/op was cut by ~30-35% for null and
old-value writes and by ~15-20% for young-value writes
- at backoff 64 in most cases ns/op was cut by ~5-10% (n.b. this is
mostly down to the addition of wasted backoff time -- there was only
a small decrease in the absolute times)
- with most GC configs the greatest improvement was with old-value
writes and the least improvement with young-value writes
The above general results did not apply at two specific data points:

- with CMS + UseCondCardMark no significant percentage change was seen
for old-value writes
- with Par + UseCondCardMark no significant percentage change was seen
for young-value writes

These last two results are a bit odd.
For both old and young puts, CMS + UseCondCardMark requires a dmb ish
after the stlr to ensure that the card read does not float above the
volatile store. For null puts the dmb is elided (the type info tells
the compiler that no card mark is needed). So the difference here
between old and young writes is unexpected, but it must be down to the
effect of conditional card marking rather than to barriers vs stlr.
Par + UseCondCardMark employs no synchronization for the card mark.
Once again, the null write case needs no card mark but the other two
cases do. So, once again, the disparity in the improvement between
these two cases is unexpected but must be down to the effect of
conditional card marking rather than to barriers vs stlr.
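As an aside, to make the card-mark interaction concrete, here is a
minimal Java-flavoured sketch of a conditional post-write card mark.
This is an illustration only, not HotSpot source, and the class and
field names are mine; HotSpot does, however, use 512-byte cards and 0
as the dirty-card value.

// Illustration only: the shape of a conditional card mark.
// With -UseCondCardMark the guard is absent and the card is dirtied
// unconditionally; with +UseCondCardMark the card is *read* first,
// and on AArch64 that read must not be reordered above the preceding
// volatile store -- hence the dmb ish after the stlr in the CMS case.
class CardMarkSketch {
    static final int CARD_SHIFT = 9;   // 512-byte cards
    static final byte DIRTY = 0;       // HotSpot's dirty-card value
    final byte[] cardTable = new byte[1 << 20];

    void postWriteBarrier(long fieldAddr) {
        int index = (int) (fieldAddr >>> CARD_SHIFT);
        if (cardTable[index] != DIRTY) {   // the card read in question
            cardTable[index] = DIRTY;
        }
    }
}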
Example Test Class
-------------------
package org.openjdk;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

@Warmup(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(3)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class VolSetNull {
    // a single shared AtomicReference, written by a single thread
    AtomicReference<Object> ref;

    @Param({"0", "1", "2", "4", "8", "16", "32", "64"})
    int backoff;

    @Setup
    public void setup() {
        ref = new AtomicReference<>();
        ref.set(new Object());
    }

    @Benchmark
    public boolean test() {
        Blackhole.consumeCPU(backoff);  // back off before each write
        ref.set(null);                  // the volatile (null) store under test
        return true;
    }
}
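The old-value and young-value tests were not included above; here is a
minimal sketch of how they presumably differ, reconstructed from the
description earlier (the class and method names are mine, not the
originals):

package org.openjdk;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

@Warmup(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(3)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class VolSetOldAndYoung {
    AtomicReference<Object> ref;
    Object oldValue;    // fixed object, old by the time the writes run

    @Param({"0", "1", "2", "4", "8", "16", "32", "64"})
    int backoff;

    @Setup
    public void setup() {
        ref = new AtomicReference<>();
        oldValue = new Object();
        ref.set(oldValue);
    }

    @Benchmark
    public boolean testOld() {
        Blackhole.consumeCPU(backoff);
        ref.set(oldValue);        // test two: always writes a fixed object
        return true;
    }

    @Benchmark
    public boolean testYoung() {
        Blackhole.consumeCPU(backoff);
        ref.set(new Object());    // test three: fresh allocation per write
        return true;
    }
}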