[aarch64-port-dev ] RFR: 8080293: AARCH64: Remove unnecessary dmbs from generated CAS code

Mon Aug 24 14:31:29 UTC 2015

The following webrev against hs-comp head fixes 8080293

  http://cr.openjdk.java.net/~adinn/8080293/webrev.00/

It is a follow-on to the prior volatile object patch

  8078743: AARCH64: Extend use of stlr to cater for volatile object stores
  http://cr.openjdk.java.net/~adinn/8078743/webrev.04/

and requires that previous patch to be applied first.

Testing
-------

The patch is sensitive to GC configuration so it was tested against 5
relevant configs

  G1
  CMS+UseCondCardMark
  CMS-UseCondCardMark
  Par+UseCondCardMark
  Par-UseCondCardMark

The validity of the transformation was verified by:

  generating and eyeballing compiled code for simple test programs
  successfully running a fairly large program (NetBeans)
  generating and eyeballing the HashMap code compiled during a run of
  a fairly large program

The fix was performance tested on 2 implementations of the AArch64
architecture (more details below). On an out-of-order (O-O-O) CPU it
gave no noticeable benefit. On a simple pipeline CPU it gave a very
significant benefit in specific cases.

regards,

Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd

The Test
--------

As with the prior patch, I compared the original and new code generation
strategies by running a jmh test first with -XX:+UseBarriersForVolatile
and then with -XX:-UseBarriersForVolatile. Four different test programs
were run in all 5 GC configs. Each test executed repeated CAS
operations on an object field in a single thread, with a Blackhole
backoff between CASes varying from 0 to 64.
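
For anyone wanting to reproduce the runs, the matrix described above
(5 GC configs, each with barriers on and then off) can be enumerated
roughly as follows. This is only a sketch: the class and method names
are hypothetical, and the mapping from the config labels to HotSpot
flags is an assumption (in particular, "Par" is taken here to mean
-XX:+UseParallelGC), so substitute whatever flags match your setup.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: enumerates the JVM flag sets for the benchmark runs.
class RunMatrix {

    // Assumed mapping of the 5 GC configs to HotSpot flags.
    // "Par" is assumed to mean the parallel collector.
    static final String[][] GC_CONFIGS = {
        {"-XX:+UseG1GC"},
        {"-XX:+UseConcMarkSweepGC", "-XX:+UseCondCardMark"},
        {"-XX:+UseConcMarkSweepGC", "-XX:-UseCondCardMark"},
        {"-XX:+UseParallelGC", "-XX:+UseCondCardMark"},
        {"-XX:+UseParallelGC", "-XX:-UseCondCardMark"},
    };

    // Barrier-based code generation first, then the alternative strategy.
    static final String[] BARRIER_FLAGS = {
        "-XX:+UseBarriersForVolatile",
        "-XX:-UseBarriersForVolatile",
    };

    // Returns one flag list per run: every GC config with each barrier setting.
    static List<List<String>> jvmArgSets() {
        List<List<String>> runs = new ArrayList<>();
        for (String[] gc : GC_CONFIGS) {
            for (String barrier : BARRIER_FLAGS) {
                List<String> args = new ArrayList<>();
                for (String flag : gc) {
                    args.add(flag);
                }
                args.add(barrier);
                runs.add(args);
            }
        }
        return runs;
    }

    public static void main(String[] unused) {
        // Print one line of JVM flags per run.
        for (List<String> args : jvmArgSets()) {
            System.out.println(String.join(" ", args));
        }
    }
}
```

Each printed line could then be passed to the forked benchmark JVM,
once per test program.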

Test one performed a CAS guaranteed to fail; test two performed a
successful CAS from a fixed object to null and then back; test three
performed a successful CAS from a fixed object to another fixed object
and then back; test four performed a successful CAS from a fixed
object to a newly allocated object and then back. The average time per
CAS operation (ns/op) -- actually per 2 CAS operations for the latter 3
tests -- was used as a score.
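
The four CAS patterns can be sketched without the jmh harness as
below. This is an illustration only, not the actual test source: the
class, field and method names are hypothetical, and the way the
failing CAS is arranged (expecting a value the reference never holds)
is an assumption.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, JMH-free illustration of the four CAS patterns.
// Each method assumes ref currently holds FIXED_A and leaves it that way.
class CasPatterns {

    static final Object FIXED_A = new Object();
    static final Object FIXED_B = new Object();

    // Test one: the expected value never matches, so the CAS always fails.
    static boolean failingCas(AtomicReference<Object> ref) {
        return ref.compareAndSet(FIXED_B, FIXED_A);
    }

    // Test two: fixed object -> null, then null -> fixed object.
    static boolean toNullAndBack(AtomicReference<Object> ref) {
        boolean there = ref.compareAndSet(FIXED_A, null);
        boolean back = ref.compareAndSet(null, FIXED_A);
        return there && back;
    }

    // Test three: fixed object -> another fixed object, then back.
    static boolean toFixedAndBack(AtomicReference<Object> ref) {
        boolean there = ref.compareAndSet(FIXED_A, FIXED_B);
        boolean back = ref.compareAndSet(FIXED_B, FIXED_A);
        return there && back;
    }

    // Test four: fixed object -> newly allocated object, then back.
    static boolean toNewAndBack(AtomicReference<Object> ref) {
        Object fresh = new Object();
        boolean there = ref.compareAndSet(FIXED_A, fresh);
        boolean back = ref.compareAndSet(fresh, FIXED_A);
        return there && back;
    }

    public static void main(String[] unused) {
        AtomicReference<Object> ref = new AtomicReference<>(FIXED_A);
        System.out.println("fail:  " + failingCas(ref));
        System.out.println("null:  " + toNullAndBack(ref));
        System.out.println("fixed: " + toFixedAndBack(ref));
        System.out.println("new:   " + toNewAndBack(ref));
    }
}
```

Tests two to four each perform two successful CASes per call, which
is why their score covers 2 CAS operations.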

The Results
-----------

On an O-O-O CPU there was no significant difference in the time taken.

On a simple pipeline CPU the optimization gave a very significant
benefit for the Fail tests on all GC configurations except
CMS+UseCondCardMark. In all other cases there was no significant
measurable benefit.

Example Test
------------

package org.openjdk;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

@Warmup(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(3)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class CasNull {

    Object tombstone;

    AtomicReference<Object> ref;

    @Param({"0", "1", "2", "4", "8", "16", "32", "64"})
    int backoff;

    @Setup
    public void setup() {
        tombstone = new Object();

        ref = new AtomicReference<>();
        ref.set(tombstone);
    }

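    @Benchmark
    public boolean test() {
        // Backoff between CAS pairs, in Blackhole CPU tokens.
        Blackhole.consumeCPU(backoff);
        // Two successful CASes per call: tombstone -> null, then back.
        ref.compareAndSet(tombstone, null);
        ref.compareAndSet(null, tombstone);
        return true;
    }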
}