[aarch64-port-dev ] RFR: 8080293: AARCH64: Remove unnecessary dmbs from generated CAS code
Andrew Dinn
adinn at redhat.com
Mon Aug 24 14:31:29 UTC 2015
The following webrev against hs-comp head fixes 8080293
http://cr.openjdk.java.net/~adinn/8080293/webrev.00/
It is a follow-on to the prior volatile object patch
8078743: AARCH64: Extend use of stlr to cater for volatile object stores
http://cr.openjdk.java.net/~adinn/8078743/webrev.04/
and requires that previous patch to be applied first.
Testing
-------
The patch is sensitive to GC configuration so it was tested against 5
relevant configs:
  G1
  CMS+UseCondCardMark
  CMS-UseCondCardMark
  Par+UseCondCardMark
  Par-UseCondCardMark
The validity of the transformation was verified by:
  generating and eyeballing compiled code for simple test programs
  successfully running a fairly large program (netbeans)
  generating and eyeballing the HashMap code compiled during a run of
  a fairly large program
The fix was performance tested on 2 implementations of the AArch64
architecture (more details below). On an out-of-order (O-O-O) CPU it
gave no noticeable benefit. On a simple pipeline CPU it gave a very
significant benefit in specific cases.
regards,
Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in UK and Wales under Company Registration No. 3798903
Directors: Michael Cunningham (USA), Matt Parson (USA), Charlie Peters
(USA), Michael O'Neill (Ireland)
The Test
--------
As with the prior patch I tested the original vs new code generation
strategy by running a jmh test first with -XX:+UseBarriersForVolatile
and then with -XX:-UseBarriersForVolatile. Four different test programs
were run in all 5 GC configs. Each test executed repeated CAS
operations on an object field in a single thread with a Blackhole
backoff between CASes varying from 0 to 64.
Test one performed a CAS guaranteed to fail; test two performed a
successful CAS from a fixed object to null and then back; test three
performed a successful CAS from a fixed object to another fixed object
and then back; test four performed a successful CAS from a fixed
object to a newly allocated object and then back. The average time per
CAS operation (ns/op) -- actually per 2 CAS operations for the latter 3
tests -- was used as a score.
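The four CAS shapes described above can be sketched outside jmh as plain AtomicReference calls. This is only an illustration of what each test exercises, not the benchmark code that was actually run; the class name CasVariants, its method names and the fields A and B are hypothetical.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration of the four CAS shapes; not the benchmark code itself.
class CasVariants {
    static final Object A = new Object(); // the fixed object initially held by ref
    static final Object B = new Object(); // a second fixed object

    // Test one: the expected value never matches, so the CAS always fails.
    static boolean failingCas(AtomicReference<Object> ref) {
        return ref.compareAndSet(B, null);
    }

    // Test two: fixed object -> null -> fixed object.
    static boolean toNullAndBack(AtomicReference<Object> ref) {
        return ref.compareAndSet(A, null) && ref.compareAndSet(null, A);
    }

    // Test three: fixed object -> another fixed object -> back.
    static boolean toFixedAndBack(AtomicReference<Object> ref) {
        return ref.compareAndSet(A, B) && ref.compareAndSet(B, A);
    }

    // Test four: fixed object -> newly allocated object -> back.
    static boolean toFreshAndBack(AtomicReference<Object> ref) {
        Object fresh = new Object();
        return ref.compareAndSet(A, fresh) && ref.compareAndSet(fresh, A);
    }

    public static void main(String[] args) {
        AtomicReference<Object> ref = new AtomicReference<>(A);
        System.out.println("fail:  " + failingCas(ref));
        System.out.println("null:  " + toNullAndBack(ref));
        System.out.println("fixed: " + toFixedAndBack(ref));
        System.out.println("fresh: " + toFreshAndBack(ref));
    }
}
```

Each of the last three methods performs two CASes per call, which is why their scores cover 2 CAS operations rather than one.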
The Results
-----------
On an O-O-O CPU there was no significant difference in the time taken.
On a simple pipeline CPU the optimization gave a very significant
benefit for the Fail tests (test one, the CAS guaranteed to fail) on
all GC configurations except CMS+UseCondCardMark. In all other cases
there was no significant measurable benefit.
Example Test
------------
package org.openjdk;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

@Warmup(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(3)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class CasNull {
    // fixed object that the reference is swapped away from and back to
    Object tombstone;
    AtomicReference<Object> ref;

    // amount of Blackhole work performed before each pair of CASes
    @Param({"0", "1", "2", "4", "8", "16", "32", "64"})
    int backoff;

    @Setup
    public void setup() {
        tombstone = new Object();
        ref = new AtomicReference<>();
        ref.set(tombstone);
    }

    @Benchmark
    public boolean test() {
        Blackhole.consumeCPU(backoff);
        // successful CAS from the fixed object to null and then back
        ref.compareAndSet(tombstone, null);
        ref.compareAndSet(null, tombstone);
        return true;
    }
}