Scoped values: a first cut at benchmarking
Andrew Haley
aph at redhat.com
Thu Sep 12 16:42:49 UTC 2019
This is just a taster, and I'll post an explanation later along with
the patch itself.
Here are the benchmarks I've been using during development. Each one
is run first with Scoped<Integer> values and then with
ThreadLocal<Integer> values, and the results are posted in order of
increasing wonderfulness.
1. This is a single get(), where the JMH overhead of merely calling a
test method is significant:
  @Benchmark
  public Integer simpleTLGet(MyState state) {
      return BenchmarkState.t1.get();
  }

ThreadLocalTest.simpleScopedGet  avgt  3  5.447 ± 0.003  ns/op
ThreadLocalTest.simpleTLGet      avgt  3  7.273 ± 0.906  ns/op
2. This repeatedly get()s a value then stores it into an
AtomicReference. The use of AtomicReference.setOpaque forces all of
the stores to actually happen, preventing the optimizer from hoisting
get()s and set()s out of the loop:
  long floss1(ThreadLocal<Integer> t1) {
      long l = 0;
      for (int i = 0; i < 1_000; i++) {
          BenchmarkState.atomicRef.setOpaque(t1.get());
      }
      return l;
  }

  @Benchmark
  public void test1(BenchmarkState state) {
      floss1(state.t1);
  }

ThreadLocalTest.scopedTest1  avgt  3  1873.975 ±   87.717  ns/op
ThreadLocalTest.test1        avgt  3  5066.072 ± 3070.626  ns/op
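For anyone who hasn't met opaque access before, here is a minimal sketch of the same store pattern outside JMH. The class name OpaqueStoreSketch, the method storeRepeatedly and the value 42 are made up for illustration; atomicRef and t1 are stand-ins for the fields of BenchmarkState, which isn't shown in this post. setOpaque and getOpaque are standard AtomicReference methods since Java 9.

```java
import java.util.concurrent.atomic.AtomicReference;

public class OpaqueStoreSketch {
    // Hypothetical stand-ins for BenchmarkState.atomicRef and BenchmarkState.t1.
    static final AtomicReference<Integer> atomicRef = new AtomicReference<>();
    static final ThreadLocal<Integer> t1 = ThreadLocal.withInitial(() -> 42);

    // Same shape as floss1: one opaque store per iteration. Opaque mode is
    // meant to keep each store from being optimized away or merged.
    static Integer storeRepeatedly(int iterations) {
        for (int i = 0; i < iterations; i++) {
            atomicRef.setOpaque(t1.get());
        }
        return atomicRef.getOpaque();
    }

    public static void main(String[] args) {
        System.out.println(storeRepeatedly(1_000));
    }
}
```

With a plain set() on an ordinary field, the compiler would be free to keep only the last store, and the loop would measure very little.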
3. Much the same, but using six values:
  @Benchmark
  public void test2(BenchmarkState state) {
      for (int i = 0; i < 166; i++) {
          BenchmarkState.atomicRef.setOpaque(state.t1.get());
          BenchmarkState.atomicRef.setOpaque(state.t2.get());
          BenchmarkState.atomicRef.setOpaque(state.t3.get());
          BenchmarkState.atomicRef.setOpaque(state.t4.get());
          BenchmarkState.atomicRef.setOpaque(state.t5.get());
          BenchmarkState.atomicRef.setOpaque(state.t6.get());
      }
  }

ThreadLocalTest.scopedTest2  avgt  3  1919.582 ± 573.822  ns/op
ThreadLocalTest.test2        avgt  3  4845.745 ± 309.467  ns/op
This test is to make sure that the performance scales reasonably well
with increasing numbers of values.
4. This get()s and sums a value a million times, first using a scoped
value then a ThreadLocal:
  @Benchmark
  public long summation(MyState state) {
      long sum = 0;
      for (int i = 0; i < 1_000_000; i++) {
          sum += BenchmarkState.t1.get();
      }
      return sum;
  }

ThreadLocalTest.scopedSummation  avgt  3   286998.680 ±   1122.577  ns/op
ThreadLocalTest.summation        avgt  3  3478884.173 ± 305545.897  ns/op
Note that the time using a scoped value is actually just under 0.3ns
(!) per iteration (286998.680 ns for a million iterations). This is
because the scoped value is hoisted into a register by JIT loop
optimization, so the generated code for the loop looks like:
;; B7: # out( B7 B8 ) <- in( B6 B7 ) Loop( B7-B7 inner main of N66 strip mined) Freq: 9.73958e+11
↗ 0x00007f8708456260: add rax,r10
3.06% │ 0x00007f8708456263: add rax,r10
2.25% │ 0x00007f8708456266: add rax,r10
2.91% │ 0x00007f8708456269: add rax,r10
3.17% │ 0x00007f870845626c: add rax,r10
3.60% │ 0x00007f870845626f: add rax,r10
2.76% │ 0x00007f8708456272: add rax,r10
2.79% │ 0x00007f8708456275: add rax,r10
2.93% │ 0x00007f8708456278: add rax,r10
...
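In source terms, the effect of that hoisting is roughly as if the get() had been moved in front of the loop by hand. The sketch below is an illustration only, not C2 output: the class and method names are invented, and a plain Supplier<Integer> stands in for the scoped value, whose API isn't shown in this post.

```java
import java.util.function.Supplier;

public class HoistedSummationSketch {
    // The loop as written in the benchmark: one get() per iteration.
    static long summationAsWritten(Supplier<Integer> source, int iterations) {
        long sum = 0;
        for (int i = 0; i < iterations; i++) {
            sum += source.get();
        }
        return sum;
    }

    // What the optimized loop amounts to: the value is read once,
    // kept in a local (a register), and the loop body is a bare add.
    static long summationHoisted(Supplier<Integer> source, int iterations) {
        long value = source.get();
        long sum = 0;
        for (int i = 0; i < iterations; i++) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        Supplier<Integer> source = () -> 3; // hypothetical constant value
        System.out.println(summationAsWritten(source, 1_000_000));
        System.out.println(summationHoisted(source, 1_000_000));
    }
}
```

The two methods return the same sum as long as the value cannot change while the loop runs, which is what the compiler has to be able to prove before it can hoist the load.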
5. This benchmark increments an AtomicInteger (accessed via a scoped
value or a ThreadLocal) a million times:
  void task_counter() {
      AtomicInteger n = BenchmarkState.t7.get();
      n.setPlain(n.getPlain() + 1);
  }

  @Benchmark
  public long testCounter(MyState state) {
      long l = 0;
      for (int i = 0; i < 1_000_000; i++) {
          task_counter();
      }
      return BenchmarkState.t7.get().getPlain();
  }

ThreadLocalTest.scopedTestCounter  avgt  3        5.735 ±      0.026  ns/op
ThreadLocalTest.testCounter        avgt  3  3441076.699 ± 143503.626  ns/op
I was not expecting this, and had to check the result very carefully.
It's obviously not possible to count up to a million in 6 nanoseconds,
yet the scoped version is roughly 600,000 times faster than the
version using a ThreadLocal! This perhaps requires a little
explanation. C2 can optimize accesses to a scoped value more readily
than accesses to a ThreadLocal, so it replaces the entire loop with a
single addition.
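The collapse of the loop into an addition can be pictured in source form. This is a hand-written illustration of the equivalence, not what C2 literally emits; the class and method names are invented, and the AtomicInteger is passed in directly rather than fetched through a scoped value.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CollapsedCounterSketch {
    // The benchmark loop as written: one plain increment per iteration.
    static int countAsWritten(AtomicInteger n, int iterations) {
        for (int i = 0; i < iterations; i++) {
            n.setPlain(n.getPlain() + 1);
        }
        return n.getPlain();
    }

    // Equivalent result for iterations >= 0: a single addition of the
    // trip count, possible once the counter object is known to be the
    // same on every iteration.
    static int countCollapsed(AtomicInteger n, int iterations) {
        n.setPlain(n.getPlain() + iterations);
        return n.getPlain();
    }

    public static void main(String[] args) {
        System.out.println(countAsWritten(new AtomicInteger(), 1_000_000));
        System.out.println(countCollapsed(new AtomicInteger(), 1_000_000));
    }
}
```

The rewrite is only valid because plain accesses carry no ordering or visibility guarantees for other threads, so the intermediate values of the counter never have to be observable.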
I'm sure this isn't typical usage, either for a ThreadLocal or a
scoped value, so the benchmark is of dubious value, but it does show
that a JIT compiler can do more optimization of scoped values.
Finally, full disclosure: I am accelerating scoped lookups by using a
16-entry (thread-local) cache to access a small set of values. We
could use the same technique with ThreadLocals, but we don't.
The cache doesn't always help but it rarely makes anything
significantly worse, although there is some small overhead for loading
it on a cache miss. I believe that, given the usual principle of
locality, this is a legitimate thing to do and is a worthwhile
optimization in most cases.
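To make the idea concrete, here is a sketch of a small lookup cache sitting in front of a slower search. Everything in it is an assumption made for illustration: the patch has not been posted, so the direct-mapped layout, the identity-hash indexing, the HashMap standing in for the slow path, and all of the names are hypothetical. Only the figure of 16 entries comes from the description above. One instance represents the cache owned by a single thread.

```java
import java.util.HashMap;
import java.util.Map;

public class SmallLookupCacheSketch {
    static final int SIZE = 16; // entry count mentioned in the post

    private final Object[] keys = new Object[SIZE];
    private final Object[] values = new Object[SIZE];
    // Hypothetical slow path; stands in for whatever the real lookup does.
    private final Map<Object, Object> slowPath = new HashMap<>();
    int misses = 0;

    private static int slotFor(Object key) {
        // Masking with SIZE - 1 always yields an index in 0..15.
        return System.identityHashCode(key) & (SIZE - 1);
    }

    void bind(Object key, Object value) {
        slowPath.put(key, value);
        int slot = slotFor(key);
        if (keys[slot] == key) { // drop a stale cached entry for this key
            keys[slot] = null;
            values[slot] = null;
        }
    }

    Object get(Object key) {
        int slot = slotFor(key);
        if (keys[slot] == key) {
            return values[slot]; // hit: one compare and one load
        }
        misses++; // miss: do the slow lookup, then fill the slot
        Object value = slowPath.get(key);
        keys[slot] = key;
        values[slot] = value;
        return value;
    }

    public static void main(String[] args) {
        SmallLookupCacheSketch cache = new SmallLookupCacheSketch();
        Object key = new Object();
        cache.bind(key, 7);
        System.out.println(cache.get(key) + " " + cache.get(key) + " misses=" + cache.misses);
    }
}
```

In this layout a miss costs the slow lookup plus two stores to refill the slot, which is the kind of small overhead mentioned above, and two keys that map to the same slot simply evict each other.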
Here are the results from the set of tests, showing the scoped results
with and without the cache, then the same tests using ThreadLocals,
all in ns/op:
                                        with cache:       without:   ThreadLocal:
ThreadLocalTest.simpleGet    avgt  3          5.451          6.913          7.225
ThreadLocalTest.Test1        avgt  3       1928.372       4970.853       5116.356
ThreadLocalTest.Test2        avgt  3       1918.450       5020.094       4509.272
ThreadLocalTest.Summation    avgt  3    286,722.232    286,901.079  3,505,135.568
ThreadLocalTest.TestCounter  avgt  3          5.735          7.482  3,397,826.003
--
Andrew Haley (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671