Scoped values: a first cut at benchmarking

Thu Sep 12 16:42:49 UTC 2019

This is just a taster, and I'll post an explanation later along with
the patch itself.

Here are the benchmarks I've been using during development. Each one
is run with a Scoped<Integer> and then with a ThreadLocal<Integer> (the
code shown is the ThreadLocal version), and the results are posted in
order of increasing wonderfulness.

1.  This is a single get(), where the JMH overhead of merely calling a
test method is significant:

    @Benchmark
    public Integer simpleTLGet(MyState state) {
        return BenchmarkState.t1.get();
    }

ThreadLocalTest.simpleScopedGet  avgt    3  5.447 ± 0.003  ns/op
ThreadLocalTest.simpleTLGet      avgt    3  7.273 ± 0.906  ns/op

2.  This repeatedly get()s a value then stores it into an
AtomicReference. The use of AtomicReference.setOpaque forces all of the
stores to actually happen, preventing the optimizer from hoisting
get()s and set()s out of the loop:

    // Each opaque store must really happen, so every get() is executed.
    void floss1(ThreadLocal<Integer> t1) {
        for (int i = 0; i < 1_000; i++) {
            BenchmarkState.atomicRef.setOpaque(t1.get());
        }
    }

    @Benchmark
    public void test1(BenchmarkState state) {
        floss1(state.t1);
    }

ThreadLocalTest.scopedTest1  avgt    3  1873.975 ±   87.717  ns/op
ThreadLocalTest.test1        avgt    3  5066.072 ± 3070.626  ns/op

3.  Much the same, but using six values:

    @Benchmark
    public void test2(BenchmarkState state) {
        for (int i = 0; i < 166; i++) {
            BenchmarkState.atomicRef.setOpaque(state.t1.get());
            BenchmarkState.atomicRef.setOpaque(state.t2.get());
            BenchmarkState.atomicRef.setOpaque(state.t3.get());
            BenchmarkState.atomicRef.setOpaque(state.t4.get());
            BenchmarkState.atomicRef.setOpaque(state.t5.get());
            BenchmarkState.atomicRef.setOpaque(state.t6.get());
        }
    }

ThreadLocalTest.scopedTest2  avgt    3  1919.582 ± 573.822  ns/op
ThreadLocalTest.test2        avgt    3  4845.745 ± 309.467  ns/op

This test is to make sure that the performance scales reasonably well
with increasing numbers of values.
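
As a rough sanity check on that scaling claim, the scores for tests 2
and 3 can be divided by the number of get()s each invocation performs
(1000 for test1, 166 * 6 = 996 for test2). The sketch below does only
that arithmetic on the figures quoted above; the class and method names
are made up for illustration and are not part of the benchmark suite.

```java
class PerGetCost {
    // Average cost of one get() + setOpaque() pair, given the JMH score for
    // a whole benchmark invocation and the number of get()s it performs.
    static double nsPerGet(double nsPerOp, int getsPerOp) {
        return nsPerOp / getsPerOp;
    }

    public static void main(String[] args) {
        int gets1 = 1_000;    // test1: one value, 1000 iterations
        int gets2 = 166 * 6;  // test2: six values, 166 iterations

        // Scores copied from the results listed above (ns/op).
        System.out.printf("scopedTest1  %.3f ns/get%n", nsPerGet(1873.975, gets1));
        System.out.printf("test1        %.3f ns/get%n", nsPerGet(5066.072, gets1));
        System.out.printf("scopedTest2  %.3f ns/get%n", nsPerGet(1919.582, gets2));
        System.out.printf("test2        %.3f ns/get%n", nsPerGet(4845.745, gets2));
    }
}
```

On these figures the per-get() cost stays a little under 2ns for scoped
values and around 5ns for ThreadLocals in both tests, which is what
"scales reasonably well" amounts to here.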

4.  This get()s and sums a value a million times, first using a scoped
value then a ThreadLocal:

    @Benchmark
    public long summation(MyState state) {
        long sum = 0;
        for (int i = 0; i < 1_000_000; i++) {
            sum += BenchmarkState.t1.get();
        }
        return sum;
    }

ThreadLocalTest.scopedSummation    avgt    3   286998.680 ±   1122.577  ns/op
ThreadLocalTest.summation          avgt    3  3478884.173 ± 305545.897  ns/op

Note that the time using a scoped value is actually about 0.29ns (!)
per iteration. This is because the scoped value is hoisted into a
register by JIT loop optimization, so the generated code for the loop
looks like:

           ;; B7: #	out( B7 B8 ) <- in( B6 B7 ) Loop( B7-B7 inner main of N66 strip mined) Freq: 9.73958e+11
         ↗  0x00007f8708456260:   add    rax,r10
  3.06%  │  0x00007f8708456263:   add    rax,r10
  2.25%  │  0x00007f8708456266:   add    rax,r10
  2.91%  │  0x00007f8708456269:   add    rax,r10
  3.17%  │  0x00007f870845626c:   add    rax,r10
  3.60%  │  0x00007f870845626f:   add    rax,r10
  2.76%  │  0x00007f8708456272:   add    rax,r10
  2.79%  │  0x00007f8708456275:   add    rax,r10
  2.93%  │  0x00007f8708456278:   add    rax,r10
  ...
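
In source terms, the effect of that hoisting can be sketched as below.
This is only an illustration of the transformation, not the benchmark
code: a plain Supplier<Integer> stands in for the scoped value, the
class and method names are invented, and the mapping of v and sum onto
r10 and rax is my reading of the listing.

```java
import java.util.function.Supplier;

class HoistSketch {
    // As written in the benchmark: one get() per iteration.
    static long sumNaive(Supplier<Integer> value, int iterations) {
        long sum = 0;
        for (int i = 0; i < iterations; i++) {
            sum += value.get();
        }
        return sum;
    }

    // What the optimized loop effectively computes once the get() has been
    // hoisted: the value is loaded once and the loop body is a bare add.
    static long sumHoisted(Supplier<Integer> value, int iterations) {
        int v = value.get();  // loop-invariant (presumably r10 in the listing)
        long sum = 0;         // accumulator (presumably rax in the listing)
        for (int i = 0; i < iterations; i++) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        Supplier<Integer> value = () -> 42;  // stand-in for the scoped value
        System.out.println(sumNaive(value, 1_000_000) == sumHoisted(value, 1_000_000));
    }
}
```

The two methods agree only because get() returns the same value on
every iteration, which is the property the JIT has to be able to prove
before it can hoist the load.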

5.  This benchmark increments an AtomicInteger (accessed via a scoped
value or a ThreadLocal) a million times:

    void task_counter() {
        AtomicInteger n = BenchmarkState.t7.get();
        n.setPlain(n.getPlain() + 1);
    }

    @Benchmark
    public long testCounter(MyState state) {
        for (int i = 0; i < 1_000_000; i++) {
            task_counter();
        }
        return BenchmarkState.t7.get().getPlain();
    }

ThreadLocalTest.scopedTestCounter  avgt    3        5.735 ±      0.026  ns/op
ThreadLocalTest.testCounter        avgt    3  3441076.699 ± 143503.626  ns/op

I was not expecting this, and had to check the result very carefully.
It's obviously not possible to count up to a million in 6 nanoseconds.

The scoped version is roughly 600,000 times faster than the version
using a ThreadLocal! This perhaps requires a little explanation. When
using a scoped value, C2 can optimize more readily than it can with a
ThreadLocal, so it replaces the entire loop with a single addition.
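
A minimal sketch of what that collapse amounts to, assuming the lookup
of the AtomicInteger has already been hoisted out of the loop. The
class and method names are invented for illustration, and this shows
the arithmetic equivalence rather than what C2 literally emits.

```java
import java.util.concurrent.atomic.AtomicInteger;

class CounterCollapse {
    // The benchmark's loop as written: a plain read-modify-write each time.
    static void countByLoop(AtomicInteger n, int iterations) {
        for (int i = 0; i < iterations; i++) {
            n.setPlain(n.getPlain() + 1);
        }
    }

    // The same effect as a single addition. Plain (non-volatile) accesses
    // give the compiler the freedom to fold the increments together.
    static void countByAddition(AtomicInteger n, int iterations) {
        n.setPlain(n.getPlain() + iterations);
    }

    public static void main(String[] args) {
        AtomicInteger a = new AtomicInteger();
        AtomicInteger b = new AtomicInteger();
        countByLoop(a, 1_000_000);
        countByAddition(b, 1_000_000);
        System.out.println(a.getPlain() == b.getPlain());
    }
}
```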

I'm sure this isn't typical usage, either for a ThreadLocal or a
scoped value, so the benchmark is of dubious value, but it does show
that a JIT compiler can do more optimization of scoped values.

Finally, full disclosure: I am accelerating scoped lookups by using a
16-entry (thread-local) cache to speed up access to a small set of
values. We could use the same technique with ThreadLocals, but we don't.

The cache doesn't always help but it rarely makes anything
significantly worse, although there is some small overhead for loading
it on a cache miss. I believe that, given the usual principle of
locality, this is a legitimate thing to do and is a worthwhile
optimization in most cases.
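
To make the idea concrete, here is one way a small lookup cache of this
kind could be put together. It is a sketch under stated assumptions, not
the code in the patch: the direct-mapped layout, the identity-hash
indexing, the IdentityHashMap slow path and every name below are
hypothetical, and one instance stands in for the state of one thread.

```java
import java.util.IdentityHashMap;
import java.util.Map;

class LookupCacheSketch {
    static final int CACHE_SIZE = 16;  // the 16 entries mentioned above

    // Hypothetical slow path: stands in for whatever holds the bindings.
    private final Map<Object, Object> bindings = new IdentityHashMap<>();

    // Direct-mapped cache: slot s holds a key, slot s + 1 holds its value.
    private final Object[] cache = new Object[CACHE_SIZE * 2];

    int slowLookups = 0;  // counts cache misses, for demonstration only

    private static int slotFor(Object key) {
        return (System.identityHashCode(key) & (CACHE_SIZE - 1)) * 2;
    }

    void bind(Object key, Object value) {
        bindings.put(key, value);
        int slot = slotFor(key);
        if (cache[slot] == key) {  // drop a stale cached entry for this key
            cache[slot] = null;
            cache[slot + 1] = null;
        }
    }

    Object get(Object key) {
        int slot = slotFor(key);
        if (cache[slot] == key) {
            return cache[slot + 1];  // hit: one compare and one load
        }
        slowLookups++;               // miss: do the full lookup
        Object value = bindings.get(key);
        cache[slot] = key;           // load the cache for next time
        cache[slot + 1] = value;
        return value;
    }

    public static void main(String[] args) {
        LookupCacheSketch perThread = new LookupCacheSketch();
        Object key = new Object();
        perThread.bind(key, 42);
        perThread.get(key);  // miss, fills the cache
        perThread.get(key);  // hit
        System.out.println("slow lookups: " + perThread.slowLookups);
    }
}
```

In a layout like this a miss costs the full lookup plus two extra
stores, which is the kind of small overhead described above, and two
keys that map to the same slot will keep evicting each other.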

Here are the results from the set of tests, showing the scoped results
with and without the cache, then the same tests using ThreadLocals, all
in ns:

                                        with cache:      without:   ThreadLocal:
ThreadLocalTest.simpleGet    avgt    3        5.451         6.913          7.225
ThreadLocalTest.Test1        avgt    3     1928.372      4970.853       5116.356
ThreadLocalTest.Test2        avgt    3     1918.450      5020.094       4509.272
ThreadLocalTest.Summation    avgt    3   286722.232    286901.079    3505135.568
ThreadLocalTest.TestCounter  avgt    3        5.735         7.482    3397826.003

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671