Scoped values

Mon Sep 16 18:05:48 UTC 2019

Here's a proof-of-concept implementation of scoped values to play
with.

Download:
The patch and the benchmark are in http://cr.openjdk.java.net/~aph/scoped/

The API is very simple. Create a Scoped object with, e.g.

  static final Scoped<Integer> myVar = Scoped.forType(Integer.class);

Bind your scoped object to a scope with, e.g.

   myVar.bind(33, () -> {
     System.out.println(myVar.get());
   });

... which should print 33. The lifetime of the binding is the lifetime
of the enclosed scope, and if you try to get() a value outside its
scope you'll get an exception. There is no put(): you can re-bind a
value if you want, but that creates a new binding scope which hides
the old value. That value will become visible again when the inner
scope ends.
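
To make the re-binding rule concrete, here is a minimal sketch. The Scoped class itself exists only in the patch, so this uses a hypothetical stand-in, MiniScoped, which models just the bind()/get() semantics described above. The stack-per-thread design and the IllegalStateException are assumptions of the sketch, not details of the PoC.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in for the PoC's Scoped class. It models only the
// visible behaviour of bind() and get(), not the real implementation.
final class MiniScoped<T> {
    // One stack of bindings per thread; the top is the visible value.
    private final ThreadLocal<Deque<T>> bindings =
        ThreadLocal.withInitial(ArrayDeque::new);

    void bind(T value, Runnable body) {
        Deque<T> stack = bindings.get();
        stack.push(value);              // an inner binding hides the outer one
        try {
            body.run();
        } finally {
            stack.pop();                // the outer value becomes visible again
        }
    }

    T get() {
        T value = bindings.get().peek();
        if (value == null) {
            // Exception type is an assumption; the message only says
            // "you'll get an exception".
            throw new IllegalStateException("not bound in this scope");
        }
        return value;
    }
}

class RebindDemo {
    static final MiniScoped<Integer> myVar = new MiniScoped<>();

    // Records what get() returns before, inside, and after a re-binding.
    static List<Integer> observe() {
        List<Integer> seen = new ArrayList<>();
        myVar.bind(33, () -> {
            seen.add(myVar.get());                          // outer value
            myVar.bind(44, () -> seen.add(myVar.get()));    // inner value hides it
            seen.add(myVar.get());                          // outer value again
        });
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(observe());   // prints [33, 44, 33]
    }
}
```

Calling myVar.get() after the outermost bind() has returned throws in this sketch, matching the "exception outside its scope" rule.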

And that is, more or less, the extent of the API.

Here's a full example:

public class Hello
{
    static final Scoped<String> myScopedValue = Scoped.forType(String.class);

    public static void main(String[] args) {
        myScopedValue.bind("Hello, World!", () -> {
            hello();
        });
    }

    static void hello() {
        System.out.println(myScopedValue.get());
    }
}

How it works:

When a value is bound, an instance of ScopedBinding is created. The
bind() method adds the bound value to a ScopedMap, a hash table that
maps from Scoped<T> to T and expands as required.
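
The shape of that mechanism can be sketched as follows. This is not the code from the patch: SketchScoped, its per-thread HashMap, and the save-and-restore in bind() are all assumptions, chosen only to show a map keyed by the scoped object with bindings that disappear when the scope exits.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical names throughout; this is not the code from the patch.
final class SketchScoped<T> {
    // Stand-in for the per-thread ScopedMap: scoped object -> bound value.
    // HashMap grows on demand, which mirrors "expands as required".
    private static final ThreadLocal<Map<SketchScoped<?>, Object>> MAP =
        ThreadLocal.withInitial(HashMap::new);

    private final Class<T> type;

    private SketchScoped(Class<T> type) { this.type = type; }

    static <T> SketchScoped<T> forType(Class<T> type) {
        return new SketchScoped<>(type);
    }

    void bind(T value, Runnable body) {
        Map<SketchScoped<?>, Object> map = MAP.get();
        boolean hadOuter = map.containsKey(this);
        Object outer = map.put(this, type.cast(value));   // add the binding
        try {
            body.run();
        } finally {
            // Leaving the scope: restore the outer binding or remove ours.
            if (hadOuter) map.put(this, outer); else map.remove(this);
        }
    }

    T get() {
        Map<SketchScoped<?>, Object> map = MAP.get();
        if (!map.containsKey(this)) {
            throw new IllegalStateException("not bound in this scope");
        }
        return type.cast(map.get(this));
    }

    // Number of live bindings on the current thread, for illustration only.
    static int liveBindings() { return MAP.get().size(); }
}

class MapDemo {
    static final SketchScoped<String> name = SketchScoped.forType(String.class);
    static final SketchScoped<Integer> count = SketchScoped.forType(Integer.class);

    static String describe() {
        StringBuilder sb = new StringBuilder();
        name.bind("loom", () ->
            count.bind(2, () ->
                sb.append(name.get()).append(':').append(count.get())
                  .append(" live=").append(SketchScoped.liveBindings())));
        return sb.append(" after=").append(SketchScoped.liveBindings()).toString();
    }

    public static void main(String[] args) {
        System.out.println(describe());   // prints loom:2 live=2 after=0
    }
}
```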

[ In this PoC implementation the ScopedMap is local to a Thread, but I
guess that in Project Loom each Continuation will maintain a ScopedMap
of the bindings which exist in its stack. When a hierarchy of
continuations is mounted on the native stack, we can scan them one by
one. There are other ways to do the search: we could even walk the
stack, or we could maintain a tree of bindings. Which is most
performant depends on the relative frequency of bind() to get(). ]
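
The "scan them one by one" idea might look roughly like this. It is a toy model under stated assumptions: ContinuationScope is a hypothetical class, not a Loom API, and each instance simply owns a map plus a link to the continuation it is mounted on.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model: each "continuation" owns a map of the bindings made
// on its own stack and points at the continuation it is mounted on.
final class ContinuationScope {
    private final ContinuationScope parent;              // null at the root
    private final Map<Object, Object> bindings = new HashMap<>();

    ContinuationScope(ContinuationScope parent) { this.parent = parent; }

    void bind(Object key, Object value) { bindings.put(key, value); }

    // Scan this continuation's map first, then each enclosing one in turn.
    Object lookup(Object key) {
        for (ContinuationScope c = this; c != null; c = c.parent) {
            if (c.bindings.containsKey(key)) return c.bindings.get(key);
        }
        throw new IllegalStateException("not bound");
    }
}

class ChainDemo {
    static String run() {
        Object user = new Object(), locale = new Object();
        ContinuationScope outer = new ContinuationScope(null);
        ContinuationScope inner = new ContinuationScope(outer);
        outer.bind(user, "alice");
        outer.bind(locale, "en");
        inner.bind(locale, "fr");        // hides the outer locale
        return inner.lookup(user) + "/" + inner.lookup(locale)
             + "/" + outer.lookup(locale);
    }

    public static void main(String[] args) {
        System.out.println(run());       // prints alice/fr/en
    }
}
```

In this model bind() is a single map insert while lookup() costs one probe per continuation in the chain, which is the bind()/get() trade-off mentioned above.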

Also, for performance reasons, there is a small thread-local cache of
bindings. Successful lookups in this cache are considerably faster
than searching ScopedMaps, and the cost of maintaining it is
slight. (How fast is the cache? It depends. If everything gets
inlined, the code is small enough, and values are repeatedly accessed,
C2 will hoist cache entries into *registers*.)

If we're rarely using a scoped value the cost of maintaining
this cache will be very small, but the performance increase when we
frequently use a value will be great. As with all caches, if your
working set of values is large enough to thrash the cache then
performance will suffer.
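
A small sketch shows both the fast path and the thrashing case. The message does not describe the real cache layout, so everything here is assumed: the class CachedBindings, the four-entry direct-mapped array, and the explicit key ids used to pick a slot. One instance stands for one thread's cache.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; one instance would belong to one thread.
final class CachedBindings {
    static final int CACHE_SIZE = 4;     // assumed size, a power of two

    // Stand-in for a scoped variable; the id selects the cache slot.
    static final class Key {
        final int id;
        Key(int id) { this.id = id; }
    }

    private final Map<Key, Object> map = new HashMap<>();   // slow path
    private final Key[] cachedKeys = new Key[CACHE_SIZE];
    private final Object[] cachedValues = new Object[CACHE_SIZE];
    int hits, misses;

    private static int slot(Key k) { return k.id & (CACHE_SIZE - 1); }

    private void invalidate(Key k) {
        int s = slot(k);
        if (cachedKeys[s] == k) { cachedKeys[s] = null; cachedValues[s] = null; }
    }

    void bind(Key k, Object value, Runnable body) {
        boolean hadOuter = map.containsKey(k);
        Object outer = map.put(k, value);
        invalidate(k);                   // a cached value would now be stale
        try {
            body.run();
        } finally {
            if (hadOuter) map.put(k, outer); else map.remove(k);
            invalidate(k);
        }
    }

    Object get(Key k) {
        int s = slot(k);
        if (cachedKeys[s] == k) {        // fast path: one compare, one load
            hits++;
            return cachedValues[s];
        }
        misses++;
        if (!map.containsKey(k)) throw new IllegalStateException("not bound");
        Object v = map.get(k);
        cachedKeys[s] = k;               // fill, evicting whatever was there
        cachedValues[s] = v;
        return v;
    }
}

class CacheDemo {
    // One hot key read 1000 times: a single miss, then hits.
    static int[] hotKey() {
        CachedBindings b = new CachedBindings();
        CachedBindings.Key k = new CachedBindings.Key(0);
        b.bind(k, "v", () -> { for (int i = 0; i < 1000; i++) b.get(k); });
        return new int[] { b.hits, b.misses };
    }

    // Two keys that share slot 0, read alternately: every get() misses.
    static int[] thrash() {
        CachedBindings b = new CachedBindings();
        CachedBindings.Key k0 = new CachedBindings.Key(0);
        CachedBindings.Key k4 = new CachedBindings.Key(4);   // 4 & 3 == 0
        b.bind(k0, "a", () -> b.bind(k4, "b", () -> {
            for (int i = 0; i < 500; i++) { b.get(k0); b.get(k4); }
        }));
        return new int[] { b.hits, b.misses };
    }

    public static void main(String[] args) {
        int[] hot = hotKey(), bad = thrash();
        System.out.println(hot[0] + " hits, " + hot[1] + " misses");   // 999 hits, 1 misses
        System.out.println(bad[0] + " hits, " + bad[1] + " misses");   // 0 hits, 1000 misses
    }
}
```

The only maintenance cost in this sketch is the slot invalidation on entering and leaving bind(), which is why a rarely used value pays very little for the cache.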

The performance advantage of scoped values over ThreadLocals varies
from around a factor of 2 to more than a million in the benchmarks
below: C2 can analyse scoped accesses much more readily than
ThreadLocals. With C1 and the interpreter the speedup is less dramatic
but still substantial. There is scope for yet more optimization, but I
had to stop somewhere.

Benchmark results, "SC" for Scoped, "TL" for ThreadLocal:

(C2 compiler)
Benchmark                            Mode  Cnt         Score        Error  Units
ThreadLocalTest.counterSC            avgt    3         7.742 ±      0.033  ns/op
ThreadLocalTest.counterTL            avgt    3  13562023.214 ± 357446.858  ns/op
ThreadLocalTest.getSC                avgt    3         5.447 ±      0.010  ns/op
ThreadLocalTest.getTL                avgt    3        10.894 ±      0.043  ns/op
ThreadLocalTest.summationSC          avgt    3    289371.352 ±  10586.007  ns/op
ThreadLocalTest.summationTL          avgt    3   5936803.887 ± 585718.637  ns/op
ThreadLocalTest.thousandGetsMultiSC  avgt    3      3184.632 ±    152.276  ns/op
ThreadLocalTest.thousandGetsMultiTL  avgt    3      8023.575 ±    147.210  ns/op
ThreadLocalTest.thousandGetsSC       avgt    3      3497.657 ±    152.637  ns/op
ThreadLocalTest.thousandGetsTL       avgt    3      8515.073 ±   1380.751  ns/op

(C1 compiler only)
Benchmark                            Mode  Cnt         Score        Error  Units
ThreadLocalTest.counterSC            avgt    3   2729177.283 ± 139922.430  ns/op
ThreadLocalTest.counterTL            avgt    3  27265732.333 ± 164437.245  ns/op
ThreadLocalTest.getSC                avgt    3         6.310 ±      0.003  ns/op
ThreadLocalTest.getTL                avgt    3        18.065 ±      0.022  ns/op
ThreadLocalTest.summationSC          avgt    3   2737301.824 ±  92434.595  ns/op
ThreadLocalTest.summationTL          avgt    3  12904368.400 ±  19691.899  ns/op
ThreadLocalTest.thousandGetsMultiSC  avgt    3      9537.571 ±     28.315  ns/op
ThreadLocalTest.thousandGetsMultiTL  avgt    3     16085.950 ±    131.427  ns/op
ThreadLocalTest.thousandGetsSC       avgt    3      8563.665 ±    904.076  ns/op
ThreadLocalTest.thousandGetsTL       avgt    3     16384.927 ±     22.803  ns/op
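
The speedups implied by the C2 table can be read off by dividing each TL score by the matching SC score. The helper below is only a convenience for that arithmetic; the class and method names are made up, and the numbers are copied from the C2 table above.

```java
// Hypothetical helper: computes TL/SC ratios from the C2 scores above.
final class Speedups {
    static final String[] NAMES =
        { "counter", "get", "summation", "thousandGetsMulti", "thousandGets" };
    // Scores in ns/op, copied from the C2 table.
    static final double[] SC = { 7.742, 5.447, 289371.352, 3184.632, 3497.657 };
    static final double[] TL = { 13562023.214, 10.894, 5936803.887, 8023.575, 8515.073 };

    // Ratio of ThreadLocal time to Scoped time; above 1 means Scoped is faster.
    static double speedup(double tlNanos, double scNanos) {
        return tlNanos / scNanos;
    }

    public static void main(String[] args) {
        for (int i = 0; i < NAMES.length; i++) {
            System.out.printf("%-18s %,.1fx%n", NAMES[i], speedup(TL[i], SC[i]));
        }
    }
}
```

For these figures the ratio runs from about 2 for get to roughly 1.75 million for counter.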

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671