RFR (M): Thread-local buffers for liveness data
Aleksey Shipilev
shade at redhat.com
Wed Jan 4 22:23:28 UTC 2017
Hi,
We know from mark-compact performance work that liveness computation takes a
non-negligible part of marking time.
If you look at profiles of an application with a large dataset, you can
clearly see the atomic "lock xadd" from SHRegion::increase_live_data among the
hotspots. It is hot both because of the plain latency of the atomic and because
of contention, even on a moderately sized x86.
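For reference, the contended pattern can be modeled roughly as follows. This is a minimal, hypothetical sketch (the `Region` type and the helper names are illustrative, not Shenandoah's code): every marked object does one atomic read-modify-write on its region's shared counter. On x86 that is a lock-prefixed instruction, and all marking threads that hit the same region bounce the same cache line.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a heap region with a shared live-data counter (bytes).
struct Region {
  std::atomic<std::size_t> live_data{0};
};

// Baseline: one atomic read-modify-write per marked object. On x86 this is a
// lock-prefixed instruction (e.g. "lock xadd" when the old value is used).
// Relaxed ordering is enough for a pure counter that is read after join().
void increase_live_data(Region& r, std::size_t bytes) {
  r.live_data.fetch_add(bytes, std::memory_order_relaxed);
}

// Several marking threads updating the same region all contend on one cache line.
std::size_t mark_concurrently(Region& r, int threads, int objects_per_thread,
                              std::size_t obj_bytes) {
  assert(threads > 0 && objects_per_thread >= 0);
  std::vector<std::thread> workers;
  for (int t = 0; t < threads; t++) {
    workers.emplace_back([&r, objects_per_thread, obj_bytes] {
      for (int i = 0; i < objects_per_thread; i++) {
        increase_live_data(r, obj_bytes);
      }
    });
  }
  for (std::thread& w : workers) {
    w.join();
  }
  return r.live_data.load();
}
```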
Let's upgrade the one-slot cache to full-blown thread-local buffers for
liveness data:
http://cr.openjdk.java.net/~shade/shenandoah/liveness-threadlocal/webrev.01/
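The thread-local buffering idea can be sketched as follows, assuming one buffer per marking worker, a table of 16-bit counters (the equivalent of jushort) indexed by region number, and live data counted in HeapWords. The `Region` and `LiveDataBuffer` types and their `add`/`flush` methods are hypothetical and not the names used in the webrev. Objects only bump a local counter; the shared atomic is touched only when a counter would overflow, or when the worker flushes at the end of marking.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical stand-in for a heap region: one shared live-data counter in HeapWords.
struct Region {
  std::atomic<std::size_t> live_words{0};
};

// Hypothetical per-worker liveness buffer: one 16-bit counter per region,
// so 2K regions cost only 4KB per thread.
class LiveDataBuffer {
 public:
  explicit LiveDataBuffer(std::size_t num_regions) : counts_(num_regions, 0) {}

  // Called once per marked object: accumulate its size locally.
  void add(std::vector<Region>& regions, std::size_t region, std::size_t words) {
    assert(region < counts_.size() && region < regions.size());
    std::size_t sum = counts_[region] + words;
    if (sum > kMax) {
      // Rare path: the local slot would overflow, publish everything to the shared counter.
      regions[region].live_words.fetch_add(sum, std::memory_order_relaxed);
      counts_[region] = 0;
    } else {
      // Common path: plain local store, no atomics.
      counts_[region] = static_cast<std::uint16_t>(sum);
    }
  }

  // Called when the worker finishes marking: publish all leftover counts.
  void flush(std::vector<Region>& regions) {
    for (std::size_t i = 0; i < counts_.size(); i++) {
      if (counts_[i] != 0) {
        regions[i].live_words.fetch_add(counts_[i], std::memory_order_relaxed);
        counts_[i] = 0;
      }
    }
  }

 private:
  static constexpr std::size_t kMax = std::numeric_limits<std::uint16_t>::max();
  std::vector<std::uint16_t> counts_;
};
```

Since each worker owns its buffer, the per-object cost is a plain 16-bit load, add and store. The trade-off is that the final `flush` walks the whole table once per worker, which is cheap compared to one atomic per object.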
Observations:
a) The one-slot cache gives a ~20-40% hit rate on most workloads, which means
that more than half of the objects still do the atomic xadd. My attempts at
smarter N-slot/history caching were not fruitful: the long tail flaps happily
all over the place.
b) size_t and jint are overkill for the table elements. Each thread would
potentially touch a local table of ${regions}*${sizeof(element)} bytes. On my
machine, 2K regions with size_t elements add up to 16KB, which is half of L1.
With jushort, it is only 4KB. In reality, most threads would touch only a few
elements, and would fall back to the atomic add only on rare overflows.
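c) Switching live_data from bytes to HeapWords expands the buffering
capacity: a jushort slot overflows at 65535 units, which is ~64KB when counting
bytes, but ~512KB when counting 8-byte HeapWords on a 64-bit VM.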
d) With 8 threads, we take up 4KB*8 = +32KB of additional space. I would expect
our region count to grow sub-linearly with thread count, so for 128 threads it
would be around +512KB for all threads.
e) Performance-wise, SPECjvm2008 is not affected (its live data set, LDS, is way too low).
f) Mark tests that retain large object graphs benefit a lot. With the
"aggressive" heuristics and a large tree of 10M nodes:
Baseline, conc mark times:
35.99 s (avg = 105.24 ms) (num = 342)
35.90 s (avg = 108.47 ms) (num = 331)
35.98 s (avg = 103.69 ms) (num = 347)
36.08 s (avg = 104.89 ms) (num = 344)
36.09 s (avg = 104.90 ms) (num = 344)
Patched, conc mark times:
33.68 s (avg = 83.37 ms) (num = 404)
33.69 s (avg = 84.64 ms) (num = 398)
33.67 s (avg = 83.77 ms) (num = 402)
33.71 s (avg = 82.01 ms) (num = 411)
33.65 s (avg = 85.41 ms) (num = 394)
(lower times => more frequent marks under "aggressive", hence the higher num;
the average mark time drops from ~105 ms to ~84 ms, about 20% lower)
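The sizing arithmetic in b) through d) can be reproduced with a few compile-time checks. This sketch assumes 2K regions, a 64-bit VM where size_t and HeapWord are both 8 bytes, and one jushort (16-bit) table per marking thread; the constants and helper names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>

// Assumed setup: 2K regions, 8-byte HeapWords (64-bit VM).
constexpr std::size_t kRegions = 2048;
constexpr std::size_t kHeapWordSize = 8;

// b) Per-thread table footprint for a given counter type.
template <typename T>
constexpr std::size_t table_bytes(std::size_t regions) {
  return regions * sizeof(T);
}

// c) Bytes of live data one 16-bit slot absorbs before overflowing,
//    given the size of the unit it counts (1 for bytes, 8 for HeapWords).
constexpr std::size_t slot_capacity_bytes(std::size_t unit_bytes) {
  return static_cast<std::size_t>(std::numeric_limits<std::uint16_t>::max()) * unit_bytes;
}

// d) Total extra footprint across all marking threads.
constexpr std::size_t total_bytes(std::size_t threads, std::size_t per_thread) {
  return threads * per_thread;
}

static_assert(table_bytes<std::uint64_t>(kRegions) == 16 * 1024, "size_t (8 bytes): 16KB");
static_assert(table_bytes<std::uint32_t>(kRegions) == 8 * 1024, "jint (4 bytes): 8KB");
static_assert(table_bytes<std::uint16_t>(kRegions) == 4 * 1024, "jushort (2 bytes): 4KB");
static_assert(slot_capacity_bytes(1) == 65535, "counting bytes: ~64KB per slot");
static_assert(slot_capacity_bytes(kHeapWordSize) == 524280, "counting HeapWords: ~512KB per slot");
static_assert(total_bytes(8, 4 * 1024) == 32 * 1024, "8 threads: +32KB");
static_assert(total_bytes(128, 4 * 1024) == 512 * 1024, "128 threads: +512KB");
```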
Testing: hotspot_gc_shenandoah, SPECjvm2008, targeted benchmarks
Thanks,
-Aleksey
More information about the shenandoah-dev
mailing list