RFR (M): Thread-local buffers for liveness data

Wed Jan 4 22:23:28 UTC 2017

Hi,

We know from mark-compact performance work that liveness computation takes a
non-negligible part of marking time.

If you look into profiles for an application with a large dataset, you can
clearly see the atomic "lock xadd" from SHRegion::increase_live_data among the
hotspots. It is a hotspot for both plain latency and contention reasons, even on
a moderately sized x86 machine.

Let's upgrade the one-slot cache into full-blown thread-local buffers for
liveness data:
  http://cr.openjdk.java.net/~shade/shenandoah/liveness-threadlocal/webrev.01/

Observations:

 a) One-slot cache gives ~20-40% cache hit rate on most workloads, which means
at least every second object still does the atomic xadd. My attempts at smarter
N-slot/history caching were not fruitful: the long tail flaps happily all over
the place.

 b) size_t and jint are overkill for the table. Each thread would potentially
touch a ${regions}*${sizeof(element)}-sized local table. On my machine, 2K
size_t elements add up to 16KB, which is half of L1. With jushort, it is only
4KB. In reality, most threads would touch only a few elements, and fall back to
the atomic add only on rare overflows.

 c) Switching live_data from bytes to HeapWords helps to expand the buffering
capacity.

 d) With 8 threads, we take up 4KB*8 = +32KB of additional space. I would expect
our region count to grow sub-linearly with thread count, and so for 128
threads, it would be about +512KB for all threads.

 e) Performance-wise, SPECjvm2008 is not affected (LDS, the live data set, is
way too low).

 f) Mark tests that retain large object graphs benefit a lot. With "aggressive"
heuristics, and a large tree with 10M nodes:

Baseline, conc mark times:
  35.99 s (avg =   105.24 ms)  (num =   342)
  35.90 s (avg =   108.47 ms)  (num =   331)
  35.98 s (avg =   103.69 ms)  (num =   347)
  36.08 s (avg =   104.89 ms)  (num =   344)
  36.09 s (avg =   104.90 ms)  (num =   344)

Patched, conc mark times:
  33.68 s (avg =    83.37 ms)  (num =   404)
  33.69 s (avg =    84.64 ms)  (num =   398)
  33.67 s (avg =    83.77 ms)  (num =   402)
  33.71 s (avg =    82.01 ms)  (num =   411)
  33.65 s (avg =    85.41 ms)  (num =   394)

(lower times => more frequent marks under "aggressive")
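
For readers who do not want to open the webrev, here is a minimal standalone
sketch of the idea from (b) and (c). It is not the patch itself: Region,
LivenessBuffer and all member names are hypothetical, and uint16_t stands in
for jushort. Each thread accumulates live words per region in a narrow local
counter and only does the atomic add when that counter would overflow, or at
the final flush.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical stand-in for a heap region: only the shared liveness counter.
struct Region {
  std::atomic<size_t> live_words{0};
};

// Hypothetical per-thread buffer: one 16-bit counter per region, counted in
// heap words rather than bytes, so each slot buffers more data before it
// overflows.
class LivenessBuffer {
public:
  explicit LivenessBuffer(std::vector<Region>& regions)
    : _regions(regions), _local(regions.size(), 0) {}

  // Record `words` live heap words for the region with index `idx`.
  void add(size_t idx, size_t words) {
    const size_t max = std::numeric_limits<uint16_t>::max();
    size_t sum = _local[idx] + words;
    if (sum > max) {
      // Rare path: the slot would overflow, so publish the buffered amount
      // plus the new amount with a single atomic add, then reset the slot.
      _regions[idx].live_words.fetch_add(sum, std::memory_order_relaxed);
      _local[idx] = 0;
    } else {
      // Common path: plain thread-local store, no atomics involved.
      _local[idx] = static_cast<uint16_t>(sum);
    }
  }

  // Publish whatever is still buffered, e.g. when the thread finishes marking.
  void flush() {
    for (size_t i = 0; i < _local.size(); i++) {
      if (_local[i] != 0) {
        _regions[i].live_words.fetch_add(_local[i], std::memory_order_relaxed);
        _local[i] = 0;
      }
    }
  }

  // Size of the local table: regions * sizeof(element).
  size_t footprint_bytes() const { return _local.size() * sizeof(uint16_t); }

private:
  std::vector<Region>& _regions;
  std::vector<uint16_t> _local;
};
```

With 2K regions, footprint_bytes() comes out at 4KB per thread, which matches
the estimate in (b). Assuming 8-byte heap words, a 16-bit slot buffers up to
about 512KB of live data per region before it has to touch the shared counter,
versus 64KB if the same slot counted bytes.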

Testing: hotspot_gc_shenandoah, SPECjvm2008, targeted benchmarks

Thanks,
-Aleksey