RFR(L): 8195097: Provide hashtable with low latency and high concurrency

Tue Feb 20 13:33:47 UTC 2018

Hi all, please review.

The lower latency of the new gcs have higher demands on runtime data-structure in
terms of latency. This is a concurrent hashtable using a global-counter for RCU semantics.

Webrev:
http://cr.openjdk.java.net/~rehn/8195097/webrev/

Bug:
https://bugs.openjdk.java.net/browse/JDK-8195097

- It does concurrent re-size by doubling size and use one more bit from hash.
That means a bucket in the old table contains nodes to either one of two buckets
in the new table. Each node in the chain is sorted into one of the two buckets
with concurrent readers. It possible to do concurrent insert also, but not
implemented in this change-set. Shrinking just take two node chains and put them
together in the corresponding bucket.

- To keep track of what is visible to the concurrent readers it's uses a
global-counter with RCU semantics, needed during re-size and for deletes.

- Readers never blocks or spins.
(The ballpark numbers: mean: 157.453 stddev: 58.9686 var:3477.3 ns)

- Writer do CAS insert, but need to spin during re-size of targeted bucket or
  insert collision.

- Operations that loop over all bucket are serialized in this change-set.

- The global-counter uses a counter to determine current generation. Any reader
needs to have a local counter for which generation is currently read. By
increment the global-counter and scan for threads reading an old generation
and wait for them to complete, we know when an old generation is not visible
(no pre-existing reader). In RCU terms, this creates a grace-period. Since the
scanning of thread must be lock-free SMR-Threads-list was a prerequisite. We
can thus only scan JavaThreads, and if needed threads that never exits could be
included, like the VMThread.

A follow-up change-set will make it possible to do operations that loop over
all bucket, e.g. re-size, scan, bulk delete with multiple threads. Potential
also make it possible to do them in parallel, e.g. bulk delete while re-sizing.

Two enhancement to the global-counter will be looked into:
- Quiescent state RCU semantic by using the natural state of JavaThreads in the VM.
- Asynchronous write synchronize, where reclamation, if there are per-existing
reader, is done by the last reader leaving that generation.

Several people have done pre-reviews, thanks and CC:ed.

Multiple parallel instances of the new gtest have being running many days and
GTestWrapper.java passes on all our platforms. I also tested aarch64, _but_
I only had access to Cortex A53 which lacks out-of-order execution AFAIK.

The OrderAccess pattern is stricter than truly needed, for weaker (non-TSO) platforms
we might need to make it less strict for performance. Let's hope not.

Thanks, Robbin