RFR: 8200557: OopStorage parallel iteration scales poorly

Wed Apr 25 15:02:42 UTC 2018

Anyone looking at this yet?

> On Apr 19, 2018, at 4:18 AM, Kim Barrett <kim.barrett at oracle.com> wrote:
> 
> Please review this change to OopStorage parallel iteration to improve
> the scaling with additional threads.
> 
> Two sources of poor scaling were found: (1) contention when claiming
> blocks, and (2) each worker thread ended up touching the majority of
> the blocks, even those not processed by that thread.
> 
> To address this, we changed the representation of the sequence of all
> blocks.  Rather than being a doubly-linked intrusive list linked
> through the blocks, it is now an array of pointers to blocks.  We use
> a combination of refcounts and an RCU-inspired mechanism to safely
> manage the array storage when it needs to grow, avoiding the need to
> lock access to the array while performing concurrent iteration.
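
For readers following along, here is a minimal sketch of the general idea of iterating over a reference-counted snapshot while the array is replaced on growth. It is not the code in the webrev: it uses std::shared_ptr's reference count plus a short mutex-protected pointer copy in place of the patch's own refcounts and RCU-inspired mechanism, and BlockSequence, snapshot and append are hypothetical names. It also assumes writers are serialized externally.

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

struct Block {};  // hypothetical placeholder for a storage block

class BlockSequence {
  using Array = std::vector<Block*>;
  mutable std::mutex _lock;  // guards only the _current pointer, not iteration
  std::shared_ptr<const Array> _current = std::make_shared<Array>();

public:
  // Readers take a counted reference, then iterate without holding any lock.
  std::shared_ptr<const Array> snapshot() const {
    std::lock_guard<std::mutex> guard(_lock);
    return _current;
  }

  // Growth copies into a new array and publishes it. The old array is freed
  // when the last snapshot referring to it is released.
  // Assumption: only one thread calls append at a time.
  void append(Block* b) {
    auto bigger = std::make_shared<Array>(*snapshot());
    bigger->push_back(b);
    std::lock_guard<std::mutex> guard(_lock);
    _current = std::move(bigger);
  }
};
```

An iterator that called snapshot() before a later append() keeps seeing the old array, which stays valid until that snapshot is dropped.
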
> 
> The use of an array for the sequence of all blocks permits parallel
> iteration to claim ranges of indices using Atomic::add, which can be
> more efficient on some platforms than using cmpxchg loops.  It also
> allows a worker thread to touch only those blocks it is going
> to process, rather than walking a list of blocks.  The only
> complicating factor is that we have to account for possible overshoot
> in a claim attempt.
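
A rough sketch of claiming index ranges with an atomic add, including the overshoot handling mentioned above. This is only an illustration under assumptions: it uses std::atomic rather than HotSpot's Atomic::add, and RangeClaimer, claim, limit and step are hypothetical names, not the ones in the webrev.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the shared iteration state; not the HotSpot type.
struct RangeClaimer {
  std::atomic<std::size_t> next{0};  // next unclaimed index; may exceed limit
  std::size_t limit;                 // number of blocks in the array
  std::size_t step;                  // indices requested per claim attempt

  RangeClaimer(std::size_t limit_, std::size_t step_)
      : limit(limit_), step(step_) {
    assert(step_ > 0);
  }

  // Claims the half-open range [begin, end).
  // Returns false once every index has been claimed.
  bool claim(std::size_t& begin, std::size_t& end) {
    // Pre-check so that late arrivals stop bumping the counter.
    if (next.load(std::memory_order_relaxed) >= limit) return false;
    std::size_t start = next.fetch_add(step, std::memory_order_relaxed);
    if (start >= limit) return false;     // overshoot: nothing left to claim
    begin = start;
    end = std::min(start + step, limit);  // overshoot: clamp the final range
    return true;
  }
};
```

Each worker would loop on claim() and process only the blocks at indices in [begin, end), so it never touches blocks claimed by other workers.
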
> 
> Blocks know their position in the array, to facilitate empty block
> deletion (an empty block might be anywhere in the active array, and we
> don't want to have to search for it).  This also helps with
> allocation_status, eliminating the verification search that was needed
> with the list representation.  allocation_status is now constant-time,
> which directly benefits -Xcheck:jni.
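
To illustrate why a stored index makes both operations constant-time, here is a small sketch. The types and names (Block, ActiveBlocks, active_index) are simplified and hypothetical, and filling the vacated slot with the last block is an assumption for the example, not a claim about what the webrev does.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical simplified types; the names are illustrative only.
struct Block {
  std::size_t active_index = 0;  // this block's position in the active array
};

struct ActiveBlocks {
  std::vector<Block*> blocks;

  void push(Block* b) {
    b->active_index = blocks.size();
    blocks.push_back(b);
  }

  // Constant-time membership check: one indexed load, no search.
  bool contains(const Block* b) const {
    return b->active_index < blocks.size() && blocks[b->active_index] == b;
  }

  // Constant-time removal: move the last block into the vacated slot.
  // Assumption: the order of blocks in the array does not matter.
  void remove(Block* b) {
    assert(contains(b));
    Block* last = blocks.back();
    blocks[b->active_index] = last;
    last->active_index = b->active_index;
    blocks.pop_back();
  }
};
```

A check in the style of allocation_status can then validate a candidate block with contains() instead of walking a list.
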
> 
> A new gtest-based performance demonstration is included. It's not
> really a test, in that it doesn't do any verification.  Rather, it
> performs parallel iteration and reports total time, per-thread times,
> and per-thread percentage of blocks processed.  This is done for a
> variety of thread counts, to show the parallel speedup and load
> balancing.  Running on my dual 6-core Xeon, I'm seeing more or less
> linear speedup for up to 10 threads processing 1M OopStorage entries.
> 
> CR:
> https://bugs.openjdk.java.net/browse/JDK-8200557
> 
> Webrev:
> http://cr.openjdk.java.net/~kbarrett/8200557/open.00/
> 
> Testing:
> jdk-tier{1-3}, hs-tier{1-5}, on all Oracle supported platforms