RFR: 8200557: OopStorage parallel iteration scales poorly
Erik Österlund
erik.osterlund at oracle.com
Thu Apr 26 13:36:31 UTC 2018
Hi Kim,
First off, some high-level comments:
1) I would have simply grabbed the allocation mutex around the reference
counter increment when starting concurrent iteration, instead of using
RCU. The path where we grab this active list for concurrent
iteration is stone cold, so I don't expect there to be any observable
benefit in using RCU here. But I am going to let that pass and just note
we think differently about things here, and I think that is okay.
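For concreteness, the mutex-based alternative I had in mind is roughly
the following sketch (the function name is made up by me; _allocate_mutex,
_active_array and increment_refcount are the names from your patch):

  // Sketch only: retain the active array under the allocation mutex
  // instead of going through an RCU-style protocol.
  OopStorage::BlockArray* OopStorage::acquire_active_array_for_iteration() {
    MutexLockerEx ml(_allocate_mutex, Mutex::_no_safepoint_check_flag);
    BlockArray* array = _active_array;  // read the current array under the lock
    array->increment_refcount();        // keep it alive until iteration releases it
    return array;
  }

That costs one uncontended mutex acquisition on a stone cold path, which I
would not expect to be observable.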
2) It is a bit unfortunate that going down the RCU path led to yet
another implementation of RCU, because GlobalCounter cannot yet satisfy
your scenario. But I think that you have done a good job making the RCU
implementation pluggable in OopStorage, and we can easily remove it once
the ThreadsListHandle required by GlobalCounter starts supporting
lock-free nesting soon. I am speculating that your approach could be
folded into the GlobalCounter to accommodate non-JavaThreads and non-VM
threads, which that solution currently does not support. But perhaps
that is a discussion for later.
Low-level comments:
* Noting that the new RCU mechanism will, like the last one, not work on
PPC until there is an atomic counter increment with more conservative
memory ordering. But I won't blame you for this.
In oopStorage.cpp:
879 size_t OopStorage::block_count() const {
880 WithActiveArray wab(this);
881 return wab.active_array().block_count_acquire();
882 }
Why do you need acquire here? I don't see any subsequent loads of the
array elements, which is what the paired release_store when writing the
counter would protect against.
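To spell out my reasoning, the release_store/load_acquire pairing only
buys something when the reader subsequently loads the published elements,
along these lines (my own illustration, not code from your patch):

  // Writer: publish the new element before bumping the count.
  _blocks[count] = new_block;
  OrderAccess::release_store(&_block_count, count + 1);

  // Reader that needs the acquire: element loads follow the counted load.
  size_t count = OrderAccess::load_acquire(&_block_count);
  for (size_t i = 0; i < count; ++i) {
    do_block(_blocks[i]);  // must observe stores made before the release_store
  }

In block_count() the loaded count is simply returned, so a plain load of
the counter would seem to be enough.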
884 size_t OopStorage::total_memory_usage() const {
885 size_t total_size = sizeof(OopStorage);
886 total_size += strlen(name()) + 1;
887 total_size += sizeof(BlockArray);
888 WithActiveArray wab(this);
889 const BlockArray& blocks = wab.active_array();
890 total_size += blocks.block_count_acquire() * Block::allocation_size();
891 total_size += blocks.size() * sizeof(Block*);
892 return total_size;
893 }
Same as above: what reordering is the block_count_acquire protecting
against? No element of the array is read after the acquire in program
order, as far as I can see.
760 _name(dup_name(name)),
761 _active_array(BlockArray::create(initial_active_array_size)),
762 _allocate_list(&Block::get_allocate_entry),
763 _deferred_updates(NULL),
764 _allocate_mutex(allocate_mutex),
765 _active_mutex(active_mutex),
766 _allocation_count(0),
767 _concurrent_iteration_active(false)
768 {
769 _active_array->increment_refcount();
I wonder if you could make BlockArray::create() always return an already
retained block array, instead of incrementing the refcount afterwards.
496 BlockArray* new_array = BlockArray::create(new_size);
497 if (new_array == NULL) return false;
498 new_array->copy_from(old_array);
499 replace_active_array(new_array);
And the same here, where a block array is created without a retained
reference count, which is then incremented manually inside
replace_active_array. It seems to me it would be easier to make
BlockArray::create() return already retained block arrays.
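Concretely, I am imagining something like this inside create() (sketch
only; allocate_block_array stands in for however create() builds the
array today):

  OopStorage::BlockArray* OopStorage::BlockArray::create(size_t size) {
    BlockArray* array = allocate_block_array(size);  // existing allocation/construction
    if (array != NULL) {
      array->increment_refcount();  // arrays are handed out already retained
    }
    return array;
  }

Then neither the OopStorage constructor nor replace_active_array needs to
remember to retain the result separately.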
Thanks,
/Erik
On 2018-04-19 10:18, Kim Barrett wrote:
> Please review this change to OopStorage parallel iteration to improve
> the scaling with additional threads.
>
> Two sources of poor scaling were found: (1) contention when claiming
> blocks, and (2) each worker thread ended up touching the majority of
> the blocks, even those not processed by that thread.
>
> To address this, we changed the representation of the sequence of all
> blocks. Rather than being a doubly-linked intrusive list linked
> through the blocks, it is now an array of pointers to blocks. We use
> a combination of refcounts and an RCU-inspired mechanism to safely
> manage the array storage when it needs to grow, avoiding the need to
> lock access to the array while performing concurrent iteration.
>
> The use of an array for the sequence of all blocks permits parallel
> iteration to claim ranges of indices using Atomic::add, which can be
> more efficient on some platforms than using cmpxchg loops. It also
> allows a worker thread to only touch exactly those blocks it is going
> to process, rather than walking a list of blocks. The only
> complicating factor is that we have to account for possible overshoot
> in a claim attempt.
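(Interjecting for concreteness: the claim-with-overshoot scheme, as I read
the description, amounts to roughly the following; the names are mine and
not from the patch.)

  // Each worker claims [start, start + step) with a single atomic add.
  size_t limit = array->block_count();
  size_t start = Atomic::add(step, &_next_claim) - step;  // old cursor value
  if (start >= limit) return;                  // the add overshot the end entirely
  size_t end = MIN2(start + step, limit);      // clamp a partial overshoot
  for (size_t i = start; i < end; ++i) {
    do_block(array->at(i));
  }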
>
> Blocks know their position in the array, to facilitate empty block
> deletion (an empty block might be anywhere in the active array, and we
> don't want to have to search for it). This also helps with
> allocation_status, eliminating the verification search that was needed
> with the list representation. allocation_status is now constant-time,
> which directly benefits -Xcheck:jni.
>
> A new gtest-based performance demonstration is included. It's not
> really a test, in that it doesn't do any verification. Rather, it
> performs parallel iteration and reports total time, per-thread times,
> and per-thread percentage of blocks processed. This is done for a
> variety of thread counts, to show the parallel speedup and load
> balancing. Running on my dual 6 core Xeon, I'm seeing more or less
> linear speedup for up to 10 threads processing 1M OopStorage entries.
>
> CR:
> https://bugs.openjdk.java.net/browse/JDK-8200557
>
> Webrev:
> http://cr.openjdk.java.net/~kbarrett/8200557/open.00/
>
> Testing:
> jdk-tier{1-3}, hs-tier{1-5}, on all Oracle supported platforms
>