RFR: 8200557: OopStorage parallel iteration scales poorly
Erik Österlund
erik.osterlund at oracle.com
Thu Apr 26 13:36:31 UTC 2018
Hi Kim,
First off, some high-level comments:
1) I would have simply grabbed the allocation mutex around the reference
counter increment when starting concurrent iteration, instead of using
RCU. The path where we grab this active list for concurrent
iteration is stone cold, so I don't expect there to be any observable
benefit in using RCU here. But I am going to let that pass and just note
we think differently about things here, and I think that is okay.
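For concreteness, the mutex-based alternative I had in mind is roughly
the following sketch (the function name is made up by me; _allocate_mutex,
_active_array and increment_refcount are the names from your patch):

  // Sketch only: retain the active array under the allocation mutex
  // instead of going through an RCU-style protocol.
  OopStorage::BlockArray* OopStorage::acquire_active_array_for_iteration() {
    MutexLockerEx ml(_allocate_mutex, Mutex::_no_safepoint_check_flag);
    BlockArray* array = _active_array;  // read the current array under the lock
    array->increment_refcount();        // keep it alive until iteration releases it
    return array;
  }

That costs one uncontended mutex acquisition on a stone cold path, which I
would not expect to be observable.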
2) It is a bit unfortunate that going down the RCU path led to yet
another implementation of RCU, because GlobalCounter cannot yet satisfy
your scenario. But I think that you have done a good job making the RCU
implementation pluggable in OopStorage, and we can easily remove it once
the ThreadsListHandle required by GlobalCounter starts supporting
lock-free nesting soon. I am speculating that your approach could be
folded into the GlobalCounter to accommodate non-JavaThreads and non-VM
threads, which that solution currently does not support. But perhaps
that is a discussion for later.
Low-level comments:
* Noting that the new RCU mechanism will, like the last one, not work on
PPC until there is an atomic counter increment with more conservative
memory ordering. But I won't blame you for this.
In oopStorage.cpp:
879 size_t OopStorage::block_count() const {
880 WithActiveArray wab(this);
881 return wab.active_array().block_count_acquire();
882 }
Why do you need acquire here? I don't see any subsequent loads of the
array elements, which is what the paired release_store when writing the
counter would protect against.
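To spell out my reasoning, the release_store/load_acquire pairing only
buys something when the reader subsequently loads the published elements,
along these lines (my own illustration, not code from your patch):

  // Writer: publish the new element before bumping the count.
  _blocks[count] = new_block;
  OrderAccess::release_store(&_block_count, count + 1);

  // Reader that needs the acquire: element loads follow the counted load.
  size_t count = OrderAccess::load_acquire(&_block_count);
  for (size_t i = 0; i < count; ++i) {
    do_block(_blocks[i]);  // must observe stores made before the release_store
  }

In block_count() the loaded count is simply returned, so a plain load of
the counter would seem to be enough.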
884 size_t OopStorage::total_memory_usage() const {
885 size_t total_size = sizeof(OopStorage);
886 total_size += strlen(name()) + 1;
887 total_size += sizeof(BlockArray);
888 WithActiveArray wab(this);
889 const BlockArray& blocks = wab.active_array();
890 total_size += blocks.block_count_acquire() * Block::allocation_size();
891 total_size += blocks.size() * sizeof(Block*);
892 return total_size;
893 }
Same as above: what reordering is the block_count_acquire protecting
against? No element of the array is read after the acquire in program
order, as far as I can see.
760 _name(dup_name(name)),
761 _active_array(BlockArray::create(initial_active_array_size)),
762 _allocate_list(&Block::get_allocate_entry),
763 _deferred_updates(NULL),
764 _allocate_mutex(allocate_mutex),
765 _active_mutex(active_mutex),
766 _allocation_count(0),
767 _concurrent_iteration_active(false)
768 {
769 _active_array->increment_refcount();
I wonder if you could make BlockArray::create() always return an already
retained block array, instead of incrementing the refcount afterwards.
496 BlockArray* new_array = BlockArray::create(new_size);
497 if (new_array == NULL) return false;
498 new_array->copy_from(old_array);
499 replace_active_array(new_array);
And the same here, where a block array is created without a retained
reference count, which is then incremented manually inside
replace_active_array. It seems to me it would be easier to make
BlockArray::create() return already retained block arrays.
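Concretely, I am imagining something like this inside create() (sketch
only; allocate_block_array stands in for however create() builds the
array today):

  OopStorage::BlockArray* OopStorage::BlockArray::create(size_t size) {
    BlockArray* array = allocate_block_array(size);  // existing allocation/construction
    if (array != NULL) {
      array->increment_refcount();  // arrays are handed out already retained
    }
    return array;
  }

Then neither the OopStorage constructor nor replace_active_array needs to
remember to retain the result separately.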
Thanks,
/Erik
On 2018-04-19 10:18, Kim Barrett wrote:
> Please review this change to OopStorage parallel iteration to improve
> the scaling with additional threads.
>
> Two sources of poor scaling were found: (1) contention when claiming
> blocks, and (2) each worker thread ended up touching the majority of
> the blocks, even those not processed by that thread.
>
> To address this, we changed the representation of the sequence of all
> blocks. Rather than being a doubly-linked intrusive list linked
> through the blocks, it is now an array of pointers to blocks. We use
> a combination of refcounts and an RCU-inspired mechanism to safely
> manage the array storage when it needs to grow, avoiding the need to
> lock access to the array while performing concurrent iteration.
>
> The use of an array for the sequence of all blocks permits parallel
> iteration to claim ranges of indices using Atomic::add, which can be
> more efficient on some platforms than using cmpxchg loops. It also
> allows a worker thread to only touch exactly those blocks it is going
> to process, rather than walking a list of blocks. The only
> complicating factor is that we have to account for possible overshoot
> in a claim attempt.
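(Interjecting for concreteness: the claim-with-overshoot scheme, as I read
the description, amounts to roughly the following; the names are mine and
not from the patch.)

  // Each worker claims [start, start + step) with a single atomic add.
  size_t limit = array->block_count();
  size_t start = Atomic::add(step, &_next_claim) - step;  // old cursor value
  if (start >= limit) return;                  // the add overshot the end entirely
  size_t end = MIN2(start + step, limit);      // clamp a partial overshoot
  for (size_t i = start; i < end; ++i) {
    do_block(array->at(i));
  }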
>
> Blocks know their position in the array, to facilitate empty block
> deletion (an empty block might be anywhere in the active array, and we
> don't want to have to search for it). This also helps with
> allocation_status, eliminating the verification search that was needed
> with the list representation. allocation_status is now constant-time,
> which directly benefits -Xcheck:jni.
>
> A new gtest-based performance demonstration is included. It's not
> really a test, in that it doesn't do any verification. Rather, it
> performs parallel iteration and reports total time, per-thread times,
> and per-thread percentage of blocks processed. This is done for a
> variety of thread counts, to show the parallel speedup and load
> balancing. Running on my dual 6 core Xeon, I'm seeing more or less
> linear speedup for up to 10 threads processing 1M OopStorage entries.
>
> CR:
> https://bugs.openjdk.java.net/browse/JDK-8200557
>
> Webrev:
> http://cr.openjdk.java.net/~kbarrett/8200557/open.00/
>
> Testing:
> jdk-tier{1-3}, hs-tier{1-5}, on all Oracle supported platforms
>