RFR: 8301116: Parallelize TLAB resizing in G1

Thu Feb 2 09:55:25 UTC 2023

On Thu, 2 Feb 2023 09:39:32 GMT, Thomas Schatzl <tschatzl at openjdk.org> wrote:

>> To be more exact, the removal of the const happens in `g1CollectedHeap.inline.hpp:124`:
>> 
>> inline JavaThread* const* G1JavaThreadsListClaimer::claim(uint& count) {
>>   count = 0;
>>   if (Atomic::load(&_cur_claim) >= _list.length()) {
>>     return nullptr;
>>   }
>>   uint claim = Atomic::fetch_and_add(&_cur_claim, _claim_step);
>>   if (claim >= _list.length()) {
>>     return nullptr;
>>   }
>>   count = MIN2(_list.length() - claim, _claim_step);
>>   return _list.list()->threads() + claim;                         <--- here
>> }
>> 
>> because of the mentioned access in `g1YoungGCPostEvacuateTasks.cpp:720`.
>
> Using `Threads::possibly_parallel_oops_do`, thread iteration/claiming seems to be the bottleneck, i.e. claiming the token.
> 
> One issue is that every thread has to traverse the array from the beginning to reach the current claim position, so the minimum processing time for a single thread is the time to iterate over the complete thread array and check every token.  That matters less when the work per thread is large (as when walking the stacks for oops), but even then I think it would be noticeable; I need to check.
> 
> The other is that with little work per thread, as in this situation, the threads seem to contend heavily on the JavaThread claim tokens, so this "parallelization" is quite a pessimization: I measured it as ~4x slower with 18 threads than the single-threaded version (on ~21k JavaThreads).

It looks like this is a leftover from some older version (the `G1JavaThreadClaimer` is fairly "new") - I will remove this change.
Thanks for making me look at this again in detail!
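
For readers following along, here is a minimal standalone sketch of the chunked claiming scheme quoted above, contrasted with per-thread token claiming: each worker bumps one shared index and receives a disjoint range, so nobody rescans the array from the start. This is not HotSpot code. `ChunkClaimer` and `visit_all` are hypothetical names, and `std::atomic` stands in for HotSpot's `Atomic` purely for illustration.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the claimer quoted above: hands out disjoint
// chunks of an index range by bumping one shared counter.
class ChunkClaimer {
  std::atomic<std::size_t> _cur_claim{0};
  const std::size_t _length;
  const std::size_t _claim_step;

public:
  ChunkClaimer(std::size_t length, std::size_t claim_step)
    : _length(length), _claim_step(claim_step) {
    assert(claim_step > 0);
  }

  // Returns the start index of the claimed chunk and sets count to its size.
  // When nothing is left, count is 0 and the return value is the length.
  std::size_t claim(std::size_t& count) {
    count = 0;
    if (_cur_claim.load() >= _length) {
      return _length;  // cheap early exit without modifying the counter
    }
    std::size_t start = _cur_claim.fetch_add(_claim_step);
    if (start >= _length) {
      return _length;  // another worker took the last chunk first
    }
    count = std::min(_length - start, _claim_step);
    return start;
  }
};

// Runs num_workers threads over n elements and returns how often each
// element was visited. With correct claiming every entry is exactly 1.
std::vector<int> visit_all(std::size_t n, std::size_t step, unsigned num_workers) {
  std::vector<int> visits(n, 0);
  ChunkClaimer claimer(n, step);
  std::vector<std::thread> workers;
  for (unsigned w = 0; w < num_workers; w++) {
    workers.emplace_back([&visits, &claimer] {
      std::size_t count = 0;
      for (;;) {
        std::size_t start = claimer.claim(count);
        if (count == 0) {
          break;
        }
        // Chunks are disjoint, so plain (non-atomic) writes are safe here.
        for (std::size_t i = start; i < start + count; i++) {
          visits[i]++;
        }
      }
    });
  }
  for (std::thread& t : workers) {
    t.join();
  }
  return visits;
}
```

In this sketch the only shared write is one `fetch_add` per chunk, so contention scales with the number of chunks rather than with the number of elements; a larger step trades load balance for fewer atomic operations.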

-------------

PR: https://git.openjdk.org/jdk/pull/12360