RFR: 8309404: Parallel: Process class loader data graph in parallel in young gc

Wed Jun 7 09:13:54 UTC 2023

On Tue, 6 Jun 2023 09:03:47 GMT, Thomas Schatzl <tschatzl at openjdk.org> wrote:

>> Hi all,
>> 
>> This patch parallelizes the processing of the class loader data graph in young gc.
>> 
>> The class `ClassLoaderData` has a field `_claim` to avoid applying an oop closure more than once,
>> and the method `ClassLoaderData::oops_do` can check whether the CLD has already been claimed.
>> The parallel full gc already uses them in `MarkFromRootsTask::work` and `PSAdjustTask::work`.
>> 
>> But I don't have experience testing or verifying GC performance improvements.
>> If this patch needs such test data before integrating, please guide and help me here.
>> 
>> Thanks for the review and guidance.
>> 
>> Best Regards,
>> -- Guoxiong
>
> This is not parallelized in STW pauses because the CLD data structure is not amenable to parallelization at all. It is basically a linked list, and when parallelizing it this way,
> * every thread visits every CLD anyway (doing the pointer chasing)
> * threads are massively choking themselves on obtaining the claim value
> * the code adds another pass through the CLD linked list clearing the claim marks
> In my experience you *will* get significant negative scaling (i.e. that phase taking significant multiples of the original time) with this simple approach.
> 
> I.e. this has been analyzed before, see https://bugs.openjdk.org/browse/JDK-8030144; only Shenandoah does a parallel walk, but with a limited number of threads (e.g. https://bugs.openjdk.org/browse/JDK-8246097)
> 
> The latter CR also provides some applications with apparently many CLDG entries (Spring Boot and CLion), which may be used for this investigation.
> FWIW, it may be easier to start this investigation with G1 as it has the timing logging already implemented (but it is no problem to add this to Parallel temporarily(?)).

@tschatzl I have a question. When the roots are scanned in parallel, all the worker threads also have to update the claim tokens of the threads atomically. Does that code have the same problem? Do you know of any research on it, or have you done any?

// file threads.cpp
void Threads::possibly_parallel_threads_do(bool is_par, ThreadClosure* tc) {
  assert_at_safepoint();

  uintx claim_token = Threads::thread_claim_token();
  ALL_JAVA_THREADS(p) {
    if (p->claim_threads_do(is_par, claim_token)) {
      tc->do_thread(p);
    }
  }
  for (NonJavaThread::Iterator njti; !njti.end(); njti.step()) {
    Thread* current = njti.current();
    if (current->claim_threads_do(is_par, claim_token)) {
      tc->do_thread(current);
    }
  }
}

// file thread.cpp
bool Thread::claim_par_threads_do(uintx claim_token) {
  uintx token = _threads_do_token;
  if (token != claim_token) {
    uintx res = Atomic::cmpxchg(&_threads_do_token, token, claim_token);
    if (res == token) {
      return true;
    }
    guarantee(res == claim_token, "invariant");
  }
  return false;
}
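
To make the comparison concrete, here is a minimal standalone sketch of the claim pattern the two cases seem to share. It is not HotSpot code: `Node`, `try_claim`, `WalkStats` and `parallel_walk` are hypothetical names, and `std::atomic` stands in for `Atomic::cmpxchg`. Every worker walks the whole list and tries to claim each node with one CAS, so each node is processed exactly once while the total number of node visits grows with the number of workers.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for a claimable list element (a thread or a CLD).
struct Node {
  std::atomic<std::uintptr_t> token{0};  // last claim token written to this node
  std::atomic<int> processed{0};         // how many times the closure ran here
  Node* next = nullptr;
};

// Same shape as the claim step above: skip if already claimed, else one CAS.
bool try_claim(Node& n, std::uintptr_t claim_token) {
  std::uintptr_t seen = n.token.load(std::memory_order_relaxed);
  if (seen == claim_token) {
    return false;  // another worker already claimed this node
  }
  if (n.token.compare_exchange_strong(seen, claim_token)) {
    return true;   // this worker won the race
  }
  // On failure, compare_exchange_strong stores the observed value in 'seen'.
  assert(seen == claim_token && "invariant");
  return false;
}

struct WalkStats {
  std::size_t visits;  // nodes touched, summed over all workers
  std::size_t claims;  // successful claims, summed over all workers
};

// Assumption: all nodes start with a token different from claim_token.
WalkStats parallel_walk(std::vector<Node>& nodes, unsigned workers,
                        std::uintptr_t claim_token) {
  // Link the nodes into a singly linked list.
  for (std::size_t i = 0; i + 1 < nodes.size(); i++) {
    nodes[i].next = &nodes[i + 1];
  }
  Node* head = nodes.empty() ? nullptr : &nodes[0];

  std::atomic<std::size_t> visits{0};
  std::atomic<std::size_t> claims{0};
  std::vector<std::thread> pool;
  for (unsigned w = 0; w < workers; w++) {
    pool.emplace_back([&, head] {
      std::size_t v = 0;
      std::size_t c = 0;
      // Each worker chases every pointer, whether or not it wins the claim.
      for (Node* n = head; n != nullptr; n = n->next) {
        v++;
        if (try_claim(*n, claim_token)) {
          n->processed.fetch_add(1);  // stands in for the closure call
          c++;
        }
      }
      visits += v;
      claims += c;
    });
  }
  for (std::thread& t : pool) {
    t.join();
  }
  return WalkStats{visits.load(), claims.load()};
}
```

With 4 workers and 1000 nodes, this sketch performs 4000 node visits for 1000 successful claims. It only shows the access pattern and says nothing about real timings, which would depend on the list length and on how much work the closure does per node.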

-------------

PR Comment: https://git.openjdk.org/jdk/pull/14297#issuecomment-1580259670