RFR: 8309404: Parallel: Process class loader data graph in parallel in young gc

Wed Jun 7 10:27:55 UTC 2023

On Tue, 6 Jun 2023 09:03:47 GMT, Thomas Schatzl <tschatzl at openjdk.org> wrote:

>> Hi all,
>> 
>> This patch parallelizes the processing of the class loader data graph in young GC.
>> 
>> The class `ClassLoaderData` has a field `_claim` to avoid applying an oop closure more than once,
>> and the method `ClassLoaderData::oops_do` can check whether the CLD has already been claimed.
>> Parallel full GC already uses them in `MarkFromRootsTask::work` and `PSAdjustTask::work`.
>> 
>> However, I don't have experience testing or verifying GC performance improvements.
>> If this patch needs such test data before integration, please guide and help me here.
>> 
>> Thanks for the review and guidance.
>> 
>> Best Regards,
>> -- Guoxiong
>
> The reason this is not parallelized in STW pauses is that the CLD data structure is not amenable to parallelization at all. It is basically a linked list, and when parallelizing it this way,
> * every thread visits every CLD anyway (doing the pointer chasing)
> * threads contend heavily with each other when obtaining the claim value
> * the code adds another pass through the CLD linked list to clear the claim marks
>
> In my experience you *will* get significant negative scaling (i.e. that phase taking significant multiples of the original time) with this simple approach.
> 
> This has been analyzed before, see https://bugs.openjdk.org/browse/JDK-8030144; only Shenandoah does a parallel walk, but with a limited number of threads (see e.g. https://bugs.openjdk.org/browse/JDK-8246097).
> 
> The latter CR also provides some applications with apparently many CLDG entries (Spring Boot and CLion), which may be used for this investigation.
> Fwiw, it may be easier to start this investigation with G1 as it already has the timing logging implemented (but it is no problem to add this to Parallel temporarily(?)).

> @tschatzl I have a question. When searching the roots in parallel, the claim values of the threads also need to be changed atomically by all the worker threads. Does that have the same problem? Do you know of, or have you done, any research related to it?
> 

It generally has the same problem, but it is mitigated because the work per Thread is much larger (a Thread typically has many stack frames to look through), so the relative impact is smaller. The contention is much lower too: the probability that multiple threads try to claim the same claim value at the same time is lower because each task takes longer, and more variable, time to process.
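
To make the per-item claiming pattern concrete, here is a minimal standalone C++ sketch. It is not HotSpot code: `Node`, `walk_and_claim` and `run_per_item_claiming` are hypothetical names, and a plain `std::atomic<bool>` stands in for the real claim value. It shows that every worker still walks the whole list and issues one atomic operation per item, whether or not it wins the claim.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a list element such as a CLD or a Java thread.
struct Node {
    std::atomic<bool> claimed{false};  // simplified claim mark
    int processed_by = -1;             // id of the worker that won the claim
    Node* next = nullptr;
};

// Every worker chases all the pointers; only the CAS winner processes a node.
// Returns the number of nodes this worker actually processed.
std::size_t walk_and_claim(Node* head, int worker_id) {
    std::size_t processed = 0;
    for (Node* n = head; n != nullptr; n = n->next) {
        bool expected = false;
        // One contended atomic operation per node, per worker.
        if (n->claimed.compare_exchange_strong(expected, true)) {
            n->processed_by = worker_id;  // the "real work" would happen here
            ++processed;
        }
    }
    return processed;
}

// Builds a list of `count` nodes, lets `workers` threads race over it,
// and returns the total number of nodes processed across all workers.
std::size_t run_per_item_claiming(std::size_t count, int workers) {
    assert(workers > 0);
    std::vector<Node> nodes(count);
    for (std::size_t i = 0; i + 1 < count; ++i) {
        nodes[i].next = &nodes[i + 1];
    }
    Node* head = count > 0 ? &nodes[0] : nullptr;

    std::vector<std::size_t> per_worker(static_cast<std::size_t>(workers), 0);
    std::vector<std::thread> threads;
    for (int w = 0; w < workers; ++w) {
        threads.emplace_back([&per_worker, head, w] {
            per_worker[static_cast<std::size_t>(w)] = walk_and_claim(head, w);
        });
    }
    for (auto& t : threads) {
        t.join();
    }

    std::size_t total = 0;
    for (std::size_t c : per_worker) {
        total += c;
    }
    return total;
}
```

Each node is processed exactly once, so the total equals the list length regardless of the worker count. When the per-node work is tiny, the walk and the CAS traffic dominate; when it is large, as with thread stacks, they are amortized.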

There is also a newer API available that allows claiming with less contention, see e.g. `G1JavaThreadsListClaimer` and its use in `G1PreEvacuateCollectionSetBatchTask`; however, at this point it is only used in a few places. Replacing the existing claiming with it isn't trivial.
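
As a rough illustration of the general idea behind lower-contention claiming, the sketch below claims a whole range of items with a single atomic add instead of one atomic operation per item. It is not the HotSpot implementation: `ChunkClaimer` and `run_chunked_claiming` are hypothetical names, and the sketch assumes the items can be addressed by index.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical chunk claimer over `length` index-addressable items.
class ChunkClaimer {
    std::atomic<std::size_t> cur_{0};
    const std::size_t length_;
    const std::size_t step_;

public:
    ChunkClaimer(std::size_t length, std::size_t step)
        : length_(length), step_(step) {
        assert(step > 0);
    }

    // Claims the next chunk [begin, end). Returns false once nothing is left.
    bool claim(std::size_t& begin, std::size_t& end) {
        // Cheap check so exhausted workers stop bumping the shared counter.
        if (cur_.load(std::memory_order_relaxed) >= length_) {
            return false;
        }
        begin = cur_.fetch_add(step_);  // one atomic op per chunk, not per item
        if (begin >= length_) {
            return false;
        }
        end = std::min(begin + step_, length_);
        return true;
    }
};

// Lets `workers` threads process `length` items in chunks of `step` and
// returns how often each item was visited.
std::vector<int> run_chunked_claiming(std::size_t length, std::size_t step,
                                      int workers) {
    assert(workers > 0);
    std::vector<int> visits(length, 0);
    ChunkClaimer claimer(length, step);

    std::vector<std::thread> threads;
    for (int w = 0; w < workers; ++w) {
        threads.emplace_back([&visits, &claimer] {
            std::size_t begin = 0;
            std::size_t end = 0;
            while (claimer.claim(begin, end)) {
                // Chunks are disjoint, so no two threads touch the same slot.
                for (std::size_t i = begin; i < end; ++i) {
                    ++visits[i];
                }
            }
        });
    }
    for (auto& t : threads) {
        t.join();
    }
    return visits;
}
```

With a chunk size of `step`, the number of contended atomic operations drops by roughly that factor, and a worker only touches the items it has claimed.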

-------------

PR Comment: https://git.openjdk.org/jdk/pull/14297#issuecomment-1580439963