RFR: 8251570: 8215624 causes assert(worker_id <' _n_workers) failed: Invalid worker_id

Fri Aug 14 18:25:37 UTC 2020

Hi all,

Further manual testing shows that this patch asserts when run with:
makec ../build/fastdebug-shenandoah/ test 
TEST="java/util/logging/TestLoggerWeakRefLeak.java" 
JTREG="JAVA_OPTIONS=-XX:+UseG1GC -XX:ParallelGCThreads=1"

#  guarantee(num_workers <= total_workers()) failed: Trying to execute 
task Iterating heap with 12 workers which is more than the amount of 
total workers 1.

V  [libjvm.so+0x19cf977]  WorkGang::run_task(AbstractGangTask*, unsigned 
int, bool)+0x457
V  [libjvm.so+0xcbf38a]  HeapInspection::populate_table(KlassInfoTable*, 
BoolObjectClosure*, unsigned int)+0x60a
V  [libjvm.so+0xcbf91c]  HeapInspection::heap_inspection(outputStream*, 
unsigned int)+0xcc
V  [libjvm.so+0xc44fa5]  VM_GC_HeapInspection::doit()+0x45

My patch fixes the problem when the heap inspection code decides to use 
fewer threads than the available worker threads, but it doesn't fix the 
case when the heap inspection tries to use more threads.

Without this patch, the code silently reverts to using few worker 
threads than the heap inspection requested. With the patch, we try to 
run the number of threads requested by the heap inspection code, and 
asserts that it won't work.

There needs to be some sort of agreement between the HeapInspection and 
the GC about how many threads to run.

I've taken a step back and completely removed the 
CollectedHeap::run_task code, since it's too fragile to use. Instead I 
let callers that want to fiddle with the number of worker threads, have 
to deal with the added complexity.

https://cr.openjdk.java.net/~stefank/8251570/webrev.02.delta/
https://cr.openjdk.java.net/~stefank/8251570/webrev.02/

You can almost see some traces of this problem in 
SafepointSynchronize::do_cleanup_tasks(), which can't simply use a 
CollectedHeap::run_task call because of the way ParallelSPCleanupTask 
needs to know the number of worker threads. The heap inspection problem 
is similar, but worse since it wants to decide for itself the number of 
worker threads, instead of relying on the current active number of workers.

This patch is increasingly becoming more and more GC agnostic, so I 
think it would be good if serviceability devs take a close look at this 
as well.

Note that even if the user tries to specify the number of threads to use 
for the parallel heap inspection, there's no guarantee at all that we'll 
use that many threads. I didn't change that with this patch, but I'm not 
sure if that was apparent for the authors and reviewers of 8215624.

I've only done limited testing on this patch on my computer. I'm 
currently running more extended testing.

Thanks,
StefanK

On 2020-08-14 16:40, Stefan Karlsson wrote:
> Hi all,
> 
> Please review this patch to fix a newly introduced bug where there can 
> be a mismatch between the number of parallel heap inspection threads and 
> the number of used GC threads
> 
> https://cr.openjdk.java.net/~stefank/8251570/webrev.01/
> https://bugs.openjdk.java.net/browse/JDK-8251570
> 
> The assert triggers because of this code:
>       ParallelObjectIterator* poi = 
> Universe::heap()->parallel_object_iterator(parallel_thread_num);
>       if (poi != NULL) {
>         ParHeapInspectTask task(poi, cit, filter);
>        Universe::heap()->run_task(&task);
> 
> G1's parallel_object_iterator sets up a HeapRegionClaimer with 
> parallel_thread_num, while run_task uses active_workers() of the used 
> WorkGang.
> 
> The original patch also has another, latent bug. Both ZGC and Shenandoah 
> uses the wrong WorkGang in run_task. Currently, non of those GCs provide 
> a parallel_object_iterator, so the code isn't executed ATM.
> 
> The proposed fix for this is to simplify the code and introduce a 
> concrete function CollectedHeap::run_task_at_safepoint that takes the 
> number of requested workers as a parameter, and dispatches the task 
> through the CollectedHeap::get_safepoint_workers() workers (if provided 
> by the GC).
> 
> Testing:
> - Locally with the failing tests and all GCs
> - tier1-tier5 on Linux
> 
> Thanks,
> StefanK