RFR: 8306738: Select num workers for safepoint ParallelCleanupTask [v2]
Axel Boldt-Christmas
aboldtch at openjdk.org
Thu Apr 27 08:01:04 UTC 2023
On Tue, 25 Apr 2023 20:37:48 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:
> I still believe we can do it significantly better! The key here is not only minimizing the parallelism beyond the number of tasks, but also recognize that most tasks are so light-weight, they do not need a separate worker thread. In fact, if we estimate we execute _only_ the light-weight tasks, we might as well do things serially without any thread handoff.
>
> For example, my crude experiment (patch below) cuts down the safepoint cleanup time from ~40 us to ~1 us on my M1 Mac running G1 with `while (true) sink = new int[10000]` with 1 GB heap. The cleanup time outliers up to 200us are present in current baseline, completely gone with the patch. The whole G1 collection safepoint takes ~700 us in that experiment, so that improvement is quite palpable: before the patch, the cleanup took about 5% of total safepoint time on average and up to 30% in spikes.
>
> Pretty sure we can polish it even further.
>
> ```
> diff --git a/src/hotspot/share/runtime/safepoint.cpp b/src/hotspot/share/runtime/safepoint.cpp
> index 2ff593a0143..64dcd0202ae 100644
> --- a/src/hotspot/share/runtime/safepoint.cpp
> +++ b/src/hotspot/share/runtime/safepoint.cpp
> @@ -547,6 +547,19 @@ public:
> _do_lazy_roots(!VMThread::vm_operation()->skip_thread_oop_barriers() &&
> Universe::heap()->uses_stack_watermark_barrier()) {}
>
> + size_t expected_workers() {
> + // This worker does the light-weight tasks and guarantees progress
> + size_t workers = 1;
> +
> + // Potential heavy-weights, allocate another worker for them
> + // TODO: InlineCacheBuffer should be here too.
> + if (_do_lazy_roots) {
> + workers++;
> + }
> +
> + return workers;
> + }
> +
> void work(uint worker_id) {
> // These tasks are ordered by relative length of time to execute so that potentially longer tasks start first.
> if (_subtasks.try_claim_task(SafepointSynchronize::SAFEPOINT_CLEANUP_SYMBOL_TABLE_REHASH)) {
> @@ -601,9 +614,11 @@ void SafepointSynchronize::do_cleanup_tasks() {
> assert(heap != nullptr, "heap not initialized yet?");
> ParallelCleanupTask cleanup;
> WorkerThreads* cleanup_workers = heap->safepoint_workers();
> - if (cleanup_workers != nullptr) {
> + size_t num_workers = cleanup.expected_workers();
> + if (num_workers > 1 && cleanup_workers != nullptr) {
> // Parallel cleanup using GC provided thread pool.
> - cleanup_workers->run_task(&cleanup);
> + num_workers = MIN2<uint>(cleanup_workers->active_workers(), num_workers);
> + cleanup_workers->run_task(&cleanup, num_workers);
> } else {
> // Serial cleanup using VMThread.
> cleanup.work(0);
> ```
>
> This could be done in a separate PR, of course, but I don't see why not test some approaches here :)
Sounds promising. I can look into it in this PR.
We essentially want to determine which tasks are cheaper to run directly than the cost of spinning up worker threads, and when the parallel speedup amortises that spin-up cost.
The comment `// These tasks are ordered by relative length of time to execute so that potentially longer tasks start first.` gives the order:
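To frame that trade-off, here is a rough break-even sketch (the cost numbers and the `should_parallelize` helper are hypothetical, purely to illustrate the amortisation argument, not anything in HotSpot):

```cpp
#include <cstddef>

// Hypothetical cost model: parallel execution pays a fixed worker handoff
// cost, then divides the remaining work across workers. Parallelism only
// wins when the estimated serial time exceeds handoff plus parallel time.
bool should_parallelize(size_t serial_us, size_t handoff_us, size_t num_workers) {
  if (num_workers <= 1) {
    return false;
  }
  size_t parallel_us = handoff_us + serial_us / num_workers;
  return parallel_us < serial_us;
}
```

With a ~40 us handoff (the order of magnitude seen in the experiment above), a cleanup estimated at 10 us is better done serially on the VMThread, while one estimated at 400 us is worth handing to workers.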
1. `SAFEPOINT_CLEANUP_SYMBOL_TABLE_REHASH`
2. `SAFEPOINT_CLEANUP_STRING_TABLE_REHASH`
3. `SAFEPOINT_CLEANUP_LAZY_ROOT_PROCESSING`
4. `SAFEPOINT_CLEANUP_UPDATE_INLINE_CACHES`
5. `SAFEPOINT_CLEANUP_REQUEST_OOPSTORAGE_CLEANUP`
Given that you @shipilev rank 3 and 4 as heavier tasks than 1 and 2, while the original developer had a different estimation, it would probably be better to have a more dynamic estimator for the actual work of each task.
But it is probably worthwhile here to have the `ParallelCleanupTask` query the subsystems at creation, figure out how many tasks need to be performed before we launch anything, and then start workers based on that count. It is likely that in most cases fewer than two cleanup tasks need to be performed and worker threads are unnecessary. It also makes it possible for the `task needs cleanup` property to be stable over the lifetime of the `ParallelCleanupTask`.
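A sketch of that query-at-creation idea (the `has_*_work()` predicates are stand-ins for the real subsystem queries, stubbed out here so the example is self-contained; the class and threshold are illustrative, not a proposed patch):

```cpp
#include <cstddef>

// Stand-ins for subsystem queries such as table-rehash or lazy-root checks.
// Stubbed with fixed answers purely for illustration.
static bool has_symbol_table_work()  { return false; }
static bool has_string_table_work()  { return false; }
static bool has_lazy_root_work()     { return true;  }
static bool has_inline_cache_work()  { return false; }
static bool has_oopstorage_work()    { return true;  }

class ParallelCleanupTaskSketch {
  static const int NUM_TASKS = 5;
  bool   _needed[NUM_TASKS];
  size_t _num_needed;
public:
  // Query each subsystem exactly once, at task creation, so the
  // "task needs cleanup" answers stay stable for the task's lifetime.
  ParallelCleanupTaskSketch() : _num_needed(0) {
    const bool queries[NUM_TASKS] = {
      has_symbol_table_work(), has_string_table_work(),
      has_lazy_root_work(), has_inline_cache_work(),
      has_oopstorage_work()
    };
    for (int i = 0; i < NUM_TASKS; i++) {
      _needed[i] = queries[i];
      if (queries[i]) {
        _num_needed++;
      }
    }
  }
  size_t num_needed() const { return _num_needed; }
  // Run serially on the VMThread unless enough tasks are pending
  // to amortise the worker handoff.
  bool use_workers() const { return _num_needed >= 2; }
};
```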
Also, if we have cleanup tasks that should always be lumped together into one parallel task, we should probably add some `enum SafepointParallelCleanupTasks` and claim based on those.
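Something along these lines, grouping the light-weight subtasks under a single claimable unit (the enum values and grouping are a hypothetical example; the claim mechanism mimics the spirit of `SubTasksDone::try_claim_task` with a plain atomic flag):

```cpp
#include <atomic>

// Hypothetical coarser grouping: cheap request-style subtasks that should
// always run together claim one enum value instead of individual subtasks.
enum SafepointParallelCleanupTasks {
  CLEANUP_LIGHTWEIGHT,    // rehash requests + oopstorage cleanup request
  CLEANUP_LAZY_ROOTS,     // potentially heavy
  CLEANUP_INLINE_CACHES,  // potentially heavy
  CLEANUP_NUM_TASKS
};

// Minimal claiming: the first worker to flip a group's flag runs
// everything in that group; later workers move on.
class TaskClaimer {
  std::atomic<bool> _claimed[CLEANUP_NUM_TASKS];
public:
  TaskClaimer() {
    for (int i = 0; i < CLEANUP_NUM_TASKS; i++) {
      _claimed[i].store(false);
    }
  }
  bool try_claim(SafepointParallelCleanupTasks t) {
    return !_claimed[t].exchange(true);
  }
};
```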
So we need to investigate the cost of each operation, the possibility (and granularity) of estimating the work of each task, and when the cost of spinning up worker threads is amortised by the parallel speedup.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/13616#issuecomment-1525016442
More information about the hotspot-runtime-dev mailing list