RFR: 8306738: Select num workers for safepoint ParallelCleanupTask [v2]
Aleksey Shipilev
shade at openjdk.org
Tue Apr 25 20:43:08 UTC 2023
On Tue, 25 Apr 2023 14:34:06 GMT, Axel Boldt-Christmas <aboldtch at openjdk.org> wrote:
>> Select num workers for safepoint ParallelCleanupTask to be at most the number of parallel cleanup tasks `SAFEPOINT_CLEANUP_NUM_TASKS`
>>
>> Unsure if `WorkerThreads::max_workers()` is more correct than `WorkerThreads::created_workers()` or `WorkerThreads::active_workers()`.
>>
>> Tested tier1-3 Oracle platforms.
>
> Axel Boldt-Christmas has updated the pull request incrementally with one additional commit since the last revision:
>
> Use active_workers
I still believe we can do significantly better! The key is not only to cap the parallelism at the number of tasks, but also to recognize that most of these tasks are so light-weight that they do not need a separate worker thread. In fact, if we estimate that only the light-weight tasks are going to run, we might as well do the cleanup serially on the VM thread, without any handoff to worker threads.
For example, my crude experiment (patch below) cuts the safepoint cleanup time from ~40 us to ~1 us on my M1 Mac, running G1 with a 1 GB heap and a `while (true) sink = new int[10000]` allocation loop. Pretty sure we can polish it even further.
diff --git a/src/hotspot/share/runtime/safepoint.cpp b/src/hotspot/share/runtime/safepoint.cpp
index 2ff593a0143..64dcd0202ae 100644
--- a/src/hotspot/share/runtime/safepoint.cpp
+++ b/src/hotspot/share/runtime/safepoint.cpp
@@ -547,6 +547,19 @@ public:
     _do_lazy_roots(!VMThread::vm_operation()->skip_thread_oop_barriers() &&
                    Universe::heap()->uses_stack_watermark_barrier()) {}
 
+  size_t expected_workers() {
+    // This worker does the light-weight tasks and guarantees progress
+    size_t workers = 1;
+
+    // Potential heavy-weights, allocate another worker for them
+    // TODO: InlineCacheBuffer should be here too.
+    if (_do_lazy_roots) {
+      workers++;
+    }
+
+    return workers;
+  }
+
   void work(uint worker_id) {
     // These tasks are ordered by relative length of time to execute so that potentially longer tasks start first.
     if (_subtasks.try_claim_task(SafepointSynchronize::SAFEPOINT_CLEANUP_SYMBOL_TABLE_REHASH)) {
@@ -601,9 +614,11 @@ void SafepointSynchronize::do_cleanup_tasks() {
   assert(heap != nullptr, "heap not initialized yet?");
   ParallelCleanupTask cleanup;
   WorkerThreads* cleanup_workers = heap->safepoint_workers();
-  if (cleanup_workers != nullptr) {
+  size_t num_workers = cleanup.expected_workers();
+  if (num_workers > 1 && cleanup_workers != nullptr) {
     // Parallel cleanup using GC provided thread pool.
-    cleanup_workers->run_task(&cleanup);
+    num_workers = MIN2<uint>(cleanup_workers->active_workers(), num_workers);
+    cleanup_workers->run_task(&cleanup, num_workers);
   } else {
     // Serial cleanup using VMThread.
     cleanup.work(0);
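For reference, a standalone version of that allocation workload could look like the sketch below; the class name, the volatile sink field, and the exact flags are my approximation, and `-Xlog:safepoint` should show the per-safepoint cleanup times:

// AllocLoop.java -- rough standalone version of the workload above.
// Run with something like:
//   java -Xms1g -Xmx1g -XX:+UseG1GC -Xlog:safepoint AllocLoop.java
public class AllocLoop {
    static volatile Object sink; // volatile keeps the allocation from being optimized away

    public static void main(String[] args) {
        while (true) {
            sink = new int[10000];
        }
    }
}

The small heap just keeps the GC safepoints coming frequently enough to get stable numbers.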
This could be done in a separate PR, of course, but I don't see why we can't test some approaches here :)
-------------
PR Comment: https://git.openjdk.org/jdk/pull/13616#issuecomment-1522384536