RFR: 8306738: Select num workers for safepoint ParallelCleanupTask [v2]
Axel Boldt-Christmas
aboldtch at openjdk.org
Wed May 3 14:34:18 UTC 2023
On Thu, 27 Apr 2023 12:36:54 GMT, Axel Boldt-Christmas <aboldtch at openjdk.org> wrote:
>>> So we need to investigate the cost of each operation, the possibility (and granularity) of estimating the work for each task, and when the cost of spinning up worker threads is amortised by the parallel speedup.
>>
>> Yes.
>>
>> I think the guiding principle here is that thread hand-off (delivering the wake-up signal, wake-up lag, waiting for worker thread termination) can easily take tens of microseconds. We established that during the Parallel Streams work in JDK 8. So for tasks that only notify other, already threaded subsystems, spinning up a worker thread just to perform the notification of _another_ thread seems wasteful.
>>
>> I think the only subtasks that do actual work are lazy roots (since they walk the unbounded number of threads) and IC buffer cleanups (when they have work, they can walk lots of IC stubs). The compounding factor is that the string/symbol/oopstorage cleanup notification paths all take `ServiceLock`, so not only do we spin up worker threads unnecessarily to essentially deliver a flag update, we also make those worker threads contend over that lock! Oops. I wonder if that is where some of the cleanup time spikes come from.
>>
>> I believe that we don't need to go very deep in this. The crude patch I have above is probably already 80% there; we "only" need to handle the IC buffer updates there, and test that this actually works in many scenarios. There is no need, IMO, to try to micro-optimize the stability of `ParallelCleanupTask::estimated_workers`, or to change the way we claim the subtasks, etc., since from the ballpark experiments I had, those things consume about a microsecond -- diminishing returns for the complexity the deeper rewrite would entail. Just estimating the number of safepoint workers gets us lots of bang for little buck :)
>
> Alright, my only fear is underestimating the work for some workload and providing a change that increases safepoint times due to selecting too few workers.
> @xmas92, what is your plan for this? It would be perfect to get it into JDK 21!
Same!
I propose an implementation that estimates the expected workers conservatively (see the sketch below). That is:
* Keep the previous behaviour with respect to `active_workers()` (never use more than `active_workers()`)
* Use at most the expected number of workers, which is:
  * +1 if SymbolTable actually needs to rehash with alt hash
  * +1 if StringTable actually needs to rehash with alt hash
  * +1 if InlineCacheBuffer has any work to do
  * +1 if thread roots require processing
* If at most one worker is expected, do not use any worker threads
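For concreteness, here is a minimal standalone sketch of that estimate. It is not the actual patch: the function name `estimated_cleanup_workers` and the four boolean parameters are hypothetical stand-ins for the real checks.

```c++
#include <algorithm>
#include <cstdint>

// Conservative worker estimate: one worker per subtask that actually has
// work, capped by active_workers(), and no workers at all when at most
// one subtask needs to run (the VM thread can do it serially).
static uint32_t estimated_cleanup_workers(uint32_t active_workers,
                                          bool symbol_table_rehash,
                                          bool string_table_rehash,
                                          bool ic_buffer_has_work,
                                          bool thread_roots_need_processing) {
  uint32_t expected = 0;
  if (symbol_table_rehash)          expected++;  // SymbolTable alt-hash rehash
  if (string_table_rehash)          expected++;  // StringTable alt-hash rehash
  if (ic_buffer_has_work)           expected++;  // InlineCacheBuffer cleanup
  if (thread_roots_need_processing) expected++;  // lazy thread-root processing
  if (expected <= 1) {
    return 0;  // skip the thread hand-off cost entirely
  }
  return std::min(expected, active_workers);
}
```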
This seems like a good first step to me; in the worst case it results in four workers, and in the common case one or two, depending on GC and VM state.
There are some potential RFEs here to improve the expected-worker estimates for inline cache buffers and thread roots, where we can find some lower bound below which the required work cannot amortise the cost of using worker threads (see the sketch below).
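As an illustration of what such a lower bound might look like (everything here is made up: the name `ic_buffer_worth_a_worker`, the stub count parameter, and especially the cutoff value, which would need actual measurement):

```c++
#include <cstddef>

// Hypothetical RFE sketch: only count a worker for the IC buffer subtask
// when the pending work is large enough to amortise the tens-of-microseconds
// thread hand-off cost mentioned earlier in the thread.
static bool ic_buffer_worth_a_worker(size_t pending_ic_stubs) {
  const size_t min_stubs_per_worker = 64;  // illustrative cutoff, not measured
  return pending_ic_stubs >= min_stubs_per_worker;
}
```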
Also, I am not 100% happy with the `uint xxx_expected_workers()` function names and signatures, so if there are any alternative suggestions, I am all ears.
(Also, the xxxTable code duplication is pretty atrocious.)
-------------
PR Comment: https://git.openjdk.org/jdk/pull/13616#issuecomment-1533128637