RFR: 8321242: Enable WorkerThreads to run tasks in caller thread [v5]

Thu Dec 7 08:02:33 UTC 2023

On Wed, 6 Dec 2023 15:24:38 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:

> > I had hoped that we could get rid of even more special handling of the caller in the WorkerTaskDispatcher, but I see that the proposed patch still tweaks `_num_finished` and runs the caller class without partaking in the counter logic.
> 
> Sorry, I don't see "even more special handling". Caller _does_ participate in counting `_started` and `_not_finished` now. The only thing that caller specifically does is reserving one task for itself (by issuing one less permit at `_start_semaphore` -- think about it as consuming a permit right after issuing the batch of them), because...

Right. That's what I'm not particularly found of.

> 
> > I guess that the motivation for this is so that we skip all signaling and skip potentially waking up a worker when it isn't needed. I think I would prefer if we had a special-case for this, and make sure that all threads that participate in the WorkerTaskDispatcher use signaling and accounting.
> 
> ...the problem is not just waking up the workers unnecessarily, but dealing with the situation when caller wakes up a worker, loses a tiny race against the waking up worker, and then realizes that worker started executing the task. This would miss the optimization we actually wanted to begin with: avoiding the round-trip to worker thread for executing a single task.

That's exactly the scenario I tried to describe above. 

> 
> I see your [c1c7295](https://github.com/openjdk/jdk/commit/c1c7295eb0e5bb5b3ad35bfae8316cbaed62cacb) avoids this for `workers=1` case by special-handling it, but the problem manifests for threads=2 cases as well.
> 
> ```
> # This PR:
> 
> Test with 2 workers:                                 
>   only workers:
>        88857.521 us total;      8.886 us avg;     19.308 us max
>        91031.770 us total;      9.103 us avg;     19.632 us max
>        92511.074 us total;      9.251 us avg;     18.840 us max
>        88850.082 us total;      8.885 us avg;     20.686 us max
>        88864.039 us total;      8.886 us avg;     19.132 us max
>   workers + caller:
>        16625.124 us total;      1.663 us avg;     11.781 us max
>        16658.809 us total;      1.666 us avg;      9.663 us max
>        16434.287 us total;      1.643 us avg;     11.400 us max
>        16498.131 us total;      1.650 us avg;      9.998 us max
>        16496.808 us total;      1.650 us avg;     13.919 us max
> 
> # Mixing in: https://github.com/openjdk/jdk/commit/c1c7295eb0e5bb5b3ad35bfae8316cbaed62cacb
> 
> Test with 2 workers:
>   only workers:
>        99672.791 us total;      9.967 us avg;     19.332 us max
>        95419.048 us total;      9.542 us avg;     19.795 us max
>        64211.455 us total;      6.421 us avg;     20.518 us max
>        63979.810 us total;      6.398 us avg;     46.090 us max
>        92039.049 us total;      9.204 us avg;     32.014 us max
>   workers + caller:
>        31425.773 us total;      3.143 us avg;     10.903 us max
>        31333.930 us total;      3.133 us avg;     17.883 us max
>        31492.294 us total;      3.149 us avg;     12.184 us max
>        31560.567 us total;      3.156 us avg;     12.971 us max
>        31555.060 us total;      3.156 us avg;     12.468 us max
> ```

Right. You skip waking up one out of a set of workers and you get some minuscule benefits from that.

> 
> > Yet another alternative is to limit the change to only let the caller run if the task is only supposed to run with one worker.
> 
> Yeah, I started from that version. But then I realized that doing N-1 instead on N round-trips to worker threads is also beneficial, since every round-trip is a latency hiccup waiting to happen.
> 
> But I also think we have to consider reliability/testability: if we make a special case for `num_tasks = 1`, the whole thing would actually be less testable. You would need to make sure that every time you have a task that can be executed in caller, _some testing configuration_ would execute it with single-threaded worker pool to cover caller-run path. Always taking the task in caller, if we can, makes caller-run paths exercised much more often.
> 
> So at this point I am not in favor of special-casing the single-task path.

I'm still not convinced. I understand that you can measure a micro second(!) difference in the implementations when you run these no-op tasks. I think you need to motivate this change by giving a concrete GC task and use that as example where this would be useful and then we can start to evaluate the performance difference between the various approaches.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16945#issuecomment-1844845558