RFR: 8321242: Enable WorkerThreads to run tasks in caller thread [v5]
Stefan Karlsson
stefank at openjdk.org
Thu Dec 7 08:02:33 UTC 2023
On Wed, 6 Dec 2023 15:24:38 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:
> > I had hoped that we could get rid of even more special handling of the caller in the WorkerTaskDispatcher, but I see that the proposed patch still tweaks `_num_finished` and runs the caller class without partaking in the counter logic.
>
> Sorry, I don't see "even more special handling". Caller _does_ participate in counting `_started` and `_not_finished` now. The only thing that caller specifically does is reserving one task for itself (by issuing one less permit at `_start_semaphore` -- think about it as consuming a permit right after issuing the batch of them), because...
Right. That's what I'm not particularly found of.
>
> > I guess that the motivation for this is so that we skip all signaling and skip potentially waking up a worker when it isn't needed. I think I would prefer if we had a special-case for this, and make sure that all threads that participate in the WorkerTaskDispatcher use signaling and accounting.
>
> ...the problem is not just waking up the workers unnecessarily, but dealing with the situation when caller wakes up a worker, loses a tiny race against the waking up worker, and then realizes that worker started executing the task. This would miss the optimization we actually wanted to begin with: avoiding the round-trip to worker thread for executing a single task.
That's exactly the scenario I tried to describe above.
>
> I see your [c1c7295](https://github.com/openjdk/jdk/commit/c1c7295eb0e5bb5b3ad35bfae8316cbaed62cacb) avoids this for `workers=1` case by special-handling it, but the problem manifests for threads=2 cases as well.
>
> ```
> # This PR:
>
> Test with 2 workers:
> only workers:
> 88857.521 us total; 8.886 us avg; 19.308 us max
> 91031.770 us total; 9.103 us avg; 19.632 us max
> 92511.074 us total; 9.251 us avg; 18.840 us max
> 88850.082 us total; 8.885 us avg; 20.686 us max
> 88864.039 us total; 8.886 us avg; 19.132 us max
> workers + caller:
> 16625.124 us total; 1.663 us avg; 11.781 us max
> 16658.809 us total; 1.666 us avg; 9.663 us max
> 16434.287 us total; 1.643 us avg; 11.400 us max
> 16498.131 us total; 1.650 us avg; 9.998 us max
> 16496.808 us total; 1.650 us avg; 13.919 us max
>
> # Mixing in: https://github.com/openjdk/jdk/commit/c1c7295eb0e5bb5b3ad35bfae8316cbaed62cacb
>
> Test with 2 workers:
> only workers:
> 99672.791 us total; 9.967 us avg; 19.332 us max
> 95419.048 us total; 9.542 us avg; 19.795 us max
> 64211.455 us total; 6.421 us avg; 20.518 us max
> 63979.810 us total; 6.398 us avg; 46.090 us max
> 92039.049 us total; 9.204 us avg; 32.014 us max
> workers + caller:
> 31425.773 us total; 3.143 us avg; 10.903 us max
> 31333.930 us total; 3.133 us avg; 17.883 us max
> 31492.294 us total; 3.149 us avg; 12.184 us max
> 31560.567 us total; 3.156 us avg; 12.971 us max
> 31555.060 us total; 3.156 us avg; 12.468 us max
> ```
Right. You skip waking up one out of a set of workers and you get some minuscule benefits from that.
>
> > Yet another alternative is to limit the change to only let the caller run if the task is only supposed to run with one worker.
>
> Yeah, I started from that version. But then I realized that doing N-1 instead on N round-trips to worker threads is also beneficial, since every round-trip is a latency hiccup waiting to happen.
>
> But I also think we have to consider reliability/testability: if we make a special case for `num_tasks = 1`, the whole thing would actually be less testable. You would need to make sure that every time you have a task that can be executed in caller, _some testing configuration_ would execute it with single-threaded worker pool to cover caller-run path. Always taking the task in caller, if we can, makes caller-run paths exercised much more often.
>
> So at this point I am not in favor of special-casing the single-task path.
I'm still not convinced. I understand that you can measure a micro second(!) difference in the implementations when you run these no-op tasks. I think you need to motivate this change by giving a concrete GC task and use that as example where this would be useful and then we can start to evaluate the performance difference between the various approaches.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/16945#issuecomment-1844845558
More information about the hotspot-gc-dev
mailing list