Performance of pooling virtual threads vs. semaphores
Liam Miller-Cushon
cushon at google.com
Tue Jun 4 23:22:43 UTC 2024
Hi,
Thanks for the comments. There's more to dig into here and to respond to,
but I have a few notes for now.
re: whether the question is about making this code faster or why case (2)
scales the way it does, it's sort of both: we're interested in understanding
(2) better, but if (2) is simply expected to be more expensive than pushing
back, that's also useful to know.
To provide some more color on the real-world program that inspired the
question, it's the Bazel build system (https://github.com/bazelbuild/bazel).
Bazel constructs a graph of build 'actions', where each node is a command
to run during the build (for example a call to javac that takes some source
files and classpath jars as inputs, and produces class file outputs). The
current approach uses a pool of physical threads to execute actions.
Actions can block on IO, and especially on RPCs to remote machines that are
performing build steps. The goal is to use virtual threads to make the
actions async, and to replace the limit on the number of concurrent actions
with a dynamic limit based on the availability of other resources (memory,
and the number of in-flight RPCs to remote build machines). As a first
step, Chi was trying to get a baseline where the virtual thread-based
implementation used the same concurrency limit as the previous approach,
ideally with roughly the same CPU overhead.
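To make that concrete, here is a minimal sketch of the end state we're aiming
at (the resource names, permit counts, and method signatures are invented for
illustration; this is not how Bazel's execution code is actually structured):
each action runs on its own virtual thread and parks on whichever resource
permits it needs, instead of counting against a fixed cap on concurrent
actions.
```
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

// Hypothetical sketch only: resource kinds and limits are made up.
class ActionRunner {
  // One permit pool per constrained resource, instead of a single
  // "number of concurrent actions" limit.
  private final Semaphore memoryPermits = new Semaphore(512);
  private final Semaphore inFlightRpcPermits = new Semaphore(1_000);
  private final ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor();

  void submit(Runnable action, int estimatedMemoryUnits, boolean usesRemoteExecution) {
    executor.submit(() -> {
      // Acquire on the action's own virtual thread, so blocking here just
      // parks the virtual thread until the resource frees up.
      memoryPermits.acquireUninterruptibly(estimatedMemoryUnits);
      if (usesRemoteExecution) {
        inFlightRpcPermits.acquireUninterruptibly();
      }
      try {
        action.run();  // the action itself may block on IO or remote RPCs
      } finally {
        if (usesRemoteExecution) {
          inFlightRpcPermits.release();
        }
        memoryPermits.release(estimatedMemoryUnits);
      }
    });
  }
}
```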
I have more benchmarking results from the original machine using one of the
suggested approaches: limiting the number of in-flight virtual threads with a
ThreadFactory that blocks on a semaphore. The new strategy (4) uses less CPU
than strategy (2), where all of the virtual threads are started up front, but
still uses more CPU than the thread-pooling approach in (3).
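(For context, strategy (2) is the JEP 444-style approach from the earlier
mail: every task gets its own virtual thread immediately, and the concurrency
limit is enforced by a semaphore acquired inside the task. Roughly, and not
the exact benchmark code:)
```
var sem = new Semaphore(parallelism);
var vtExecutor = Executors.newVirtualThreadPerTaskExecutor();
for (int i = 0; i < numTasks; i++) {
    vtExecutor.submit(() -> {
        try {
            sem.acquire();   // most of the numTasks virtual threads park here
        } catch (InterruptedException ex) {
            throw new IllegalStateException("interrupted");
        }
        try {
            task.run();      // the mixed CPU + sleep workload
        } finally {
            sem.release();
        }
    });
}
```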
With strategy 4 as:
```
case 4 -> {
  // Limit the number of in-flight virtual threads by acquiring a permit in
  // the ThreadFactory, i.e. before each thread is created.
  var factorySem = new Semaphore(parallelism);
  ThreadFactory tf = (Runnable r) -> {
    try {
      factorySem.acquire();
    } catch (InterruptedException ex) {
      throw new IllegalStateException("interrupted");
    }
    return Thread.ofVirtual().unstarted(() -> {
      try {
        r.run();
      } finally {
        // Release the permit when the task finishes.
        factorySem.release();
      }
    });
  };
  executor = Executors.newThreadPerTaskExecutor(tf);
}
```
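One property of this variant worth noting: newThreadPerTaskExecutor creates
and starts the thread inside submit(), so the factory's acquire() runs on the
submitting thread. Submission therefore blocks once `parallelism` threads are
in flight, and at most `parallelism` virtual threads exist at any time,
whereas in strategy (2) all of the virtual threads are created up front and
block inside the task. The driver loop presumably looks something like this
(again a sketch, not the exact benchmark code):
```
try (executor) {
    for (int i = 0; i < numTasks; i++) {
        // With strategy (4), this call blocks in the ThreadFactory once
        // `parallelism` threads are live.
        executor.submit(task);
    }
}   // ExecutorService.close() waits for the submitted tasks to finish
```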
The results are:
# 128 CPUs
Benchmark 1: java Main -- 1 10000
  Time (mean ± σ):     719.7 ms ±  45.9 ms    [User: 23732.3 ms, System: 1155.3 ms]
  Range (min … max):   651.1 ms … 794.0 ms    10 runs

Benchmark 2: java Main -- 2 10000
  Time (mean ± σ):     635.8 ms ±  66.9 ms    [User: 21271.1 ms, System: 1685.1 ms]
  Range (min … max):   565.5 ms … 769.6 ms    10 runs

Benchmark 3: java Main -- 3 10000
  Time (mean ± σ):     486.2 ms ±   9.3 ms    [User: 10804.0 ms, System: 1181.6 ms]
  Range (min … max):   464.5 ms … 495.1 ms    10 runs

Benchmark 4: java Main -- 4 10000
  Time (mean ± σ):     479.9 ms ±   3.7 ms    [User: 8366.7 ms, System: 1201.9 ms]
  Range (min … max):   474.8 ms … 488.0 ms    10 runs

Summary
  java Main -- 4 10000 ran
    1.01 ± 0.02 times faster than java Main -- 3 10000
    1.32 ± 0.14 times faster than java Main -- 2 10000
    1.50 ± 0.10 times faster than java Main -- 1 10000

Benchmark 1: java Main -- 1 100000
  Time (mean ± σ):      3.803 s ±  0.044 s    [User: 41.227 s, System: 2.418 s]
  Range (min … max):    3.737 s …  3.874 s    10 runs

Benchmark 2: java Main -- 2 100000
  Time (mean ± σ):      3.761 s ±  0.063 s    [User: 53.494 s, System: 3.380 s]
  Range (min … max):    3.651 s …  3.848 s    10 runs

Benchmark 3: java Main -- 3 100000
  Time (mean ± σ):      3.555 s ±  0.040 s    [User: 20.824 s, System: 2.350 s]
  Range (min … max):    3.529 s …  3.667 s    10 runs

Benchmark 4: java Main -- 4 100000
  Time (mean ± σ):      3.558 s ±  0.011 s    [User: 28.272 s, System: 3.018 s]
  Range (min … max):    3.544 s …  3.581 s    10 runs

Summary
  java Main -- 3 100000 ran
    1.00 ± 0.01 times faster than java Main -- 4 100000
    1.06 ± 0.02 times faster than java Main -- 2 100000
    1.07 ± 0.02 times faster than java Main -- 1 100000

Benchmark 1: java Main -- 1 1000000
  Time (mean ± σ):     34.182 s ±  0.020 s    [User: 165.208 s, System: 15.016 s]
  Range (min … max):   34.148 s … 34.205 s    10 runs

Benchmark 2: java Main -- 2 1000000
  Time (mean ± σ):     34.917 s ±  0.270 s    [User: 335.774 s, System: 23.993 s]
  Range (min … max):   34.575 s … 35.457 s    10 runs

Benchmark 3: java Main -- 3 1000000
  Time (mean ± σ):     36.146 s ±  3.468 s    [User: 163.926 s, System: 10.816 s]
  Range (min … max):   34.209 s … 45.588 s    10 runs

Benchmark 4: java Main -- 4 1000000
  Time (mean ± σ):     34.390 s ±  0.051 s    [User: 216.980 s, System: 22.147 s]
  Range (min … max):   34.303 s … 34.446 s    10 runs

Summary
  java Main -- 1 1000000 ran
    1.01 ± 0.00 times faster than java Main -- 4 1000000
    1.02 ± 0.01 times faster than java Main -- 2 1000000
    1.06 ± 0.10 times faster than java Main -- 3 1000000
# 16 CPUs
Benchmark 1: java Main -- 1 10000
  Time (mean ± σ):     825.6 ms ±  50.6 ms    [User: 6863.7 ms, System: 779.8 ms]
  Range (min … max):   745.2 ms … 936.4 ms    10 runs

Benchmark 2: java Main -- 2 10000
  Time (mean ± σ):     515.7 ms ±   8.4 ms    [User: 4591.8 ms, System: 274.2 ms]
  Range (min … max):   502.7 ms … 532.9 ms    10 runs

Benchmark 3: java Main -- 3 10000
  Time (mean ± σ):     494.8 ms ±  34.9 ms    [User: 3751.9 ms, System: 190.6 ms]
  Range (min … max):   463.3 ms … 580.0 ms    10 runs

Benchmark 4: java Main -- 4 10000
  Time (mean ± σ):     473.2 ms ±   7.5 ms    [User: 3851.2 ms, System: 248.5 ms]
  Range (min … max):   464.1 ms … 483.3 ms    10 runs

Summary
  java Main -- 4 10000 ran
    1.05 ± 0.08 times faster than java Main -- 3 10000
    1.09 ± 0.02 times faster than java Main -- 2 10000
    1.74 ± 0.11 times faster than java Main -- 1 10000

Benchmark 1: java Main -- 1 100000
  Time (mean ± σ):      3.949 s ±  0.099 s    [User: 32.922 s, System: 2.386 s]
  Range (min … max):    3.882 s …  4.222 s    10 runs

Benchmark 2: java Main -- 2 100000
  Time (mean ± σ):      3.739 s ±  0.063 s    [User: 31.259 s, System: 2.117 s]
  Range (min … max):    3.661 s …  3.832 s    10 runs

Benchmark 3: java Main -- 3 100000
  Time (mean ± σ):      5.106 s ±  0.990 s    [User: 20.897 s, System: 0.686 s]
  Range (min … max):    3.955 s …  7.679 s    10 runs

Benchmark 4: java Main -- 4 100000
  Time (mean ± σ):      3.570 s ±  0.013 s    [User: 27.292 s, System: 1.807 s]
  Range (min … max):    3.557 s …  3.594 s    10 runs

Summary
  java Main -- 4 100000 ran
    1.05 ± 0.02 times faster than java Main -- 2 100000
    1.11 ± 0.03 times faster than java Main -- 1 100000
    1.43 ± 0.28 times faster than java Main -- 3 100000

Benchmark 1: java Main -- 1 1000000
  Time (mean ± σ):     34.536 s ±  0.076 s    [User: 263.045 s, System: 16.227 s]
  Range (min … max):   34.457 s … 34.717 s    10 runs

Benchmark 2: java Main -- 2 1000000
  Time (mean ± σ):     35.257 s ±  0.191 s    [User: 285.159 s, System: 19.702 s]
  Range (min … max):   35.015 s … 35.561 s    10 runs

Benchmark 3: java Main -- 3 1000000
  Time (mean ± σ):     60.157 s ± 10.130 s    [User: 171.901 s, System: 3.902 s]
  Range (min … max):   46.437 s … 74.799 s    10 runs

Benchmark 4: java Main -- 4 1000000
  Time (mean ± σ):     34.370 s ±  0.067 s    [User: 256.460 s, System: 19.605 s]
  Range (min … max):   34.299 s … 34.455 s    10 runs

Summary
  java Main -- 4 1000000 ran
    1.00 ± 0.00 times faster than java Main -- 1 1000000
    1.03 ± 0.01 times faster than java Main -- 2 1000000
    1.75 ± 0.29 times faster than java Main -- 3 1000000
On Thu, May 30, 2024 at 11:22 AM Ron Pressler <ron.pressler at oracle.com>
wrote:
>
>
> > On 28 May 2024, at 23:43, Liam Miller-Cushon <cushon at google.com> wrote:
> >
> > Hello,
> >
> > JEP 444 [1] and the docs in [2] mention that virtual threads should not
> be pooled, and suggest semaphores as one possible alternative.
> >
> > My colleague Chi Wang has been investigating virtual thread performance,
> and has found that creating one virtual thread per task and synchronizing
> on a semaphore can result in worse performance on machines with large
> numbers of cores.
> >
> > A benchmark run on a 128 core machine is included below. It submits
> numTasks tasks to the executor determined by the strategy. The task itself
> is mixed with CPU and I/O work (simulated by fibonacci and sleep). The
> parallelism is set to 600 for all strategies.
> >
> > * Strategy 1 is the baseline where it just submits all the tasks to a
> ForkJoinPool whose pool size is 600.
> > * Strategy 2 uses the method suggested by JEP 444.
> > * Strategy 3 uses a fixed thread pool of 600 backed by virtual threads.
> >
> > Note that, with 100K and 1M tasks, strategy 2 has a CPU time regression
> that seems to increase with the number of cores. This result can be
> reproduced on the real-world program that is being migrated to a
> virtual-thread-per-task model.
> >
> > Diffing the cpu profile between strategy 1 and strategy 2 showed that
> most of the CPU regression comes from method
> `java.util.concurrent.ForkJoinPool.scan(java.util.concurrent.ForkJoinPool$WorkQueue,
> long, int)`.
> >
> > Are there any ideas for why the semaphore strategy uses more CPU than
> pooling virtual threads on machines with a large number of cores?
>
> My first guess would be that what’s being compared here is not the same
> algorithm. In strategy 2, you spawn a bunch of different threads that do a
> tiny bit of work before starting the task, while in the other strategies
> the algorithm is different.
>
> In a realistic workload you will have a number of threads serving
> requests, and then they may want to access a concurrency-limited resource.
> I think you can try to simulate that by simulating, say, 1000 tasks running
> in some large thread pool and then each of them would submit one subtask
> to a fixed thread pool of size 10. For virtual threads, have 1000 virtual
> threads that try to acquire a semaphore with 10 leases.
>
> Can you try that? At least then we’ll be comparing oranges to oranges.
>
> — Ron