Performance of pooling virtual threads vs. semaphores

Liam Miller-Cushon cushon at google.com
Wed Jun 5 00:16:32 UTC 2024


Yes, I think the Ryzen 3995WX has 64 physical cores; the 128 count was with
hyperthreading enabled.

I don't have numbers from it yet (it isn't my machine). I'll get numbers
with hyperthreading disabled and with your VirtualThreadExecutorService
approach from [1], and report back.

[1] https://mail.openjdk.org/pipermail/loom-dev/2024-May/006627.html

On Tue, Jun 4, 2024 at 4:35 PM Robert Engels <robaho at icloud.com> wrote:

> I know I sent a bunch of replies, but did you by chance run the tests
> using the VirtualThreadExecutorService that doesn’t have the semaphore?
>
> But, more importantly, as I also pointed out, something seems way off in
> the timings. I don’t think the 128-core machine actually has 128 available
> cores - the timings are way too high. With this test you should see a
> nearly linear increase in performance as the number of carrier threads
> increases. The tests here aren’t even close. I am guessing these are
> virtual cores backed by their own scheduler on a shared VM.
>
> With virtual threads, the number of carrier threads needs to match the real
> available cores, or you will waste CPU on scheduler thrashing.
>
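> For reference, the default virtual-thread scheduler sizes its carrier pool
> from Runtime.getRuntime().availableProcessors(), which reports logical
> (SMT) CPUs rather than physical cores, and it can be overridden with the
> jdk.virtualThreadScheduler.parallelism system property. A minimal sketch
> (the value 64 is only an illustration for this particular machine):
>
> ```
> // The default carrier-thread count equals the logical CPU count printed here.
> System.out.println(Runtime.getRuntime().availableProcessors());
> // To pin the scheduler to the physical core count instead, launch with
> // e.g. -Djdk.virtualThreadScheduler.parallelism=64 (illustrative value).
> ```
>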
>
> On Jun 4, 2024, at 6:22 PM, Liam Miller-Cushon <cushon at google.com> wrote:
>
> Hi,
>
> Thanks for the comments. There's more to dig into here and to respond to,
> but I have a few notes for now.
>
> re: whether the question is about making this code faster or about why
> case (2) scales oddly: it's sort of both. We're interested in understanding
> (2) better, but if (2) is expected to be expensive compared to pushing
> back, knowing that is also helpful.
>
> To provide some more color on the real-world program that inspired the
> question, it's the Bazel build system (https://github.com/google/bazel).
> Bazel constructs a graph of build 'actions', where each node is a command
> to run during the build (for example a call to javac that takes some source
> files and classpath jars as inputs, and produces class file outputs). The
> current approach uses a pool of physical threads to execute actions.
> Actions can block on IO, and especially on RPCs to remote machines that are
> performing build steps. The goal is to use virtual threads to make the
> actions async, and to replace the limit on the number of concurrent actions
> with a dynamic limit based on the availability of other resources (memory,
> and the number of in-flight RPCs to remote build machines). As a first
> step, Chi was trying to get a baseline where the virtual thread-based
> implementation used the same concurrency limit as the previous approach,
> ideally with roughly the same CPU overhead.
>
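> To make the intended end state a bit more concrete, here is a rough sketch
> of that direction (Action, estimatedMemoryUnits(), runAction() and the
> permit counts are placeholders for illustration, not Bazel APIs): one
> virtual thread per action, with admission controlled by the resources the
> action actually needs rather than by a fixed thread count.
>
> ```
> // Hypothetical sketch only; names are illustrative.
> Semaphore memoryPermits = new Semaphore(availableMemoryUnits);
> Semaphore rpcPermits = new Semaphore(maxInFlightRpcs);
>
> try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
>   for (Action action : actions) {
>     executor.submit(() -> {
>       memoryPermits.acquire(action.estimatedMemoryUnits());
>       try {
>         // In reality the RPC permit would wrap just the remote call inside
>         // the action; holding it for the whole action keeps the sketch short.
>         rpcPermits.acquire();
>         try {
>           runAction(action); // may block on IO / remote execution RPCs
>         } finally {
>           rpcPermits.release();
>         }
>       } finally {
>         memoryPermits.release(action.estimatedMemoryUnits());
>       }
>       return null; // Callable, so acquire() may throw InterruptedException
>     });
>   }
> } // close() waits for all actions to finish
> ```
>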
> I have more benchmarking results from the original machine, using one of
> the suggested approaches: limiting the number of in-flight virtual threads
> with a ThreadFactory that acquires a semaphore. The new strategy (4) uses
> less CPU than strategy (2), where all of the virtual threads are started at
> the same time, but still uses more CPU than the thread-pooling approach
> in (3).
>
> With strategy 4 implemented as:
>
> ```
>       case 4 -> {
>         // Limit the number of in-flight virtual threads with a semaphore
>         // acquired in the ThreadFactory. Note that the factory is invoked
>         // on the submitting thread, so submission blocks here until a
>         // permit is available.
>         var factorySem = new Semaphore(parallelism);
>         ThreadFactory tf = (Runnable r) -> {
>           try {
>             factorySem.acquire();
>           } catch (InterruptedException ex) {
>             Thread.currentThread().interrupt(); // restore interrupt status
>             throw new IllegalStateException("interrupted");
>           }
>           // The permit is held until the task finishes running.
>           return Thread.ofVirtual().unstarted(() -> {
>             try {
>               r.run();
>             } finally {
>               factorySem.release();
>             }
>           });
>         };
>         executor = Executors.newThreadPerTaskExecutor(tf);
>       }
> ```
>
> The results are:
>
> # 128 CPUs
>
> Benchmark 1: java Main -- 1 10000
>   Time (mean ± σ):     719.7 ms ±  45.9 ms    [User: 23732.3 ms, System:
> 1155.3 ms]
>   Range (min … max):   651.1 ms … 794.0 ms    10 runs
>
> Benchmark 2: java Main -- 2 10000
>   Time (mean ± σ):     635.8 ms ±  66.9 ms    [User: 21271.1 ms, System:
> 1685.1 ms]
>   Range (min … max):   565.5 ms … 769.6 ms    10 runs
>
> Benchmark 3: java Main -- 3 10000
>   Time (mean ± σ):     486.2 ms ±   9.3 ms    [User: 10804.0 ms, System:
> 1181.6 ms]
>   Range (min … max):   464.5 ms … 495.1 ms    10 runs
>
> Benchmark 4: java Main -- 4 10000
>   Time (mean ± σ):     479.9 ms ±   3.7 ms    [User: 8366.7 ms, System:
> 1201.9 ms]
>   Range (min … max):   474.8 ms … 488.0 ms    10 runs
>
> Summary
>   java Main -- 4 10000 ran
>     1.01 ± 0.02 times faster than java Main -- 3 10000
>     1.32 ± 0.14 times faster than java Main -- 2 10000
>     1.50 ± 0.10 times faster than java Main -- 1 10000
>
> Benchmark 1: java Main -- 1 100000
>   Time (mean ± σ):      3.803 s ±  0.044 s    [User: 41.227 s, System:
> 2.418 s]
>   Range (min … max):    3.737 s …  3.874 s    10 runs
>
> Benchmark 2: java Main -- 2 100000
>   Time (mean ± σ):      3.761 s ±  0.063 s    [User: 53.494 s, System:
> 3.380 s]
>   Range (min … max):    3.651 s …  3.848 s    10 runs
>
> Benchmark 3: java Main -- 3 100000
>   Time (mean ± σ):      3.555 s ±  0.040 s    [User: 20.824 s, System:
> 2.350 s]
>   Range (min … max):    3.529 s …  3.667 s    10 runs
>
> Benchmark 4: java Main -- 4 100000
>   Time (mean ± σ):      3.558 s ±  0.011 s    [User: 28.272 s, System:
> 3.018 s]
>   Range (min … max):    3.544 s …  3.581 s    10 runs
>
> Summary
>   java Main -- 3 100000 ran
>     1.00 ± 0.01 times faster than java Main -- 4 100000
>     1.06 ± 0.02 times faster than java Main -- 2 100000
>     1.07 ± 0.02 times faster than java Main -- 1 100000
>
> Benchmark 1: java Main -- 1 1000000
>   Time (mean ± σ):     34.182 s ±  0.020 s    [User: 165.208 s, System:
> 15.016 s]
>   Range (min … max):   34.148 s … 34.205 s    10 runs
>
> Benchmark 2: java Main -- 2 1000000
>   Time (mean ± σ):     34.917 s ±  0.270 s    [User: 335.774 s, System:
> 23.993 s]
>   Range (min … max):   34.575 s … 35.457 s    10 runs
>
> Benchmark 3: java Main -- 3 1000000
>   Time (mean ± σ):     36.146 s ±  3.468 s    [User: 163.926 s, System:
> 10.816 s]
>   Range (min … max):   34.209 s … 45.588 s    10 runs
>
> Benchmark 4: java Main -- 4 1000000
>   Time (mean ± σ):     34.390 s ±  0.051 s    [User: 216.980 s, System:
> 22.147 s]
>   Range (min … max):   34.303 s … 34.446 s    10 runs
>
> Summary
>   java Main -- 1 1000000 ran
>     1.01 ± 0.00 times faster than java Main -- 4 1000000
>     1.02 ± 0.01 times faster than java Main -- 2 1000000
>     1.06 ± 0.10 times faster than java Main -- 3 1000000
>
>
> # 16 CPUs
>
> Benchmark 1: java Main -- 1 10000
>   Time (mean ± σ):     825.6 ms ±  50.6 ms    [User: 6863.7 ms, System:
> 779.8 ms]
>   Range (min … max):   745.2 ms … 936.4 ms    10 runs
>
> Benchmark 2: java Main -- 2 10000
>   Time (mean ± σ):     515.7 ms ±   8.4 ms    [User: 4591.8 ms, System:
> 274.2 ms]
>   Range (min … max):   502.7 ms … 532.9 ms    10 runs
>
> Benchmark 3: java Main -- 3 10000
>   Time (mean ± σ):     494.8 ms ±  34.9 ms    [User: 3751.9 ms, System:
> 190.6 ms]
>   Range (min … max):   463.3 ms … 580.0 ms    10 runs
>
> Benchmark 4: java Main -- 4 10000
>   Time (mean ± σ):     473.2 ms ±   7.5 ms    [User: 3851.2 ms, System:
> 248.5 ms]
>   Range (min … max):   464.1 ms … 483.3 ms    10 runs
>
> Summary
>   java Main -- 4 10000 ran
>     1.05 ± 0.08 times faster than java Main -- 3 10000
>     1.09 ± 0.02 times faster than java Main -- 2 10000
>     1.74 ± 0.11 times faster than java Main -- 1 10000
>
> Benchmark 1: java Main -- 1 100000
>   Time (mean ± σ):      3.949 s ±  0.099 s    [User: 32.922 s, System:
> 2.386 s]
>   Range (min … max):    3.882 s …  4.222 s    10 runs
>
> Benchmark 2: java Main -- 2 100000
>   Time (mean ± σ):      3.739 s ±  0.063 s    [User: 31.259 s, System:
> 2.117 s]
>   Range (min … max):    3.661 s …  3.832 s    10 runs
>
> Benchmark 3: java Main -- 3 100000
>   Time (mean ± σ):      5.106 s ±  0.990 s    [User: 20.897 s, System:
> 0.686 s]
>   Range (min … max):    3.955 s …  7.679 s    10 runs
>
> Benchmark 4: java Main -- 4 100000
>   Time (mean ± σ):      3.570 s ±  0.013 s    [User: 27.292 s, System:
> 1.807 s]
>   Range (min … max):    3.557 s …  3.594 s    10 runs
>
> Summary
>   java Main -- 4 100000 ran
>     1.05 ± 0.02 times faster than java Main -- 2 100000
>     1.11 ± 0.03 times faster than java Main -- 1 100000
>     1.43 ± 0.28 times faster than java Main -- 3 100000
>
> Benchmark 1: java Main -- 1 1000000
>   Time (mean ± σ):     34.536 s ±  0.076 s    [User: 263.045 s, System:
> 16.227 s]
>   Range (min … max):   34.457 s … 34.717 s    10 runs
>
> Benchmark 2: java Main -- 2 1000000
>   Time (mean ± σ):     35.257 s ±  0.191 s    [User: 285.159 s, System:
> 19.702 s]
>   Range (min … max):   35.015 s … 35.561 s    10 runs
>
> Benchmark 3: java Main -- 3 1000000
>   Time (mean ± σ):     60.157 s ± 10.130 s    [User: 171.901 s, System:
> 3.902 s]
>   Range (min … max):   46.437 s … 74.799 s    10 runs
>
> Benchmark 4: java Main -- 4 1000000
>   Time (mean ± σ):     34.370 s ±  0.067 s    [User: 256.460 s, System:
> 19.605 s]
>   Range (min … max):   34.299 s … 34.455 s    10 runs
>
> Summary
>   java Main -- 4 1000000 ran
>     1.00 ± 0.00 times faster than java Main -- 1 1000000
>     1.03 ± 0.01 times faster than java Main -- 2 1000000
>     1.75 ± 0.29 times faster than java Main -- 3 1000000
>
> On Thu, May 30, 2024 at 11:22 AM Ron Pressler <ron.pressler at oracle.com>
> wrote:
>
>>
>>
>> > On 28 May 2024, at 23:43, Liam Miller-Cushon <cushon at google.com> wrote:
>> >
>> > Hello,
>> >
>> > JEP 444 [1] and the docs in [2] mention that virtual threads should not
>> > be pooled, and suggest semaphores as one possible alternative.
>> >
>> > My colleague Chi Wang has been investigating virtual thread
>> > performance, and has found that creating one virtual thread per task and
>> > synchronizing on a semaphore can result in worse performance on machines
>> > with large numbers of cores.
>> >
>> > A benchmark run on a 128-core machine is included below. It submits
>> > numTasks tasks to the executor determined by the strategy. The task
>> > itself mixes CPU and I/O work (simulated by Fibonacci and sleep). The
>> > parallelism is set to 600 for all strategies.
>> >
>> > * Strategy 1 is the baseline where it just submits all the tasks to a
>> > ForkJoinPool whose pool size is 600.
>> > * Strategy 2 uses the method suggested by JEP 444 (see the sketch below
>> > this list).
>> > * Strategy 3 uses a fixed thread pool of 600 backed by virtual threads.
>> >
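>> > For clarity, strategy 2 presumably looks roughly like the JEP 444
>> > suggestion, i.e. one virtual thread per task with the semaphore acquired
>> > inside the task itself (a sketch of the idea, not the exact benchmark
>> > code; task stands for the fib+sleep work):
>> >
>> > ```
>> > Semaphore sem = new Semaphore(parallelism); // parallelism = 600
>> > ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor();
>> > for (int i = 0; i < numTasks; i++) {
>> >   executor.submit(() -> {
>> >     sem.acquire();        // parks the virtual thread, not a carrier
>> >     try {
>> >       task.run();         // the mixed CPU + sleep workload
>> >     } finally {
>> >       sem.release();
>> >     }
>> >     return null;          // Callable, so acquire() may throw
>> >   });
>> > }
>> > ```
>> >
>> > Note the contrast with the strategy (4) code earlier in this thread:
>> > there the permit is acquired in the ThreadFactory, so the submitting
>> > thread blocks before the virtual thread is even created, whereas here
>> > every virtual thread is started immediately and parks on the semaphore.
>> >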
>> > Note that, with 100K and 1M tasks, strategy 2 has a CPU time regression
>> > that seems to increase with the number of cores. This result can be
>> > reproduced on the real-world program that is being migrated to a
>> > virtual-thread-per-task model.
>> >
>> > Diffing the CPU profile between strategy 1 and strategy 2 showed that
>> > most of the CPU regression comes from the method
>> > `java.util.concurrent.ForkJoinPool.scan(java.util.concurrent.ForkJoinPool$WorkQueue,
>> > long, int)`.
>> >
>> > Are there any ideas for why the semaphore strategy uses more CPU than
>> > pooling virtual threads on machines with a large number of cores?
>>
>> My first guess would be that what’s being compared here is not the same
>> algorithm. In strategy 2, you spawn a bunch of different threads that do a
>> tiny bit of work before starting the task, while in the other strategies
>> the algorithm is different.
>>
>> In a realistic workload you will have a number of threads serving
>> requests, and then they may want to access a concurrency-limited resource.
>> I think you can simulate that with, say, 1000 tasks running in some large
>> thread pool, where each of them submits one subtask to a fixed thread pool
>> of size 10. For virtual threads, have 1000 virtual threads that each try
>> to acquire a semaphore with 10 leases.
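>>
>> A sketch of what that comparison might look like (subtask() is a
>> placeholder for the concurrency-limited work; shutdown/await handling is
>> omitted for brevity):
>>
>> ```
>> // Platform-thread version: 1000 outer tasks, each handing one subtask to
>> // a fixed pool of 10 and waiting for it.
>> ExecutorService outer = Executors.newFixedThreadPool(1000);
>> ExecutorService limited = Executors.newFixedThreadPool(10);
>> for (int i = 0; i < 1000; i++) {
>>   outer.submit(() -> limited.submit(() -> subtask()).get());
>> }
>>
>> // Virtual-thread version: 1000 virtual threads, each taking one of 10
>> // semaphore leases around the same subtask.
>> Semaphore leases = new Semaphore(10);
>> try (ExecutorService vts = Executors.newVirtualThreadPerTaskExecutor()) {
>>   for (int i = 0; i < 1000; i++) {
>>     vts.submit(() -> {
>>       leases.acquire();
>>       try {
>>         subtask();
>>       } finally {
>>         leases.release();
>>       }
>>       return null; // Callable, so acquire() may throw
>>     });
>>   }
>> }
>> ```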
>>
>> Can you try that? At least then we’ll be comparing oranges to oranges.
>>
>> — Ron
>
>
>