I know I sent a bunch of replies, but did you by chance run the tests using the VirtualThreadExecutorService that doesn't have the semaphore?

But, more importantly, as I also pointed out, something seems way off in the timings. I don't think the 128-core machine actually has 128 available cores - the timings are far too high. With this test you should see a nearly linear increase in performance as the number of carrier threads increases. The results here aren't even close. I am guessing these are virtual cores backed by their own scheduler on a shared VM.

With virtual threads, the number of carrier threads needs to match the number of really available cores, or you will waste CPU on thrashing of the scheduler.
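If the VM really is sharing cores, one quick way to see what the JVM can actually schedule on, and to pin the carrier-thread count explicitly, is something like this (a minimal sketch; jdk.virtualThreadScheduler.parallelism is the standard JDK property, but the value 16 below is only an assumption for illustration):

```
// Prints the number of processors the JVM can actually schedule on; on a
// shared VM this is often lower than the advertised core count.
public class CheckCores {
    public static void main(String[] args) {
        System.out.println("availableProcessors = "
                + Runtime.getRuntime().availableProcessors());
    }
}
```

The default virtual-thread scheduler sizes its carrier pool from that value, and it can also be pinned when launching the benchmark, e.g.:

```
java -Djdk.virtualThreadScheduler.parallelism=16 Main -- 2 100000
```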
On Jun 4, 2024, at 6:22 PM, Liam Miller-Cushon <cushon@google.com> wrote:

Hi,

Thanks for the comments. There's more to dig into here and to respond to, but I have a few notes for now.

re: whether the question is about making this code faster or why case (2) scales weirdly, it's sort of both: we're interested in understanding (2) better, but if (2) is expected to be expensive compared to pushing back, that's helpful.

To provide some more color on the real-world program that inspired the question, it's the Bazel build system (https://github.com/google/bazel). Bazel constructs a graph of build 'actions', where each node is a command to run during the build (for example a call to javac that takes some source files and classpath jars as inputs, and produces class file outputs). The current approach uses a pool of physical threads to execute actions. Actions can block on IO, and especially on RPCs to remote machines that are performing build steps. The goal is to use virtual threads to make the actions async, and to replace the limit on the number of concurrent actions with a dynamic limit based on the availability of other resources (memory, and the number of in-flight RPCs to remote build machines). As a first step, Chi was trying to get a baseline where the virtual thread-based implementation used the same concurrency limit as the previous approach, ideally with roughly the same CPU overhead.

I have more benchmarking results from the original machine, using one of the suggested approaches: limiting the number of in-flight virtual threads with a ThreadFactory that gates thread creation on a semaphore. The new strategy (4) uses less CPU than strategy (2), where all of the virtual threads are started at the same time, but still uses more CPU than the thread pooling approach in (3).

With strategy 4 as:

```
// strategy 4: the thread factory blocks on a semaphore before creating each
// virtual thread; the permit is released when the task completes
case 4 -> {
  var factorySem = new Semaphore(parallelism);
  ThreadFactory tf = (Runnable r) -> {
    try {
      factorySem.acquire();
    } catch (InterruptedException ex) {
      throw new IllegalStateException("interrupted");
    }
    return Thread.ofVirtual().unstarted(() -> {
      try {
        r.run();
      } finally {
        factorySem.release();
      }
    });
  };
  executor = Executors.newThreadPerTaskExecutor(tf);
```

The results are:

# 128 CPUs

Benchmark 1: java Main -- 1 10000
  Time (mean ± σ):   719.7 ms ± 45.9 ms  [User: 23732.3 ms, System: 1155.3 ms]
  Range (min … max): 651.1 ms … 794.0 ms  10 runs

Benchmark 2: java Main -- 2 10000
  Time (mean ± σ):   635.8 ms ± 66.9 ms  [User: 21271.1 ms, System: 1685.1 ms]
  Range (min … max): 565.5 ms … 769.6 ms  10 runs

Benchmark 3: java Main -- 3 10000
  Time (mean ± σ):   486.2 ms ± 9.3 ms  [User: 10804.0 ms, System: 1181.6 ms]
  Range (min … max): 464.5 ms … 495.1 ms  10 runs

Benchmark 4: java Main -- 4 10000
  Time (mean ± σ):   479.9 ms ± 3.7 ms  [User: 8366.7 ms, System: 1201.9 ms]
  Range (min … max): 474.8 ms … 488.0 ms  10 runs

Summary
  java Main -- 4 10000 ran
    1.01 ± 0.02 times faster than java Main -- 3 10000
    1.32 ± 0.14 times faster than java Main -- 2 10000
    1.50 ± 0.10 times faster than java Main -- 1 10000

Benchmark 1: java Main -- 1 100000
  Time (mean ± σ):   3.803 s ± 0.044 s  [User: 41.227 s, System: 2.418 s]
  Range (min … max): 3.737 s … 3.874 s  10 runs

Benchmark 2: java Main -- 2 100000
  Time (mean ± σ):   3.761 s ± 0.063 s  [User: 53.494 s, System: 3.380 s]
  Range (min … max): 3.651 s … 3.848 s  10 runs

Benchmark 3: java Main -- 3 100000
  Time (mean ± σ):   3.555 s ± 0.040 s  [User: 20.824 s, System: 2.350 s]
  Range (min … max): 3.529 s … 3.667 s  10 runs

Benchmark 4: java Main -- 4 100000
  Time (mean ± σ):   3.558 s ± 0.011 s  [User: 28.272 s, System: 3.018 s]
  Range (min … max): 3.544 s … 3.581 s  10 runs

Summary
  java Main -- 3 100000 ran
    1.00 ± 0.01 times faster than java Main -- 4 100000
    1.06 ± 0.02 times faster than java Main -- 2 100000
    1.07 ± 0.02 times faster than java Main -- 1 100000

Benchmark 1: java Main -- 1 1000000
  Time (mean ± σ):   34.182 s ± 0.020 s  [User: 165.208 s, System: 15.016 s]
  Range (min … max): 34.148 s … 34.205 s  10 runs

Benchmark 2: java Main -- 2 1000000
  Time (mean ± σ):   34.917 s ± 0.270 s  [User: 335.774 s, System: 23.993 s]
  Range (min … max): 34.575 s … 35.457 s  10 runs
Benchmark 3: java Main -- 3 1000000
  Time (mean ± σ):   36.146 s ± 3.468 s  [User: 163.926 s, System: 10.816 s]
  Range (min … max): 34.209 s … 45.588 s  10 runs

Benchmark 4: java Main -- 4 1000000
  Time (mean ± σ):   34.390 s ± 0.051 s  [User: 216.980 s, System: 22.147 s]
  Range (min … max): 34.303 s … 34.446 s  10 runs

Summary
  java Main -- 1 1000000 ran
    1.01 ± 0.00 times faster than java Main -- 4 1000000
    1.02 ± 0.01 times faster than java Main -- 2 1000000
    1.06 ± 0.10 times faster than java Main -- 3 1000000

# 16 CPUs

Benchmark 1: java Main -- 1 10000
  Time (mean ± σ):   825.6 ms ± 50.6 ms  [User: 6863.7 ms, System: 779.8 ms]
  Range (min … max): 745.2 ms … 936.4 ms  10 runs

Benchmark 2: java Main -- 2 10000
  Time (mean ± σ):   515.7 ms ± 8.4 ms  [User: 4591.8 ms, System: 274.2 ms]
  Range (min … max): 502.7 ms … 532.9 ms  10 runs

Benchmark 3: java Main -- 3 10000
  Time (mean ± σ):   494.8 ms ± 34.9 ms  [User: 3751.9 ms, System: 190.6 ms]
  Range (min … max): 463.3 ms … 580.0 ms  10 runs

Benchmark 4: java Main -- 4 10000
  Time (mean ± σ):   473.2 ms ± 7.5 ms  [User: 3851.2 ms, System: 248.5 ms]
  Range (min … max): 464.1 ms … 483.3 ms  10 runs

Summary
  java Main -- 4 10000 ran
    1.05 ± 0.08 times faster than java Main -- 3 10000
    1.09 ± 0.02 times faster than java Main -- 2 10000
    1.74 ± 0.11 times faster than java Main -- 1 10000

Benchmark 1: java Main -- 1 100000
  Time (mean ± σ):   3.949 s ± 0.099 s  [User: 32.922 s, System: 2.386 s]
  Range (min … max): 3.882 s … 4.222 s  10 runs

Benchmark 2: java Main -- 2 100000
  Time (mean ± σ):   3.739 s ± 0.063 s  [User: 31.259 s, System: 2.117 s]
  Range (min … max): 3.661 s … 3.832 s  10 runs

Benchmark 3: java Main -- 3 100000
  Time (mean ± σ):   5.106 s ± 0.990 s  [User: 20.897 s, System: 0.686 s]
  Range (min … max): 3.955 s … 7.679 s  10 runs

Benchmark 4: java Main -- 4 100000
  Time (mean ± σ):   3.570 s ± 0.013 s  [User: 27.292 s, System: 1.807 s]
  Range (min … max): 3.557 s … 3.594 s  10 runs

Summary
  java Main -- 4 100000 ran
    1.05 ± 0.02 times faster than java Main -- 2 100000
    1.11 ± 0.03 times faster than java Main -- 1 100000
    1.43 ± 0.28 times faster than java Main -- 3 100000

Benchmark 1: java Main -- 1 1000000
  Time (mean ± σ):   34.536 s ± 0.076 s  [User: 263.045 s, System: 16.227 s]
  Range (min … max): 34.457 s … 34.717 s  10 runs

Benchmark 2: java Main -- 2 1000000
  Time (mean ± σ):   35.257 s ± 0.191 s  [User: 285.159 s, System: 19.702 s]
  Range (min … max): 35.015 s … 35.561 s  10 runs

Benchmark 3: java Main -- 3 1000000
  Time (mean ± σ):   60.157 s ± 10.130 s  [User: 171.901 s, System: 3.902 s]
  Range (min … max): 46.437 s … 74.799 s  10 runs
Benchmark 4: java Main -- 4 1000000
  Time (mean ± σ):   34.370 s ± 0.067 s  [User: 256.460 s, System: 19.605 s]
  Range (min … max): 34.299 s … 34.455 s  10 runs

Summary
  java Main -- 4 1000000 ran
    1.00 ± 0.00 times faster than java Main -- 1 1000000
    1.03 ± 0.01 times faster than java Main -- 2 1000000
    1.75 ± 0.29 times faster than java Main -- 3 1000000

On Thu, May 30, 2024 at 11:22 AM Ron Pressler <ron.pressler@oracle.com> wrote:
<br class="">
> On 28 May 2024, at 23:43, Liam Miller-Cushon <<a href="mailto:cushon@google.com" target="_blank" class="">cushon@google.com</a>> wrote:<br class="">
> <br class="">
> Hello,<br class="">
> <br class="">
> JEP 444 [1] and the docs in [2] mention that virtual threads should not be pooled, and suggest semaphores as one possible alternative.<br class="">
> <br class="">
> My colleague Chi Wang has been investigating virtual thread performance, and has found that creating one virtual thread per task and synchronizing on a semaphore can result in worse performance on machines with large numbers of cores.<br class="">
> <br class="">
> A benchmark run on a 128 core machine is included below. It submits numTasks tasks to the executor determined by the strategy. The task itself is mixed with CPU and I/O work (simulated by fibonacci and sleep). The parallelism is set to 600 for all strategies.<br class="">
> <br class="">
> * Strategy 1 is the baseline where it just submits all the tasks to a ForkJoinPool whose pool size is 600.<br class="">
> * Strategy 2 uses the method suggested by JEP 444.<br class="">
> * Strategy 3 uses a fixed thread pool of 600 backed by virtual threads.<br class="">
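For concreteness, a rough sketch of how those three strategies might be wired up (the actual Main.java is not shown in this thread, so the setup below and the fibonacci/sleep task body are assumptions, not the real benchmark code):

```
import java.util.concurrent.*;

class StrategiesSketch {
    static final int PARALLELISM = 600;

    static ExecutorService executorFor(int strategy) {
        return switch (strategy) {
            // (1) baseline: a ForkJoinPool with a pool size of 600
            case 1 -> new ForkJoinPool(PARALLELISM);
            // (2) JEP 444 style: unpooled virtual threads, with concurrency
            //     limited by a semaphore acquired inside each task
            case 2 -> Executors.newVirtualThreadPerTaskExecutor();
            // (3) a fixed pool of 600 threads backed by virtual threads
            case 3 -> Executors.newFixedThreadPool(
                    PARALLELISM, Thread.ofVirtual().factory());
            default -> throw new IllegalArgumentException("unknown strategy");
        };
    }

    public static void main(String[] args) throws Exception {
        int strategy = Integer.parseInt(args[0]);
        int numTasks = Integer.parseInt(args[1]);
        Semaphore sem = new Semaphore(PARALLELISM); // used by strategy 2 only
        try (ExecutorService executor = executorFor(strategy)) {
            for (int i = 0; i < numTasks; i++) {
                executor.submit(() -> {
                    try {
                        if (strategy == 2) sem.acquire();
                        try {
                            fib(25);          // simulated CPU work
                            Thread.sleep(10); // simulated I/O
                        } finally {
                            if (strategy == 2) sem.release();
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
        } // close() waits for the submitted tasks to finish
    }

    static long fib(int n) { return n < 2 ? n : fib(n - 1) + fib(n - 2); }
}
```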
> 
> Note that, with 100K and 1M tasks, strategy 2 has a CPU time regression that seems to increase with the number of cores. This result can be reproduced on the real-world program that is being migrated to a virtual-thread-per-task model.
> 
> Diffing the CPU profile between strategy 1 and strategy 2 showed that most of the CPU regression comes from method `java.util.concurrent.ForkJoinPool.scan(java.util.concurrent.ForkJoinPool$WorkQueue, long, int)`.
> 
> Are there any ideas for why the semaphore strategy uses more CPU than pooling virtual threads on machines with a large number of cores?

My first guess would be that what's being compared here is not the same algorithm. In strategy 2, you spawn a bunch of different threads that do a tiny bit of work before starting the task, while in the other strategies the algorithm is different.
In a realistic workload you will have a number of threads serving requests, and then they may want to access a concurrency-limited resource. I think you can simulate that by having, say, 1000 tasks running in some large thread pool, where each of them submits one subtask to a fixed thread pool of size 10. For virtual threads, have 1000 virtual threads that try to acquire a semaphore with 10 leases.
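A minimal sketch of that comparison (the 1000/10 numbers come from the suggestion above; the executor choices and the placeholder task body are illustrative assumptions):

```
import java.util.concurrent.*;

class OrangesToOranges {
    public static void main(String[] args) throws Exception {
        int requests = 1000; // outer "request-handling" tasks
        int limit = 10;      // size of the concurrency-limited resource

        // Platform-thread version: requests run in a large pool and hand the
        // limited work to a fixed pool of 10 threads.
        try (var limitedPool = Executors.newFixedThreadPool(limit);
             var requestPool = Executors.newFixedThreadPool(requests)) {
            for (int i = 0; i < requests; i++) {
                requestPool.submit(() -> {
                    Future<?> f = limitedPool.submit(OrangesToOranges::limitedWork);
                    try { f.get(); } catch (Exception e) { throw new RuntimeException(e); }
                });
            }
        }

        // Virtual-thread version: one virtual thread per request, with the
        // limited section guarded by a semaphore with 10 leases.
        Semaphore leases = new Semaphore(limit);
        try (var perTask = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < requests; i++) {
                perTask.submit(() -> {
                    try {
                        leases.acquire();
                        try { limitedWork(); } finally { leases.release(); }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
        }
    }

    static void limitedWork() {
        // placeholder for access to the concurrency-limited resource
        try { Thread.sleep(10); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```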

Can you try that? At least then we'll be comparing oranges to oranges.

— Ron