Thinking about this some more, the numbers are as expected. It doesn't matter whether you have 16 cores or 128 cores - the wall time will be the same.

All of the tasks are essentially "waiting" nearly 100% of the time.

Assume the wall time per task is 20 ms. If you have 1_000_000 tasks and you can execute 600 at the same time, it will take (1_000_000 / 600) * 20 ms ≈ 33 secs, which is basically what the wall time is in every case.

If you look at the CPU (user & system) time - which in this case equates to the scheduler time plus overhead (e.g. semaphore handling):

java Main -- 4 1000000  153.96s user 4.66s system 432% cpu 36.678 total
java Main -- 3 1000000  149.55s user 4.68s system 414% cpu 37.184 total

the difference is about 5 seconds over 1 million tasks - an additional overhead of roughly 5 microseconds on a task that takes 20 ms, or 0.025%.

When the number of carrier threads is increased beyond the physical cores, it is a different story:

java -Djdk.virtualThreadScheduler.parallelism=128 Main -- 4 1000000  189.07s user 11.25s system 522% cpu 38.345 total
java -Djdk.virtualThreadScheduler.parallelism=128 Main -- 3 1000000  151.15s user 4.25s system 417% cpu 37.242 total

and the flame graphs tell an interesting story. With the pooled thread executor, when a thread completes a task it checks whether there is a ready task to grab before going to sleep (hence the two different fibonacci peaks and the different execution paths). In the thread-per-task model that is not an option, so the fork-join carrier pool has to wake up to find a virtual thread that is ready to run (thus the much higher system time).
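For reference, a minimal sketch of the two executor shapes being compared (the 600 limit comes from the benchmark earlier in the thread; the task body and names are illustrative):

```
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

public class ExecutorShapes {
    public static void main(String[] args) {
        Runnable task = () -> { /* fibonacci + sleep in the benchmark */ };

        // Strategy 3: a fixed pool of 600 virtual threads. A pooled worker
        // that finishes a task can dequeue the next one directly instead of
        // terminating, so the carrier threads rarely need to park/unpark.
        ExecutorService pooled =
                Executors.newFixedThreadPool(600, Thread.ofVirtual().factory());
        pooled.execute(task);
        pooled.shutdown();

        // Strategy 2: a new virtual thread per task, with concurrency capped
        // by a semaphore inside the task. Every thread terminates after one
        // task, so the fork-join carrier pool must wake and scan to find the
        // next runnable virtual thread.
        Semaphore limit = new Semaphore(600);
        ExecutorService perTask = Executors.newVirtualThreadPerTaskExecutor();
        perTask.execute(() -> {
            try {
                limit.acquire();
                try {
                    task.run();
                } finally {
                    limit.release();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        perTask.shutdown();
    }
}
```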
In the first test above (default parallelism), there are so few carrier threads relative to the 600 virtual threads that there is almost always a virtual thread ready to run, so the scheduler is not parking/waking.

[flame graph images attached]

On Jun 4, 2024, at 7:16 PM, Liam Miller-Cushon <cushon@google.com> wrote:

Yes, I think the Ryzen 3995WX has 64 real cores; 128 was with hyperthreading enabled.

I don't have numbers from it yet (it isn't my machine). I'll get numbers with hyperthreading disabled and with your VirtualThreadExecutorService approach from [1] and report back.

[1] https://mail.openjdk.org/pipermail/loom-dev/2024-May/006627.html

On Tue, Jun 4, 2024 at 4:35 PM Robert Engels <robaho@icloud.com> wrote:

I know I sent a bunch of replies, but did you by chance run the tests using the VirtualThreadExecutorService that doesn't have the semaphore?

But, more importantly, as I also pointed out, something seems way off in the timings. I don't think the 128-core machine actually has 128 available cores - the timings are way too high. With this test you should see a nearly linear increase in performance as the number of carrier threads increases; the tests here aren't even close. I am guessing these are virtual cores backed by their own scheduler on a shared VM.

With virtual threads, the number of carrier threads needs to match the real available cores, or you will waste CPU thrashing the scheduler.
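For example, pinning the carrier pool to the physical core count might look like this (jdk.virtualThreadScheduler.parallelism and its companion maxPoolSize are existing JDK system properties; the value 64 matches the physical core count mentioned above and is an assumption for this machine):

java -Djdk.virtualThreadScheduler.parallelism=64 -Djdk.virtualThreadScheduler.maxPoolSize=64 Main -- 2 1000000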
On Jun 4, 2024, at 6:22 PM, Liam Miller-Cushon <cushon@google.com> wrote:

Hi,

Thanks for the comments. There's more to dig into here and to respond to, but I have a few notes for now.

re: whether the question is about making this code faster or why case (2) scales weirdly, it's sort of both - we're interested in understanding (2) better, but if (2) is expected to be expensive compared to pushing back, that's helpful.

To provide some more color on the real-world program that inspired the question: it's the Bazel build system (https://github.com/google/bazel). Bazel constructs a graph of build 'actions', where each node is a command to run during the build (for example, a call to javac that takes some source files and classpath jars as inputs and produces class file outputs). The current approach uses a pool of physical threads to execute actions. Actions can block on IO, and especially on RPCs to remote machines that are performing build steps. The goal is to use virtual threads to make the actions async, and to replace the limit on the number of concurrent actions with a dynamic limit based on the availability of other resources (memory, and the number of in-flight RPCs to remote build machines). As a first step, Chi was trying to get a baseline where the virtual-thread-based implementation used the same concurrency limit as the previous approach, ideally with roughly the same CPU overhead.

I have more benchmarking results from the original machine using one of the suggested approaches: limiting the number of in-flight virtual threads with a ThreadFactory that blocks on a semaphore.
The new strategy (4) uses less CPU than strategy (2), where all of the virtual threads are started at the same time, but still uses more CPU than the thread-pooling approach in (3).

Strategy 4 is:

```
case 4 -> {
    var factorySem = new Semaphore(parallelism);
    ThreadFactory tf = (Runnable r) -> {
        try {
            factorySem.acquire();
        } catch (InterruptedException ex) {
            throw new IllegalStateException("interrupted");
        }
        return Thread.ofVirtual().unstarted(() -> {
            try {
                r.run();
            } finally {
                factorySem.release();
            }
        });
    };
    executor = Executors.newThreadPerTaskExecutor(tf);
}
```
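Presumably the driver then submits tasks in a loop along these lines (illustrative sketch; tf, numTasks, and task are from the surrounding benchmark). Note that with this factory the semaphore is acquired in the submitting thread, so submission itself blocks once `parallelism` threads are in flight:

```
try (ExecutorService executor = Executors.newThreadPerTaskExecutor(tf)) {
    for (int i = 0; i < numTasks; i++) {
        executor.execute(task); // blocks in the factory's acquire() at the limit
    }
} // implicit close() waits for all started threads to finish
```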
The results are:

# 128 CPUs

Benchmark 1: java Main -- 1 10000
  Time (mean ± σ):     719.7 ms ± 45.9 ms    [User: 23732.3 ms, System: 1155.3 ms]
  Range (min … max):   651.1 ms … 794.0 ms    10 runs

Benchmark 2: java Main -- 2 10000
  Time (mean ± σ):     635.8 ms ± 66.9 ms    [User: 21271.1 ms, System: 1685.1 ms]
  Range (min … max):   565.5 ms … 769.6 ms    10 runs

Benchmark 3: java Main -- 3 10000
  Time (mean ± σ):     486.2 ms ± 9.3 ms    [User: 10804.0 ms, System: 1181.6 ms]
  Range (min … max):   464.5 ms … 495.1 ms    10 runs

Benchmark 4: java Main -- 4 10000
  Time (mean ± σ):     479.9 ms ± 3.7 ms    [User: 8366.7 ms, System: 1201.9 ms]
  Range (min … max):   474.8 ms … 488.0 ms    10 runs

Summary
  java Main -- 4 10000 ran
    1.01 ± 0.02 times faster than java Main -- 3 10000
    1.32 ± 0.14 times faster than java Main -- 2 10000
    1.50 ± 0.10 times faster than java Main -- 1 10000

Benchmark 1: java Main -- 1 100000
  Time (mean ± σ):     3.803 s ± 0.044 s    [User: 41.227 s, System: 2.418 s]
  Range (min … max):   3.737 s … 3.874 s    10 runs

Benchmark 2: java Main -- 2 100000
  Time (mean ± σ):     3.761 s ± 0.063 s    [User: 53.494 s, System: 3.380 s]
  Range (min … max):   3.651 s … 3.848 s    10 runs

Benchmark 3: java Main -- 3 100000
  Time (mean ± σ):     3.555 s ± 0.040 s    [User: 20.824 s, System: 2.350 s]
  Range (min … max):   3.529 s … 3.667 s    10 runs

Benchmark 4: java Main -- 4 100000
  Time (mean ± σ):     3.558 s ± 0.011 s    [User: 28.272 s, System: 3.018 s]
  Range (min … max):   3.544 s … 3.581 s    10 runs

Summary
  java Main -- 3 100000 ran
    1.00 ± 0.01 times faster than java Main -- 4 100000
    1.06 ± 0.02 times faster than java Main -- 2 100000
    1.07 ± 0.02 times faster than java Main -- 1 100000

Benchmark 1: java Main -- 1 1000000
  Time (mean ± σ):     34.182 s ± 0.020 s    [User: 165.208 s, System: 15.016 s]
  Range (min … max):   34.148 s … 34.205 s    10 runs

Benchmark 2: java Main -- 2 1000000
  Time (mean ± σ):     34.917 s ± 0.270 s    [User: 335.774 s, System: 23.993 s]
  Range (min … max):   34.575 s … 35.457 s    10 runs

Benchmark 3: java Main -- 3 1000000
  Time (mean ± σ):     36.146 s ± 3.468 s    [User: 163.926 s, System: 10.816 s]
  Range (min … max):   34.209 s … 45.588 s    10 runs

Benchmark 4: java Main -- 4 1000000
  Time (mean ± σ):     34.390 s ± 0.051 s    [User: 216.980 s, System: 22.147 s]
  Range (min … max):   34.303 s … 34.446 s    10 runs

Summary
  java Main -- 1 1000000 ran
    1.01 ± 0.00 times faster than java Main -- 4 1000000
    1.02 ± 0.01 times faster than java Main -- 2 1000000
    1.06 ± 0.10 times faster than java Main -- 3 1000000

# 16 CPUs

Benchmark 1: java Main -- 1 10000
  Time (mean ± σ):     825.6 ms ± 50.6 ms    [User: 6863.7 ms, System: 779.8 ms]
  Range (min … max):   745.2 ms … 936.4 ms    10 runs

Benchmark 2: java Main -- 2 10000
  Time (mean ± σ):     515.7 ms ± 8.4 ms    [User: 4591.8 ms, System: 274.2 ms]
  Range (min … max):   502.7 ms … 532.9 ms    10 runs

Benchmark 3: java Main -- 3 10000
  Time (mean ± σ):     494.8 ms ± 34.9 ms    [User: 3751.9 ms, System: 190.6 ms]
  Range (min … max):   463.3 ms … 580.0 ms    10 runs

Benchmark 4: java Main -- 4 10000
  Time (mean ± σ):     473.2 ms ± 7.5 ms    [User: 3851.2 ms, System: 248.5 ms]
  Range (min … max):   464.1 ms … 483.3 ms    10 runs

Summary
  java Main -- 4 10000 ran
    1.05 ± 0.08 times faster than java Main -- 3 10000
    1.09 ± 0.02 times faster than java Main -- 2 10000
    1.74 ± 0.11 times faster than java Main -- 1 10000

Benchmark 1: java Main -- 1 100000
  Time (mean ± σ):     3.949 s ± 0.099 s    [User: 32.922 s, System: 2.386 s]
  Range (min … max):   3.882 s … 4.222 s    10 runs

Benchmark 2: java Main -- 2 100000
  Time (mean ± σ):     3.739 s ± 0.063 s    [User: 31.259 s, System: 2.117 s]
  Range (min … max):   3.661 s … 3.832 s    10 runs

Benchmark 3: java Main -- 3 100000
  Time (mean ± σ):     5.106 s ± 0.990 s    [User: 20.897 s, System: 0.686 s]
  Range (min … max):   3.955 s … 7.679 s    10 runs

Benchmark 4: java Main -- 4 100000
  Time (mean ± σ):     3.570 s ± 0.013 s    [User: 27.292 s, System: 1.807 s]
  Range (min … max):   3.557 s … 3.594 s    10 runs

Summary
  java Main -- 4 100000 ran
    1.05 ± 0.02 times faster than java Main -- 2 100000
    1.11 ± 0.03 times faster than java Main -- 1 100000
    1.43 ± 0.28 times faster than java Main -- 3 100000

Benchmark 1: java Main -- 1 1000000
  Time (mean ± σ):     34.536 s ± 0.076 s    [User: 263.045 s, System: 16.227 s]
  Range (min … max):   34.457 s … 34.717 s    10 runs

Benchmark 2: java Main -- 2 1000000
  Time (mean ± σ):     35.257 s ± 0.191 s    [User: 285.159 s, System: 19.702 s]
  Range (min … max):   35.015 s … 35.561 s    10 runs

Benchmark 3: java Main -- 3 1000000
  Time (mean ± σ):     60.157 s ± 10.130 s    [User: 171.901 s, System: 3.902 s]
  Range (min … max):   46.437 s … 74.799 s    10 runs
Benchmark 4: java Main -- 4 1000000
  Time (mean ± σ):     34.370 s ± 0.067 s    [User: 256.460 s, System: 19.605 s]
  Range (min … max):   34.299 s … 34.455 s    10 runs

Summary
  java Main -- 4 1000000 ran
    1.00 ± 0.00 times faster than java Main -- 1 1000000
    1.03 ± 0.01 times faster than java Main -- 2 1000000
    1.75 ± 0.29 times faster than java Main -- 3 1000000

On Thu, May 30, 2024 at 11:22 AM Ron Pressler <ron.pressler@oracle.com> wrote:
<br class="">
> On 28 May 2024, at 23:43, Liam Miller-Cushon <cushon@google.com> wrote:
> 
> Hello,
> 
> JEP 444 [1] and the docs in [2] mention that virtual threads should not be pooled, and suggest semaphores as one possible alternative.
> 
> My colleague Chi Wang has been investigating virtual thread performance, and has found that creating one virtual thread per task and synchronizing on a semaphore can result in worse performance on machines with large numbers of cores.
> 
> A benchmark run on a 128-core machine is included below. It submits numTasks tasks to the executor determined by the strategy. The task itself mixes CPU and I/O work (simulated by fibonacci and sleep; a sketch follows below the list). The parallelism is set to 600 for all strategies.
> 
> * Strategy 1 is the baseline: it just submits all the tasks to a ForkJoinPool whose pool size is 600.
> * Strategy 2 uses the method suggested by JEP 444.
> * Strategy 3 uses a fixed thread pool of 600 backed by virtual threads.
> 
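> (A sketch of the task shape described above - the fibonacci depth and the sleep duration here are assumptions, not the benchmark's actual values:)
> 
>     static long fibonacci(int n) {
>         return n <= 1 ? n : fibonacci(n - 1) + fibonacci(n - 2);
>     }
> 
>     static void task() {
>         fibonacci(25);            // CPU-bound portion
>         try {
>             Thread.sleep(10);     // simulated I/O
>         } catch (InterruptedException e) {
>             Thread.currentThread().interrupt();
>         }
>     }
> 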
> Note that, with 100K and 1M tasks, strategy 2 has a CPU time regression that seems to increase with the number of cores. This result can be reproduced on the real-world program that is being migrated to a virtual-thread-per-task model.
> 
> Diffing the CPU profile between strategy 1 and strategy 2 showed that most of the CPU regression comes from the method java.util.concurrent.ForkJoinPool.scan(java.util.concurrent.ForkJoinPool$WorkQueue, long, int).
> 
> Are there any ideas for why the semaphore strategy uses more CPU than pooling virtual threads on machines with a large number of cores?
<br class="">
My first guess would be that what’s being compared here is not the same algorithm. In strategy 2, you spawn a bunch of different threads that do a tiny bit of work before starting the task, while in the other strategies the algorithm is different.<br class="">
<br class="">
In a realistic workload you will have a number of threads serving requests, and then they may want to access a concurrency-limited resource. I think you can try to simulate that by simulating, say, 1000 tasks running in some large thread pool and then each of them would submit one subtasks to a fixed thread pool of size 10. For virtual threads, have 1000 virtual threads that try to acquire a semaphore with 10 leases.<br class="">
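A sketch of that comparison (the pool sizes are from the suggestion above; the subtask body and names are illustrative):

```
import java.util.concurrent.*;

public class OrangesToOranges {
    static void subtask() { /* access to the concurrency-limited resource */ }

    public static void main(String[] args) {
        // Platform-thread shape: 1000 request threads, each submitting one
        // subtask to a fixed pool of size 10 and waiting for the result.
        // (limitedPool is declared first so it is closed last, after all
        // request tasks have finished submitting to it.)
        try (var limitedPool = Executors.newFixedThreadPool(10);
             var requestPool = Executors.newFixedThreadPool(1000)) {
            for (int i = 0; i < 1000; i++) {
                requestPool.execute(() -> {
                    Future<?> f = limitedPool.submit(OrangesToOranges::subtask);
                    try {
                        f.get();
                    } catch (InterruptedException | ExecutionException e) {
                        throw new RuntimeException(e);
                    }
                });
            }
        }

        // Virtual-thread shape: 1000 virtual threads, each acquiring one of
        // 10 semaphore leases around the same subtask.
        var semaphore = new Semaphore(10);
        try (var vts = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 1000; i++) {
                vts.execute(() -> {
                    try {
                        semaphore.acquire();
                        try {
                            subtask();
                        } finally {
                            semaphore.release();
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
        }
    }
}
```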
<br class="">
Can you try that? At least then we’ll be comparing oranges to oranges.<br class="">
<br class="">
— Ron</blockquote></div>