<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">To clarify, with option #2, you potentially have a million threads contending on the semaphore.<br class=""><div><br class=""><blockquote type="cite" class=""><div class="">On May 28, 2024, at 6:12 PM, robert engels <<a href="mailto:rengels@ix.netcom.com" class="">rengels@ix.netcom.com</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><meta http-equiv="Content-Type" content="text/html; charset=utf-8" class=""><div style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">I don’t think this code is written correctly. With the implementation of option #2, the system is going to create a virtual thread for every task up front, then try to schedule them onto the 128 carrier threads; then, once a task runs, it is going to try to limit concurrency using a semaphore.<div class=""><br class=""></div><div class="">I think this is inverted: you should have the Executor use the semaphore to limit concurrent virtual thread creation in order to get similar performance.</div><div class=""><br class=""></div><div class="">The way it is written now, the virtual thread scheduler has to manage 1 million “virtual” threads, whereas in the native thread implementation the system is only managing 600.</div><div class=""><br class=""></div><div class="">The reason the CPU usage is much higher is that you have the semaphore code enabled only for option 2. 
You should be able to create and use the semaphore in all implementations; then you are comparing like for like, plus the cost of the scheduler having to manage 1M threads.</div><div class=""><br class=""></div><div class=""><br class=""><div class=""><br class=""><blockquote type="cite" class=""><div class="">On May 28, 2024, at 5:43 PM, Liam Miller-Cushon <<a href="mailto:cushon@google.com" class="">cushon@google.com</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div dir="ltr" class="">Hello,<div class=""><br class="">JEP 444 [1] and the docs in [2] mention that virtual threads should not be pooled, and suggest semaphores as one possible alternative.</div><div class=""><br class="">My colleague Chi Wang has been investigating virtual thread performance, and has found that creating one virtual thread per task and synchronizing on a semaphore can result in worse performance on machines with large numbers of cores.<br class=""><br class="">A benchmark run on a 128-core machine is included below. It submits numTasks tasks to the executor determined by the strategy. Each task mixes CPU and I/O work (simulated by fibonacci and sleep). The parallelism is set to 600 for all strategies.</div><div class=""><br class=""></div><div class="">* Strategy 1 is the baseline: it just submits all the tasks to a ForkJoinPool whose parallelism is 600.<br class="">* Strategy 2 uses the method suggested by JEP 444 (one virtual thread per task, with concurrency limited by a semaphore).<br class="">* Strategy 3 uses a fixed thread pool of 600 backed by virtual threads.<br class=""><br class="">Note that, with 100K and 1M tasks, strategy 2 has a CPU time regression that seems to increase with the number of cores. 
This result can be reproduced on the real-world program that is being migrated to a virtual-thread-per-task model.<br class=""><br class="">Diffing the cpu profile between strategy 1 and strategy 2 showed that most of the CPU regression comes from method `java.util.concurrent.ForkJoinPool.scan(java.util.concurrent.ForkJoinPool$WorkQueue, long, int)`.<br class=""><br class="">Are there any ideas for why the semaphore strategy uses more CPU than pooling virtual threads on machines with a large number of cores?<br class=""><br class="">Thanks,<br class="">Liam<br class=""><br class="">[1] <a href="https://openjdk.org/jeps/444#Do-not-pool-virtual-threads" class="">https://openjdk.org/jeps/444#Do-not-pool-virtual-threads</a><br class="">[2] <a href="https://docs.oracle.com/en/java/javase/22/core/virtual-threads.html#GUID-2BCFC2DD-7D84-4B0C-9222-97F9C7C6C521" class="">https://docs.oracle.com/en/java/javase/22/core/virtual-threads.html#GUID-2BCFC2DD-7D84-4B0C-9222-97F9C7C6C521</a><br class=""><br class=""><font face="monospace" class="">import java.util.concurrent.ExecutorService;<br class="">import java.util.concurrent.Executors;<br class="">import java.util.concurrent.ForkJoinPool;<br class="">import java.util.concurrent.Semaphore;<br class=""><br class="">public class Main {<br class=""><br class=""> private static Semaphore semaphore = null;<br class=""><br class=""> public static void main(String[] args) {<br class=""> int strategy = 0;<br class=""> int parallelism = 600;<br class=""> int numTasks = 10000;<br class=""><br class=""> if (args.length > 1) {<br class=""> strategy = Integer.parseInt(args[1]);<br class=""> }<br class=""><br class=""> if (args.length > 2) {<br class=""> numTasks = Integer.parseInt(args[2]);<br class=""> }<br class=""><br class=""> ExecutorService executor;<br class=""> switch (strategy) {<br class=""> case 1 -> {<br class=""> executor = new ForkJoinPool(parallelism);<br class=""> }<br class=""> case 2 -> {<br class=""> executor = 
Executors.newVirtualThreadPerTaskExecutor();<br class=""> semaphore = new Semaphore(parallelism);<br class=""> }<br class=""> case 3 -> {<br class=""> executor = Executors.newFixedThreadPool(parallelism, Thread.ofVirtual().factory());<br class=""> }<br class=""> default -> {<br class=""> throw new IllegalArgumentException();<br class=""> }<br class=""> }<br class=""><br class=""> try (executor) {<br class=""> for (var i = 0; i < numTasks; ++i) {<br class=""> executor.execute(Main::task);<br class=""> }<br class=""> }<br class=""> }<br class=""><br class=""> private static void task() {<br class=""> if (semaphore != null) {<br class=""> try {<br class=""> semaphore.acquire();<br class=""> } catch (InterruptedException e) {<br class=""> throw new IllegalStateException();<br class=""> }<br class=""> }<br class=""><br class=""> try {<br class=""> fibonacci(20);<br class=""> try {<br class=""> Thread.sleep(10);<br class=""> } catch (InterruptedException e) {<br class=""> }<br class=""> fibonacci(20);<br class=""> try {<br class=""> Thread.sleep(10);<br class=""> } catch (InterruptedException e) {<br class=""> }<br class=""> fibonacci(20);<br class=""> } finally {<br class=""> if (semaphore != null) {<br class=""> semaphore.release();<br class=""> }<br class=""> }<br class=""> }<br class=""><br class=""> private static int fibonacci(int n) {<br class=""> if (n == 0) {<br class=""> return 0;<br class=""> } else if (n == 1) {<br class=""> return 1;<br class=""> } else {<br class=""> return fibonacci(n - 1) + fibonacci(n - 2);<br class=""> }<br class=""> }<br class="">}</font><br class=""><br class=""><font face="monospace" class=""># openjdk full version "23-ea+24-1995"<br class=""># AMD Ryzen Threadripper PRO 3995WX, hyperthreading enabled<br class=""><br class=""></font></div><div class=""><font face="monospace" class="">$ hyperfine --parameter-scan strategy 1 3 --parameter-list numTasks 10000,100000,1000000 './jdk-23/bin/java Main -- {strategy} 
{numTasks}'</font></div><div class=""><font face="monospace" class=""><br class="">Benchmark 1: ./jdk-23/bin/java Main -- 1 10000<br class=""> Time (mean ± σ): 658.3 ms ± 24.4 ms [User: 26808.8 ms, System: 493.7 ms]</font><br class=""><font face="monospace" class=""> Range (min … max): 613.1 ms … 702.0 ms 10 runs</font><br class=""><font face="monospace" class=""> </font><br class=""><font face="monospace" class="">Benchmark 2: ./jdk-23/bin/java Main -- 2 10000</font><br class=""><font face="monospace" class=""> Time (mean ± σ): 486.9 ms ± 28.5 ms [User: 14804.4 ms, System: 501.5 ms]</font><br class=""><font face="monospace" class=""> Range (min … max): 451.0 ms … 533.4 ms 10 runs</font><br class=""><font face="monospace" class=""> </font><br class=""><font face="monospace" class="">Benchmark 3: ./jdk-23/bin/java Main -- 3 10000</font><br class=""><font face="monospace" class=""> Time (mean ± σ): 452.0 ms ± 10.8 ms [User: 6598.1 ms, System: 335.8 ms]</font><br class=""><font face="monospace" class=""> Range (min … max): 435.6 ms … 470.6 ms 10 runs</font><br class=""><font face="monospace" class=""> </font><br class=""><font face="monospace" class="">Benchmark 4: ./jdk-23/bin/java Main -- 1 100000</font><br class=""><font face="monospace" class=""> Time (mean ± σ): 3.668 s ± 0.028 s [User: 38.469 s, System: 1.188 s]</font><br class=""><font face="monospace" class=""> Range (min … max): 3.628 s … 3.704 s 10 runs</font><br class=""><font face="monospace" class=""> </font><br class=""><font face="monospace" class="">Benchmark 5: ./jdk-23/bin/java Main -- 2 100000</font><br class=""><font face="monospace" class=""> Time (mean ± σ): 3.612 s ± 0.042 s [User: 65.924 s, System: 2.072 s]</font><br class=""><font face="monospace" class=""> Range (min … max): 3.563 s … 3.687 s 10 runs</font><br class=""><font face="monospace" class=""> </font><br class=""><font face="monospace" class="">Benchmark 6: ./jdk-23/bin/java Main -- 3 100000</font><br class=""><font face="monospace" 
class=""> Time (mean ± σ): 3.503 s ± 0.008 s [User: 27.791 s, System: 1.211 s]</font><br class=""><font face="monospace" class=""> Range (min … max): 3.492 s … 3.515 s 10 runs</font><br class=""><font face="monospace" class=""> </font><br class=""><font face="monospace" class="">Benchmark 7: ./jdk-23/bin/java Main -- 1 1000000</font><br class=""><font face="monospace" class=""> Time (mean ± σ): 34.093 s ± 0.031 s [User: 206.235 s, System: 14.313 s]</font><br class=""><font face="monospace" class=""> Range (min … max): 34.015 s … 34.120 s 10 runs</font><br class=""><font face="monospace" class=""> </font><br class=""><font face="monospace" class="">Benchmark 8: ./jdk-23/bin/java Main -- 2 1000000</font><br class=""><font face="monospace" class=""> Time (mean ± σ): 34.354 s ± 0.063 s [User: 330.215 s, System: 17.501 s]</font><br class=""><font face="monospace" class=""> Range (min … max): 34.267 s … 34.479 s 10 runs</font><br class=""><font face="monospace" class=""> </font><br class=""><font face="monospace" class="">Benchmark 9: ./jdk-23/bin/java Main -- 3 1000000</font><br class=""><font face="monospace" class=""> Time (mean ± σ): 34.551 s ± 1.018 s [User: 238.050 s, System: 10.258 s]</font><br class=""><font face="monospace" class=""> Range (min … max): 34.124 s … 37.420 s 10 runs</font><br class=""><font face="monospace" class=""> </font><br class=""><font face="monospace" class="">Summary</font><br class=""><font face="monospace" class=""> ./jdk-23/bin/java Main -- 3 10000 ran</font><br class=""><font face="monospace" class=""> 1.08 ± 0.07 times faster than ./jdk-23/bin/java Main -- 2 10000</font><br class=""><font face="monospace" class=""> 1.46 ± 0.06 times faster than ./jdk-23/bin/java Main -- 1 10000</font><br class=""><font face="monospace" class=""> 7.75 ± 0.19 times faster than ./jdk-23/bin/java Main -- 3 100000</font><br class=""><font face="monospace" class=""> 7.99 ± 0.21 times faster than ./jdk-23/bin/java Main -- 2 100000</font><br class=""><font 
face="monospace" class=""> 8.12 ± 0.20 times faster than ./jdk-23/bin/java Main -- 1 100000</font><br class=""><font face="monospace" class=""> 75.43 ± 1.80 times faster than ./jdk-23/bin/java Main -- 1 1000000</font><br class=""><font face="monospace" class=""> 76.01 ± 1.82 times faster than ./jdk-23/bin/java Main -- 2 1000000</font><br class=""><font face="monospace" class=""> 76.44 ± 2.90 times faster than ./jdk-23/bin/java Main -- 3 1000000</font><br class=""></div></div>
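<div class=""><br class=""></div><div class="">For reference, the inversion suggested in the reply above (take the permit before creating the virtual thread, rather than inside the task) might look roughly like the sketch below. This is an illustration only, not code from the benchmark: the class name BoundedSubmit, the runAll method, the counters, and the 1 ms sleep standing in for the real task are all assumptions, and no performance claim is attached to it.</div>
<pre class="">
```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical class; not part of the benchmark posted in this thread.
final class BoundedSubmit {
    static final AtomicInteger running = new AtomicInteger();    // tasks executing right now
    static final AtomicInteger maxRunning = new AtomicInteger(); // highest concurrency observed
    static final AtomicInteger completed = new AtomicInteger();  // tasks finished

    static void runAll(int numTasks, int parallelism) {
        Semaphore permits = new Semaphore(parallelism);
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < numTasks; i++) {
                // The submitting thread blocks here, so a new virtual thread is
                // only created once a permit is free.
                permits.acquire();
                executor.execute(() -> {
                    try {
                        task();
                    } finally {
                        permits.release();
                    }
                });
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Placeholder for the real work (assumption: a short sleep instead of fibonacci + sleep).
    static void task() {
        int now = running.incrementAndGet();
        maxRunning.accumulateAndGet(now, Math::max);
        try {
            Thread.sleep(1);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            running.decrementAndGet();
            completed.incrementAndGet();
        }
    }

    public static void main(String[] args) {
        runAll(2000, 50);
        System.out.println("completed=" + completed.get() + " maxRunning=" + maxRunning.get());
    }
}
```
</pre>
<div class="">With this shape, the number of live virtual threads stays close to the permit count instead of growing to numTasks, at the cost of blocking the submitting thread while all permits are taken.</div>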
</div></blockquote></div><br class=""></div></div></div></blockquote></div><br class=""></body></html>