<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">Attached is a version that doesn’t throttle the submitter (it is not fully correct but enough for testing). It is better than scenario 2, not as good as 3. The increased number of carrier threads (without cores) hurts performance - as is expected. I would guess if the OP tries scenario 4 on his test machine, it will perform similar to scenario 3.<div class=""><br class=""></div><div class=""><div style="margin: 0px; font-stretch: normal; font-size: 14px; line-height: normal; font-family: Monaco; color: rgb(16, 16, 16); background-color: rgba(255, 255, 255, 0.85);" class=""><span style="font-variant-ligatures: no-common-ligatures" class="">iMac:vt_test robertengels$ time java -Djdk.virtualThreadScheduler.parallelism=128 Main dummy 4 1000000</span></div><div style="margin: 0px; font-stretch: normal; font-size: 14px; line-height: normal; font-family: Monaco; color: rgb(16, 16, 16); background-color: rgba(255, 255, 255, 0.85);" class=""><span style="font-variant-ligatures: no-common-ligatures" class="">all tasks submitted</span></div><div style="margin: 0px; font-stretch: normal; font-size: 14px; line-height: normal; font-family: Monaco; color: rgb(16, 16, 16); background-color: rgba(255, 255, 255, 0.85); min-height: 19px;" class=""><span style="font-variant-ligatures: no-common-ligatures" class=""></span><br class=""></div><div style="margin: 0px; font-stretch: normal; font-size: 14px; line-height: normal; font-family: Monaco; color: rgb(16, 16, 16); background-color: rgba(255, 255, 255, 0.85);" class=""><span style="font-variant-ligatures: no-common-ligatures" class="">real<span class="Apple-tab-span" style="white-space:pre">    </span>0m37.820s</span></div><div style="margin: 0px; font-stretch: normal; font-size: 14px; line-height: normal; font-family: Monaco; color: rgb(16, 16, 
16); background-color: rgba(255, 255, 255, 0.85);" class=""><span style="font-variant-ligatures: no-common-ligatures" class="">user<span class="Apple-tab-span" style="white-space:pre">      </span>2m56.453s</span></div><div style="margin: 0px; font-stretch: normal; font-size: 14px; line-height: normal; font-family: Monaco; color: rgb(16, 16, 16); background-color: rgba(255, 255, 255, 0.85);" class=""><span style="font-variant-ligatures: no-common-ligatures" class="">sys<span class="Apple-tab-span" style="white-space:pre">       </span>0m9.973s</span></div><div class=""><span style="font-variant-ligatures: no-common-ligatures" class=""><div style="margin: 0px; font-stretch: normal; font-size: 14px; line-height: normal; font-family: Monaco; color: rgb(16, 16, 16); background-color: rgba(255, 255, 255, 0.85);" class=""><span style="font-variant-ligatures: no-common-ligatures" class=""><br class=""></span></div><div style="margin: 0px; font-stretch: normal; font-size: 14px; line-height: normal; font-family: Monaco; color: rgb(16, 16, 16); background-color: rgba(255, 255, 255, 0.85);" class=""><span style="font-variant-ligatures: no-common-ligatures" class="">iMac:vt_test robertengels$ time java Main dummy 4 1000000</span></div><div style="margin: 0px; font-stretch: normal; font-size: 14px; line-height: normal; font-family: Monaco; color: rgb(16, 16, 16); background-color: rgba(255, 255, 255, 0.85);" class=""><span style="font-variant-ligatures: no-common-ligatures" class="">all tasks submitted</span></div><div style="margin: 0px; font-stretch: normal; font-size: 14px; line-height: normal; font-family: Monaco; color: rgb(16, 16, 16); background-color: rgba(255, 255, 255, 0.85); min-height: 19px;" class=""><span style="font-variant-ligatures: no-common-ligatures" class=""></span><br class=""></div><div style="margin: 0px; font-stretch: normal; font-size: 14px; line-height: normal; font-family: Monaco; color: rgb(16, 16, 16); background-color: rgba(255, 255, 255, 0.85);" 
class=""><span style="font-variant-ligatures: no-common-ligatures" class="">real<span class="Apple-tab-span" style="white-space:pre">        </span>0m37.992s</span></div><div style="margin: 0px; font-stretch: normal; font-size: 14px; line-height: normal; font-family: Monaco; color: rgb(16, 16, 16); background-color: rgba(255, 255, 255, 0.85);" class=""><span style="font-variant-ligatures: no-common-ligatures" class="">user<span class="Apple-tab-span" style="white-space:pre">      </span>2m25.169s</span></div><div style="margin: 0px; font-stretch: normal; font-size: 14px; line-height: normal; font-family: Monaco; color: rgb(16, 16, 16); background-color: rgba(255, 255, 255, 0.85);" class=""><span style="font-variant-ligatures: no-common-ligatures" class="">sys<span class="Apple-tab-span" style="white-space:pre">       </span>0m3.943s</span></div><div class=""><span style="font-variant-ligatures: no-common-ligatures" class=""><br class=""></span></div><div class=""><span style="font-variant-ligatures: no-common-ligatures" class="">import java.util.Collection;<br class="">import java.util.List;<br class="">import java.util.concurrent.AbstractExecutorService;<br class="">import java.util.concurrent.Callable;<br class="">import java.util.concurrent.ConcurrentLinkedQueue;<br class="">import java.util.concurrent.ExecutionException;<br class="">import java.util.concurrent.ExecutorService;<br class="">import java.util.concurrent.Executors;<br class="">import java.util.concurrent.ForkJoinPool;<br class="">import java.util.concurrent.Future;<br class="">import java.util.concurrent.Semaphore;<br class="">import java.util.concurrent.ThreadFactory;<br class="">import java.util.concurrent.TimeUnit;<br class="">import java.util.concurrent.TimeoutException;<br class="">import java.util.concurrent.atomic.AtomicInteger;<br class="">import java.util.concurrent.locks.LockSupport;<br class=""><br class="">public class Main {<br class=""><br class="">  private static Semaphore semaphore 
= null;<br class="">  private static int sink = 0;<br class=""><br class="">  public static void main(String[] args) {<br class="">    int strategy = 0;<br class="">    int parallelism = 600;<br class="">    int numTasks = 10000;<br class=""><br class="">    if (args.length > 1) {<br class="">      strategy = Integer.parseInt(args[1]);<br class="">    }<br class=""><br class="">    if (args.length > 2) {<br class="">      numTasks = Integer.parseInt(args[2]);<br class="">    }<br class=""><br class="">    ExecutorService executor;<br class="">    switch (strategy) {<br class="">      case 1 -> {<br class="">        executor = new ForkJoinPool(parallelism);<br class="">      }<br class="">      case 2 -> {<br class="">        executor = Executors.newVirtualThreadPerTaskExecutor();<br class="">        semaphore = new Semaphore(parallelism);<br class="">      }<br class="">      case 3 -> {<br class="">        executor = Executors.newFixedThreadPool(parallelism, Thread.ofVirtual().factory());<br class="">      }<br class="">      case 4 -> {<br class="">        executor = new VirtualThreadExecutorService(parallelism);<br class="">      }<br class="">      default -> {<br class="">        throw new IllegalArgumentException();<br class="">      }<br class="">    }<br class=""><br class="">    try (executor) {<br class="">      for (var i = 0; i < numTasks; ++i) {<br class="">        executor.execute(Main::task);<br class="">      }<br class="">      System.out.println("all tasks submitted");<br class="">    }<br class="">  }<br class=""><br class="">  private static void task() {<br class="">    if (semaphore != null) {<br class="">      try {<br class="">        semaphore.acquire();<br class="">      } catch (InterruptedException e) {<br class="">        throw new IllegalStateException();<br class="">      }<br class="">    }<br class=""><br class="">    try {<br class="">      Main:sink += fibonacci(20);<br class="">      try {<br class="">        Thread.sleep(10);<br 
class="">      } catch (InterruptedException e) {<br class="">      }<br class="">      Main:sink += fibonacci(20);<br class="">      try {<br class="">        Thread.sleep(10);<br class="">      } catch (InterruptedException e) {<br class="">      }<br class="">      Main:sink += fibonacci(20);<br class="">    } finally {<br class="">      if (semaphore != null) {<br class="">        semaphore.release();<br class="">      }<br class="">    }<br class="">  }<br class=""><br class="">  private static int fibonacci(int n) {<br class="">    if (n == 0) {<br class="">      return 0;<br class="">    } else if (n == 1) {<br class="">      return 1;<br class="">    } else {<br class="">      return fibonacci(n - 1) + fibonacci(n - 2);<br class="">    }<br class="">  }<br class="">}<br class=""><br class="">final class VirtualThreadExecutorService extends AbstractExecutorService {<br class="">    private final ConcurrentLinkedQueue<Runnable> queue = new ConcurrentLinkedQueue<>();<br class="">    private final AtomicInteger count = new AtomicInteger(0);<br class="">    private final Semaphore semaphore;<br class="">    private volatile boolean shutdown = false;<br class="">    private volatile Thread waiter = null;<br class=""><br class="">    public VirtualThreadExecutorService(int maxConcurrency) {<br class="">        semaphore = new Semaphore(maxConcurrency);<br class="">    }<br class=""><br class="">    @Override<br class="">    public void shutdown() {<br class="">        shutdown=true;<br class="">    }<br class=""><br class="">    @Override<br class="">    public List<Runnable> shutdownNow() {<br class="">        throw new UnsupportedOperationException("Not supported yet.");<br class="">    }<br class=""><br class="">    @Override<br class="">    public boolean isShutdown() {<br class="">        return shutdown;<br class="">    }<br class=""><br class="">    @Override<br class="">    public boolean isTerminated() {<br class="">        return count.get()==0;<br 
class="">    }<br class=""><br class="">    @Override<br class="">    public boolean awaitTermination(long timeout, TimeUnit unit) throws InterruptedException {<br class="">        waiter = Thread.currentThread();<br class="">        while(count.get()>0) {<br class="">            LockSupport.park();<br class="">        }<br class="">        return true;<br class="">    }<br class=""><br class="">    private void maybeStartAnother() {<br class="">        if(!queue.isEmpty() && semaphore.tryAcquire()) {<br class="">            Runnable r = queue.poll();<br class="">            if(r!=null) {<br class="">                Thread.ofVirtual().start(r);<br class="">            } else {<br class="">                semaphore.release();<br class="">            }<br class="">        }<br class="">    }<br class=""><br class="">    @Override<br class="">    public void execute(Runnable command) {<br class="">        if(shutdown) throw new IllegalStateException("executor is shutdown");<br class="">        count.incrementAndGet();<br class="">        queue.add(() -> {<br class="">            command.run();<br class="">            semaphore.release();<br class="">            if(count.decrementAndGet()==0) {<br class="">                LockSupport.unpark(waiter);<br class="">            }<br class="">            maybeStartAnother();<br class="">        });<br class="">        maybeStartAnother();<br class="">    }<br class="">}<br class=""><br class=""></span></div><div class=""><span style="font-variant-ligatures: no-common-ligatures" class=""><br class=""></span></div></span></div><div><br class=""><blockquote type="cite" class=""><div class="">On May 30, 2024, at 9:50 AM, robert engels <<a href="mailto:rengels@ix.netcom.com" class="">rengels@ix.netcom.com</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><meta http-equiv="Content-Type" content="text/html; charset=utf-8" class=""><div style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: 
after-white-space;" class="">A minor rework would fix that - but the ThreadPerTaskExecutor is not public, so it would have been more work. The total task submission time is insignificant - that is not the issue. It only adds one entry per task to a concurrent linked queue.<div class=""><br class=""></div><div class="">On an ops per cpu basis my tests are easily outperforming, given that I only have 4 cores and the reporter has 64. The OP also states: "<font class="">A benchmark run on a 128 core machine is included below.” So, it may be 64, but it probably has dual hardware threads. Also, look at the user and system cpu times - those are more important than wall time if you are attempting to look at efficiency of schedulers. My reported times are way lower across every scenario - which is why it appears something else may be affecting the OP’s testing.</font></div><div class=""><div class=""><br class=""><blockquote type="cite" class=""><div class="">On May 30, 2024, at 9:08 AM, Attila Kelemen <<a href="mailto:attila.kelemen85@gmail.com" class="">attila.kelemen85@gmail.com</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div dir="ltr" class="">Your case 4 is unfair. It doesn't have the same behavior as the other 3, because you are pushing back (i.e., you are limiting the task producer thread) which is of course more efficient. Even though pushing back is normally necessary in real world code, you are just simply measuring something different, and I suppose the question is not how to make this code faster, but why case 2 scales so weirdly with the number of tasks compared to the other two scenarios.<div class=""><br class=""></div><div class="">Also, I might be misreading something, but you are not outperforming the reported numbers. You are about 4s slower. 
FYI: Liam had the CPU near the end of his email: "AMD Ryzen Threadripper PRO 3995WX" (64 real cores).</div></div><br class=""><div class="gmail_quote"><div dir="ltr" class="gmail_attr">robert engels <<a href="mailto:rengels@ix.netcom.com" class="">rengels@ix.netcom.com</a>> wrote (on Thu, May 30, 2024, 15:44):<br class=""></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="overflow-wrap: break-word;" class="">As a somewhat important aside, I am guessing based on the original reported timings, that it is not a real 128 core machine. It is most likely 128 virtual cores, and probably shared. I don’t think you should be performing CPU benchmarks in those environments. The fact that my 8 core machine (only 4 real cores) is outperforming a 128 core machine is not a good sign.<br class=""><div class=""><br class=""><blockquote type="cite" class=""><div class="">On May 30, 2024, at 8:35 AM, robert engels <<a href="mailto:rengels@ix.netcom.com" target="_blank" class="">rengels@ix.netcom.com</a>> wrote:</div><br class=""><div class=""><div style="overflow-wrap: break-word;" class="">Reworking the design (scenario 4) brings the pooling and new VT per task in line:<div class=""><br class=""></div><div class=""><div style="margin:0px;font-stretch:normal;font-size:14px;line-height:normal;font-family:Monaco;color:rgb(16,16,16);background-color:rgba(255,255,255,0.85)" class="">iMac:vt_test robertengels$ time java -Djdk.virtualThreadScheduler.parallelism=128 Main dummy 2 1000000</div></div><div class=""><div style="margin:0px;font-stretch:normal;font-size:14px;line-height:normal;font-family:Monaco;color:rgb(16,16,16);background-color:rgba(255,255,255,0.85);min-height:19px" class=""><span style="font-variant-ligatures:no-common-ligatures" class=""></span><br class=""></div><div 
style="margin:0px;font-stretch:normal;font-size:14px;line-height:normal;font-family:Monaco;color:rgb(16,16,16);background-color:rgba(255,255,255,0.85)" class=""><span style="font-variant-ligatures:no-common-ligatures" class="">real<span style="white-space:pre-wrap" class="">     </span>0m38.405s</span></div><div style="margin:0px;font-stretch:normal;font-size:14px;line-height:normal;font-family:Monaco;color:rgb(16,16,16);background-color:rgba(255,255,255,0.85)" class=""><span style="font-variant-ligatures:no-common-ligatures" class="">user<span style="white-space:pre-wrap" class="">   </span>3m42.240s</span></div><div style="margin:0px;font-stretch:normal;font-size:14px;line-height:normal;font-family:Monaco;color:rgb(16,16,16);background-color:rgba(255,255,255,0.85)" class=""><span style="font-variant-ligatures:no-common-ligatures" class="">sys<span style="white-space:pre-wrap" class="">    </span>0m12.976s</span></div><div style="margin:0px;font-stretch:normal;font-size:14px;line-height:normal;font-family:Monaco;color:rgb(16,16,16);background-color:rgba(255,255,255,0.85)" class=""><span style="font-variant-ligatures:no-common-ligatures" class=""><br class=""></span></div><div class=""><div style="margin:0px;font-stretch:normal;font-size:14px;line-height:normal;font-family:Monaco;color:rgb(16,16,16);background-color:rgba(255,255,255,0.85)" class=""><span style="font-variant-ligatures:no-common-ligatures" class="">iMac:vt_test robertengels$ time java -Djdk.virtualThreadScheduler.parallelism=128 Main dummy 3 1000000</span></div><div style="margin:0px;font-stretch:normal;font-size:14px;line-height:normal;font-family:Monaco;color:rgb(16,16,16);background-color:rgba(255,255,255,0.85);min-height:19px" class=""><span style="font-variant-ligatures:no-common-ligatures" class=""></span><br class=""></div><div style="margin:0px;font-stretch:normal;font-size:14px;line-height:normal;font-family:Monaco;color:rgb(16,16,16);background-color:rgba(255,255,255,0.85)" class=""><span 
style="font-variant-ligatures:no-common-ligatures" class="">real<span style="white-space:pre-wrap" class="">      </span>0m37.710s</span></div><div style="margin:0px;font-stretch:normal;font-size:14px;line-height:normal;font-family:Monaco;color:rgb(16,16,16);background-color:rgba(255,255,255,0.85)" class=""><span style="font-variant-ligatures:no-common-ligatures" class="">user<span style="white-space:pre-wrap" class="">   </span>2m28.901s</span></div><div style="margin:0px;font-stretch:normal;font-size:14px;line-height:normal;font-family:Monaco;color:rgb(16,16,16);background-color:rgba(255,255,255,0.85)" class=""><span style="font-variant-ligatures:no-common-ligatures" class="">sys<span style="white-space:pre-wrap" class="">    </span>0m3.427s</span></div><div style="margin:0px;font-stretch:normal;font-size:14px;line-height:normal;font-family:Monaco;color:rgb(16,16,16);background-color:rgba(255,255,255,0.85)" class=""><span style="font-variant-ligatures:no-common-ligatures" class=""><div class=""><div style="margin:0px;font-stretch:normal;line-height:normal" class=""><span style="font-variant-ligatures:no-common-ligatures" class=""><br class=""></span></div><div style="margin:0px;font-stretch:normal;line-height:normal" class=""><span style="font-variant-ligatures:no-common-ligatures" class="">iMac:vt_test robertengels$ time java -Djdk.virtualThreadScheduler.parallelism=128 Main dummy 4 1000000</span></div><div style="margin:0px;font-stretch:normal;line-height:normal;min-height:19px" class=""><span style="font-variant-ligatures:no-common-ligatures" class=""></span><br class=""></div><div style="margin:0px;font-stretch:normal;line-height:normal" class=""><span style="font-variant-ligatures:no-common-ligatures" class="">real<span style="white-space:pre-wrap" class="">   </span>0m38.441s</span></div><div style="margin:0px;font-stretch:normal;line-height:normal" class=""><span style="font-variant-ligatures:no-common-ligatures" class="">user<span 
style="white-space:pre-wrap" class=""> </span>2m39.547s</span></div><div style="margin:0px;font-stretch:normal;line-height:normal" class=""><span style="font-variant-ligatures:no-common-ligatures" class="">sys<span style="white-space:pre-wrap" class="">  </span>0m7.027s</span></div></div><div class=""></div></span></div><div class=""><br class=""></div><div class="">My machine only has 8 real cores, so I expect greater system cpu usage due to kernel thread scheduling. There is also the additional semaphore management.</div><div class=""><br class=""></div><div class="">In scenario 2, the system is starting and scheduling all 1 million threads, just to have them go park, waiting on the semaphore, then to be woken up and rescheduled. This is not insignificant. Scenario 4 avoids this.</div><div class=""><br class=""></div><div class="">import java.util.concurrent.ExecutorService;<br class="">import java.util.concurrent.Executors;<br class="">import java.util.concurrent.ForkJoinPool;<br class="">import java.util.concurrent.Semaphore;<br class="">import java.util.concurrent.ThreadFactory;<br class=""><br class="">public class Main {<br class=""><br class="">  private static Semaphore semaphore = null;<br class="">  private static int sink = 0;<br class=""><br class="">  public static void main(String[] args) {<br class="">    int strategy = 0;<br class="">    int parallelism = 600;<br class="">    int numTasks = 10000;<br class=""><br class="">    if (args.length > 1) {<br class="">      strategy = Integer.parseInt(args[1]);<br class="">    }<br class=""><br class="">    if (args.length > 2) {<br class="">      numTasks = Integer.parseInt(args[2]);<br class="">    }<br class=""><br class="">    ExecutorService executor;<br class="">    switch (strategy) {<br class="">      case 1 -> {<br class="">        executor = new ForkJoinPool(parallelism);<br class="">      }<br class="">      case 2 -> {<br class="">        executor = 
Executors.newVirtualThreadPerTaskExecutor();<br class="">        semaphore = new Semaphore(parallelism);<br class="">      }<br class="">      case 3 -> {<br class="">        executor = Executors.newFixedThreadPool(parallelism, Thread.ofVirtual().factory());<br class="">      }<br class="">      case 4 -> {<br class="">        var factorySem = new Semaphore(parallelism);<br class="">        ThreadFactory tf = (Runnable r) -> {<br class="">            try {<br class="">                factorySem.acquire();<br class="">            } catch (InterruptedException ex) {<br class="">                throw new IllegalStateException("interrupted");<br class="">            }<br class="">            return Thread.ofVirtual().unstarted(() -> <br class="">                {<br class="">                    try { <br class="">                        r.run(); <br class="">                    } finally { <br class="">                        factorySem.release();<br class="">                    }<br class="">                });<br class="">        };<br class="">        executor = Executors.newThreadPerTaskExecutor(tf);<br class="">      }<br class="">      default -> {<br class="">        throw new IllegalArgumentException();<br class="">      }<br class="">    }<br class=""><br class="">    try (executor) {<br class="">      for (var i = 0; i < numTasks; ++i) {<br class="">        executor.execute(Main::task);<br class="">      }<br class="">    }<br class="">  }<br class=""><br class="">  private static void task() {<br class="">    if (semaphore != null) {<br class="">      try {<br class="">        semaphore.acquire();<br class="">      } catch (InterruptedException e) {<br class="">        throw new IllegalStateException();<br class="">      }<br class="">    }<br class=""><br class="">    try {<br class="">      Main:sink += fibonacci(20);<br class="">      try {<br class="">        Thread.sleep(10);<br class="">      } catch (InterruptedException e) {<br class="">      }<br 
class="">      Main:sink += fibonacci(20);<br class="">      try {<br class="">        Thread.sleep(10);<br class="">      } catch (InterruptedException e) {<br class="">      }<br class="">      Main:sink += fibonacci(20);<br class="">    } finally {<br class="">      if (semaphore != null) {<br class="">        semaphore.release();<br class="">      }<br class="">    }<br class="">  }<br class=""><br class="">  private static int fibonacci(int n) {<br class="">    if (n == 0) {<br class="">      return 0;<br class="">    } else if (n == 1) {<br class="">      return 1;<br class="">    } else {<br class="">      return fibonacci(n - 1) + fibonacci(n - 2);<br class="">    }<br class="">  }<br class="">}<br class=""><br class=""></div><div class=""><br class=""><blockquote type="cite" class=""><div class="">On May 30, 2024, at 7:27 AM, Robert Engels <<a href="mailto:rengels@ix.netcom.com" target="_blank" class="">rengels@ix.netcom.com</a>> wrote:</div><br class=""><div class=""><div dir="auto" class=""><div dir="ltr" class=""></div><div dir="ltr" class="">I am going to dig in some more today - interesting problem. Is it maybe that in scenario 2 you are creating the 1M queue entries and 1M VTs at the same time? I don’t remember whether the queue entry is actually GCable before it completes, or whether it is retained until then in order to support error reporting. </div><div dir="ltr" class=""><br class=""><blockquote type="cite" class="">On May 30, 2024, at 7:20 AM, Attila Kelemen <<a href="mailto:attila.kelemen85@gmail.com" target="_blank" class="">attila.kelemen85@gmail.com</a>> wrote:<br class=""><br class=""></blockquote></div><blockquote type="cite" class=""><div dir="ltr" class=""><div dir="ltr" class="">They only create 600 VT, but they do create 1M queue entries for the executor, and the relative memory usage should be the same for the scenario of 10k tasks and the 1M (both in terms of bytes and number of objects). 
I would love to see the result of this experiment with the epsilon GC (given that the total memory usage should be manageable even for 1M tasks) to confirm or exclude the possibility of the GC scaling this noticeably poorly.</div><br class=""><div class="gmail_quote"><div dir="ltr" class="gmail_attr">Robert Engels <<a href="mailto:rengels@ix.netcom.com" target="_blank" class="">rengels@ix.netcom.com</a>> wrote (on Thu, May 30, 2024, 14:10):<br class=""></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="auto" class=""><div dir="ltr" class=""></div><div dir="ltr" class="">That is what I pointed out - in scenario 2 you are creating 1M VT up front. The other cases only create at most 600 VT or platform threads. </div><div dir="ltr" class=""><br class=""></div><div dir="ltr" class="">The peak memory usage in scenario 2 is much much higher. </div><div dir="ltr" class=""><br class=""><blockquote type="cite" class="">On May 30, 2024, at 7:07 AM, Attila Kelemen <<a href="mailto:attila.kelemen85@gmail.com" target="_blank" class="">attila.kelemen85@gmail.com</a>> wrote:<br class=""><br class=""></blockquote></div><blockquote type="cite" class=""><div dir="ltr" class=""><div dir="ltr" class=""><div class="gmail_quote"><div class="">The additional work the VT has to do is understandable. However, I don't see it explaining these measurements, because in the case of 10k tasks VT wins over FJP, but with 1M tasks VT loses to FJP. What is the source of the scaling difference, when there are still only 128 carriers, and 600 concurrent threads in both cases? If this was merely more work, then I would expect to see the same relative difference between FJP and VT when there are 10k tasks and when there are 1M tasks. 
Just a wild naive guess: Could the GC scale worse for that many VTs, or is that a stupid idea?</div><div class=""> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br class="">
If the concurrency for the virtual thread run is limited to the same <br class="">
value as the thread count in the thread pool runs then you are unlikely <br class="">
to see a benefit. The increased CPU time probably isn't too surprising <br class="">
either. In the two runs with thread pools, the N tasks are queued once. In <br class="">
the virtual thread run, the tasks for the N virtual threads may be <br class="">
queued up to 4 times: once for the initial submit, once while waiting for a <br class="">
semaphore permit, and twice for the two sleeps. Also, when CPU <br class="">
utilization is low (as I assume it is here), the FJP scan does tend <br class="">
to show up in profiles.<br class="">
<br class="">
Has Chi looked into increasing the concurrency so that it's not limited <br class="">
to 600? Concurrency may need to be limited at a finer grain in the "real world <br class="">
program", but perhaps not the number of threads.<br class="">
<br class="">
-Alan<br class="">
<br class="">
</blockquote></div></div>
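<div class=""><br class=""></div><div class="">For illustration only, the suggestion quoted above (limit concurrency at a finer grain rather than limiting the number of threads) might look roughly like the sketch below. It is not code from this thread: the class name FineGrainedLimit, the limit of 8 permits and the 1 ms sleep standing in for a scarce resource are all assumptions. Every task gets its own virtual thread and the semaphore guards only the section that touches the limited resource, so the CPU-bound part is never throttled.</div>

```java
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: unbounded virtual threads, bounded access to one resource.
final class FineGrainedLimit {
    // Assumed limit for the scarce resource only (e.g. a connection pool).
    static final int RESOURCE_PERMITS = 8;
    static final Semaphore resourcePermits = new Semaphore(RESOURCE_PERMITS);

    static final AtomicInteger inResource = new AtomicInteger();
    static final AtomicInteger maxInResource = new AtomicInteger();
    static final AtomicInteger completed = new AtomicInteger();
    static volatile int sink = 0;

    static int fibonacci(int n) {
        return n < 2 ? n : fibonacci(n - 1) + fibonacci(n - 2);
    }

    static void task() {
        // CPU-bound part: runs without holding any permit.
        sink += fibonacci(20);
        try {
            // Only the access to the limited resource is throttled.
            resourcePermits.acquire();
            try {
                int now = inResource.incrementAndGet();
                maxInResource.accumulateAndGet(now, Math::max);
                Thread.sleep(1); // stand-in for the real blocking call
            } finally {
                inResource.decrementAndGet();
                resourcePermits.release();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
        }
        sink += fibonacci(20);
        completed.incrementAndGet();
    }

    // Runs numTasks tasks, one virtual thread each, and returns how many completed.
    static int run(int numTasks) {
        completed.set(0);
        maxInResource.set(0);
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < numTasks; i++) {
                executor.execute(FineGrainedLimit::task);
            }
        } // close() waits for all submitted tasks to finish
        return completed.get();
    }

    public static void main(String[] args) {
        int done = run(10_000);
        System.out.println("completed=" + done + " maxInResource=" + maxInResource.get());
    }
}
```

<div class="">This needs JDK 21 or later. Unlike scenario 2, where each task holds a permit for its whole duration, a task here holds a permit only while it is inside the guarded section, so the number of live virtual threads is not capped at the permit count.</div>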
</div></blockquote></div></blockquote></div>
</div></blockquote></div></div></blockquote></div><br class=""></div></div></div></div></blockquote></div><br class=""></div></blockquote></div>
</div></blockquote></div><br class=""></div></div></div></blockquote></div><br class=""></div></body></html>