Performance of pooling virtual threads vs. semaphores

robert engels rengels at ix.netcom.com
Thu May 30 13:35:52 UTC 2024


Reworking the design (scenario 4) brings the pooled and new-VT-per-task approaches in line:

iMac:vt_test robertengels$ time java -Djdk.virtualThreadScheduler.parallelism=128 Main dummy 2 1000000

real	0m38.405s
user	3m42.240s
sys	0m12.976s

iMac:vt_test robertengels$ time java -Djdk.virtualThreadScheduler.parallelism=128 Main dummy 3 1000000

real	0m37.710s
user	2m28.901s
sys	0m3.427s

iMac:vt_test robertengels$ time java -Djdk.virtualThreadScheduler.parallelism=128 Main dummy 4 1000000

real	0m38.441s
user	2m39.547s
sys	0m7.027s

My machine only has 8 real cores, so I expect greater system CPU usage due to kernel thread scheduling. There is also the additional overhead of semaphore management.

In scenario 2, the system starts and schedules all 1 million threads, only to have them immediately park waiting on the semaphore and later be woken and rescheduled. That is not an insignificant amount of work. Scenario 4 avoids it by never starting a thread until a permit is available.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.Semaphore;
import java.util.concurrent.ThreadFactory;

public class Main {

  // Gate used by scenario 2 to cap the number of concurrently running tasks.
  private static Semaphore semaphore = null;
  // Accumulates results so the fibonacci work is not optimized away.
  private static int sink = 0;

  // Usage: Main <ignored> <strategy 1-4> [numTasks]
  public static void main(String[] args) {
    int strategy = 0;
    int parallelism = 600;   // maximum concurrent tasks / pool size
    int numTasks = 10000;

    if (args.length > 1) {
      strategy = Integer.parseInt(args[1]);
    }

    if (args.length > 2) {
      numTasks = Integer.parseInt(args[2]);
    }

    ExecutorService executor;
    switch (strategy) {
      case 1 -> {
        // Baseline: platform threads via a ForkJoinPool sized to `parallelism`.
        executor = new ForkJoinPool(parallelism);
      }
      case 2 -> {
        // New virtual thread per task; concurrency is capped inside task() by the
        // semaphore, so all numTasks threads are started but most park immediately.
        executor = Executors.newVirtualThreadPerTaskExecutor();
        semaphore = new Semaphore(parallelism);
      }
      case 3 -> {
        // "Pooled" virtual threads: a fixed-size pool whose workers are virtual threads.
        executor = Executors.newFixedThreadPool(parallelism, Thread.ofVirtual().factory());
      }
      case 4 -> {
        // New virtual thread per task, but the factory blocks until a permit is free,
        // so at most `parallelism` threads exist at any time. The permit is released
        // when the task completes.
        var factorySem = new Semaphore(parallelism);
        ThreadFactory tf = (Runnable r) -> {
          try {
            factorySem.acquire();
          } catch (InterruptedException ex) {
            throw new IllegalStateException("interrupted");
          }
          return Thread.ofVirtual().unstarted(() -> {
            try {
              r.run();
            } finally {
              factorySem.release();
            }
          });
        };
        executor = Executors.newThreadPerTaskExecutor(tf);
      }
      default -> {
        throw new IllegalArgumentException();
      }
    }

    // ExecutorService is AutoCloseable (since Java 19); close() waits for all tasks to finish.
    try (executor) {
      for (var i = 0; i < numTasks; ++i) {
        executor.execute(Main::task);
      }
    }
  }

  private static void task() {
    // Scenario 2 only: block until a permit is available, capping concurrency at `parallelism`.
    if (semaphore != null) {
      try {
        semaphore.acquire();
      } catch (InterruptedException e) {
        throw new IllegalStateException();
      }
    }

    try {
      // Simulate a request: CPU-bound work interleaved with two blocking waits.
      sink += fibonacci(20);
      try {
        Thread.sleep(10);
      } catch (InterruptedException e) {
        // ignore; benchmark only
      }
      sink += fibonacci(20);
      try {
        Thread.sleep(10);
      } catch (InterruptedException e) {
        // ignore; benchmark only
      }
      sink += fibonacci(20);
    } finally {
      if (semaphore != null) {
        semaphore.release();
      }
    }
  }

  private static int fibonacci(int n) {
    if (n == 0) {
      return 0;
    } else if (n == 1) {
      return 1;
    } else {
      return fibonacci(n - 1) + fibonacci(n - 2);
    }
  }
}
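
For the Epsilon GC experiment suggested below, the scenario 2 run could be repeated roughly like this (the -Xmx value is only a guess on my part - Epsilon never reclaims memory, so the heap has to be big enough to hold everything the run allocates):

iMac:vt_test robertengels$ time java -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -Xmx16g -Djdk.virtualThreadScheduler.parallelism=128 Main dummy 2 1000000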


> On May 30, 2024, at 7:27 AM, Robert Engels <rengels at ix.netcom.com> wrote:
> 
> I am going to dig in some more today - interesting problem. Is it maybe that in scenario 2 you are creating the 1M queue entries and 1M VT at the same time? I don’t remember if the queue entry is actually GCable until it completes, in order to support error reporting. 
> 
>> On May 30, 2024, at 7:20 AM, Attila Kelemen <attila.kelemen85 at gmail.com> wrote:
>> 
>> 
>> They only create 600 VT, but they do create 1M queue entries for the executor, and the relative memory usage should be the same for the scenario of 10k tasks and the 1M (both in terms of bytes and number of objects). I would love to see the result of this experiment with the epsilon GC (given that the total memory usage should be manageable even for 1M tasks) to confirm or exclude the possibility of the GC scaling this noticeably poorly.
>> 
>> Robert Engels <rengels at ix.netcom.com> wrote (on Thu, May 30, 2024, 14:10):
>> That is what I pointed out - in scenario 2 you are creating 1M VT up front. The other cases only create at most 600 VT or platform threads. 
>> 
>> The peak memory usage in scenario 2 is much much higher. 
>> 
>>> On May 30, 2024, at 7:07 AM, Attila Kelemen <attila.kelemen85 at gmail.com> wrote:
>>> 
>>> 
>>> The additional work the VT has to do is understandable. However, I don't see it explaining these measurements, because with 10k tasks VT wins over FJP, but with 1M tasks VT loses to FJP. What is the source of the scaling difference, when there are still only 128 carriers and 600 concurrent threads in both cases? If this were merely more work, then I would expect to see the same relative difference between FJP and VT with 10k tasks and with 1M tasks. Just a wild naive guess: could the GC scale worse for that many VTs, or is that a stupid idea?
>>>  
>>> 
>>> If the concurrency for the virtual thread run is limited to the same 
>>> value as the thread count in the thread pool runs, then you are unlikely 
>>> to see a benefit. The increased CPU time probably isn't too surprising 
>>> either. In the two runs with threads, the N tasks are queued once. In 
>>> the virtual thread run, the tasks for the N virtual threads may be 
>>> queued up to 4 times: once for the initial submit, once waiting for a 
>>> semaphore permit, and twice for the two sleeps. Also, when CPU 
>>> utilization is low (as I assume it is here), the FJP scan does tend 
>>> to show up in profiles.
>>> 
>>> Has Chi looked into increasing the concurrency so that it's not limited 
>>> to 600? Concurrency may need to be limited at a finer grain in the "real 
>>> world program", but maybe not by the number of threads.
>>> 
>>> -Alan
>>> 
