effectiveness of jdk.virtualThreadScheduler.maxPoolSize
Ron Pressler
ron.pressler at oracle.com
Mon Jan 9 12:58:56 UTC 2023
As I wrote to the mailing list some years ago, the virtual thread scheduler is designed in a way that will create this phenomenon, and it is done in this way because two things are unclear: 1. Whether or not this phenomenon is a problem, and 2. Whether time-sharing could help if it is. What is unclear is not whether such circumstances could hypothetically arise, but whether or not they actually do, and if so, at what frequency.
Let’s examine what would be required for your example to manifest as a problem that time-sharing could help with: the rate of slowCPURequest needs to be high enough to saturate the CPU for long enough to cause a significant increase in the latency of fastIORequest, and at the same time low enough not to destabilise the server so that requests start piling up (for the server to be stable, the average throughput must be exactly equal to the average request rate). If the rate of slowCPURequest is any higher or lower, time sharing won’t help.
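To make that stability condition concrete, here is a back-of-the-envelope sketch (the core count and per-task CPU cost below are made-up numbers, not taken from the thread):

```java
public class OperationalBand {
    public static void main(String[] args) {
        int cores = 8;                  // hypothetical number of carrier threads
        double cpuSecondsPerTask = 2.0; // hypothetical cost of one slowCPURequest

        // The maximum sustainable arrival rate of CPU-bound tasks:
        // above this, the queue grows without bound and the server destabilises.
        double maxRate = cores / cpuSecondsPerTask; // tasks per second
        System.out.println("Max sustainable rate: " + maxRate + " tasks/s");

        // Time-sharing can only matter in the narrow band just below this rate:
        // high enough that CPU tasks occupy all carriers for significant stretches,
        // yet low enough that average throughput still matches the request rate.
    }
}
```

With these assumed numbers, anything above 4 CPU-heavy tasks per second destabilises the server regardless of scheduling policy, and anything well below it leaves idle carriers for the fast I/O tasks; only rates near 4/s are in the band where time-sharing could change latencies.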
So the question is: how common is it for servers to run in that operational band, and how do such servers ensure that they can always maintain their throughput and don’t destabilise? This is the kind of information that will allow us to start working on time-sharing, and it requires a report, not any kind of convincing. We can’t work on a fix for problem X encountered while running server Y until we get a report saying “I’ve encountered problem X while running server Y”.
At this time, however, I would appreciate it if someone could explain why the point that we cannot fix a bug until it is reported is so hard to understand, given how many times I’ve had to repeat it. Even if we decided today that we’re dropping everything else and working on time-sharing in the scheduler because it is the most urgent thing to work on, we still couldn’t do it without the information required to know when the problem is fixed and we can move on to something else. If it’s hard to understand because you think it is obvious that this bug will occur, since the scenario I described above is common, then someone will report it soon enough; at that point we’ll have at least some of the relevant operational parameters and will be able to start considering solutions. Without that necessary information, it doesn’t matter how much we want to work on a fix; we simply can’t.
Now, while it won’t help us identify a bug, it might be interesting for you to at least simulate an *endless* stream of requests, and try different rates of the two kinds of operations (and add at least a bit more CPU usage in the fastIO case).
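One way that suggestion might be sketched (the periods, iteration counts, and run duration below are made-up knobs to turn, not recommendations):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class EndlessMixedLoad {
    public static void main(String[] args) throws Exception {
        var executor = Executors.newVirtualThreadPerTaskExecutor();
        var timer = Executors.newScheduledThreadPool(1);

        // Hypothetical rates: tune these to probe the operational band.
        long cpuPeriodMs = 500; // one CPU-heavy task every 500 ms
        long ioPeriodMs = 100;  // one fast I/O task every 100 ms

        // Endless stream of CPU-heavy tasks.
        timer.scheduleAtFixedRate(() -> executor.submit(() -> {
            long t = 0;
            for (int i = 0; i < 1_000_000_000; i++) t += 2; // CPU burn
            return t;
        }), 0, cpuPeriodMs, TimeUnit.MILLISECONDS);

        // Endless stream of fast I/O tasks, with a bit of CPU as suggested.
        timer.scheduleAtFixedRate(() -> executor.submit(() -> {
            var posted = Instant.now();
            long t = 0;
            for (int i = 0; i < 10_000_000; i++) t += 2; // small CPU component
            Thread.sleep(2);                             // simulated I/O
            System.out.println("fast IO latency: "
                + Duration.between(posted, Instant.now()));
            return t;
        }), 0, ioPeriodMs, TimeUnit.MILLISECONDS);

        Thread.sleep(10_000); // observe latencies for a while, then exit
        timer.shutdownNow();
        executor.shutdownNow();
    }
}
```

Sweeping cpuPeriodMs across runs, and watching how the printed I/O latencies behave as the CPU-task rate approaches saturation, is what would reveal whether (and where) the problematic band actually appears.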
— Ron
On 9 Jan 2023, at 12:04, Arnaud Masson <arnaud.masson at fr.ibm.com> wrote:
I doubt it will suddenly convince anyone, but I wrote a minimal example to illustrate the concrete problem.
With the classic thread executor, the normal requests (fastIORequest) complete quickly even during concurrent CPU requests (slowCPURequest), while using the default Loom executor they are stuck for some time.
Arnaud
    package org.example;

    import java.time.Duration;
    import java.time.Instant;
    import java.util.ArrayList;
    import java.util.Timer;
    import java.util.TimerTask;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class MixedServer {

        static final Instant t0 = Instant.now();

        static long slowCPURequest(String name, Instant posted) throws Exception {
            long t = 0;
            var n = 1_000_000_000;
            // slow CPU-only stuff: e.g. image codec, XML manipulation, collection iteration, ...
            for (int i = 0; i < n; i++) {
                t = t + 2;
            }
            var completed = Instant.now();
            System.out.println(Duration.between(t0, completed) + " - " + name
                + " completed (duration: " + Duration.between(posted, completed) + ")");
            return t;
        }

        static Object fastIORequest(Instant posted) throws Exception {
            // IO stuff, e.g. JDBC call
            // (but sleep has the same effect for this test, i.e. it blocks the thread)
            Thread.sleep(2);
            var completed = Instant.now();
            System.out.println(Duration.between(t0, completed) + " - fast IO task"
                + " completed (duration: " + Duration.between(posted, completed) + ")");
            return null;
        }

        public static void main(String[] args) throws Exception {
            var executor =
                Executors.newVirtualThreadPerTaskExecutor(); // unfair
                //Executors.newCachedThreadPool(); // fair
            var timer = new Timer();
            timer.schedule(new TimerTask() {
                @Override
                public void run() {
                    var posted = Instant.now();
                    executor.submit(() -> fastIORequest(posted));
                }
            }, 100, 100);
            var futures = new ArrayList<Future<?>>();
            for (int i = 0; i < 60; i++) {
                final var name = "Task_" + i;
                var posted = Instant.now();
                futures.add(executor.submit(() -> slowCPURequest(name, posted)));
            }
            for (Future<?> f : futures) {
                f.get();
            }
            System.out.println("--- done ---");
            executor.shutdownNow();
            timer.cancel();
        }
    }
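The commented-out newCachedThreadPool line above is the “fair” baseline in the example. As an aside (my sketch, not something proposed in the thread), an application-level workaround is to keep virtual threads for the cheap blocking I/O requests while confining CPU-heavy requests to a bounded platform-thread pool, so that they cannot occupy every carrier thread at once:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SplitExecutors {
    public static void main(String[] args) throws Exception {
        // Virtual threads for cheap, blocking I/O-style work.
        ExecutorService ioExecutor = Executors.newVirtualThreadPerTaskExecutor();

        // Bounded platform pool for CPU burns; the "-1" (a hypothetical
        // sizing choice, tune for the workload) leaves headroom for the rest.
        int cpuWorkers = Math.max(1, Runtime.getRuntime().availableProcessors() - 1);
        ExecutorService cpuExecutor = Executors.newFixedThreadPool(cpuWorkers);

        var fast = ioExecutor.submit(() -> {
            Thread.sleep(2); // simulated I/O
            return "io done";
        });
        var slow = cpuExecutor.submit(() -> {
            long t = 0;
            for (int i = 0; i < 100_000_000; i++) t += 2; // CPU burn
            return t;
        });

        System.out.println(fast.get() + ", cpu result: " + slow.get());
        ioExecutor.shutdown();
        cpuExecutor.shutdown();
    }
}
```

This sidesteps, rather than answers, the time-sharing question: it bounds CPU work by construction instead of relying on the scheduler to preempt it.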
Great, then once someone who runs CPU-heavy jobs that are appropriate for virtual threads presents a problem they’ve encountered that could be solved by some form of time-sharing, we’ll take a look. For the time being, we have a hypothesis that one or more problems could occur, and a further hypothesis that some class of solutions might fix or mitigate them. Unfortunately, these hypotheses are not presently actionable.
I don’t know what problem Go’s maintainers solved with time-sharing. Maybe they were just addressing Go’s lack of convenient access to a scheduler without time-sharing even for background processing tasks, a problem that Java doesn’t have; or perhaps they’ve managed to identify common server workloads that could be helped by time-sharing, in which case we’d love to see those workloads so that we can make sure our fix works.
I am not at all resistant to adding time-sharing to the virtual thread scheduler. I am resistant to fixing bugs that have not been reported. I have said many times, and not just on the mailing lists, that the best and often only way to get some change made to Java is to report a problem. You might have noticed that all JEPs start with a motivation section that attempts to clearly present a problem that’s encountered by Java’s users (and sometimes maintainers) and analyse its relative severity to justify the proposed solution. That is usually the JEP’s most important section (and the section we typically spend the most time writing) because the most important thing is to understand the problem you’re trying to tackle. Every change has a cost. A feature might add overhead that harms workloads that don’t benefit from it, and it certainly has an opportunity cost. Neither of these is disqualifying, but we simply cannot judge the pros and cons of doing something until we can weigh one problem against another, and we can’t even get started on this process until we have a problem in front of us.
We *have* been presented with a problem that some specific kind of time-sharing can solve (postponing a batch job that’s consuming resources so that it runs at a much later time), and it is one of the reasons that custom schedulers, which will be able to employ it, are on our roadmap. It is certainly possible that the lack of time-sharing causes problems that need addressing in the default (currently only) virtual-thread scheduler; I am not at all dismissing that possibility. But there’s not much we can do until those problems are actually reported to us, which would let us learn when those problems occur, how frequently, and what kind of time-sharing can assist in which circumstances and to what degree. There’s no point in trying to convince us of something we are already convinced of, namely that the possibility exists that some virtual thread workloads could be significantly helped by time-sharing. In fact, I’ve mentioned that possibility on this mailing list a few times already.
If the problems are common, now that more people are giving virtual threads a try, I expect they will be reported by someone soon enough and we could start the process of tackling them.
— Ron
On 9 Jan 2023, at 08:48, Sam Pullara <spullara at gmail.com> wrote:
Ron, I think you are being purposefully obtuse by not recognizing that some folks are going to run high-CPU jobs in vthreads. The proof is that the folks using Go already encountered this and fixed it.
On Mon, Jan 9, 2023 at 12:46 AM Arnaud Masson <arnaud.masson at fr.ibm.com> wrote:
Side note : it seems “more” preemptive time sharing was added for goroutines in Go 1.14 to avoid the kind of scheduling starvation we discussed:
https://medium.com/a-journey-with-go/go-asynchronous-preemption-b5194227371c
Thanks
Arnaud