Thread scheduling imbalance / starvation

Martin Traverso mtraverso at gmail.com
Sun Apr 16 18:05:59 UTC 2023


Hi Robert and Ron,

Thanks for your replies.

> Why would you expect or want this to be balanced? If the threads do IO it
> should naturally balance. If they are all cpu bound, balancing is needed
> for fairness of request handling.

I would expect it to be balanced because:
* There's a fair semaphore that limits concurrency
* There are more virtual threads than semaphore permits
* Some of the threads are "presumably" blocked in the call to acquire() and
would take turns to unblock, fairly

From what Ron described above, my understanding is that some threads never
even get a chance to start: the first N threads keep looping with no
contention (acquiring and releasing the semaphore), thereby preventing the
second N threads from ever being scheduled.

If I add a call to Thread.yield() or Thread.sleep() for 1 ns just *once*,
before the loop and while holding the semaphore, then it works as I would
expect. My hypothesis is that this causes the scheduler to pick one of the
threads that has been waiting to start; that thread then makes it to the
acquire() call, and at some point the acquire becomes contended. The fair
semaphore does its job from then on.

    Thread.ofVirtual().start(() -> {
        // One-time yield while holding a permit (hypothesis: this lets a
        // not-yet-started thread run and reach acquire(), so subsequent
        // acquires become contended and the fair semaphore takes over)
        semaphore.acquireUninterruptibly();
        Thread.yield();
        semaphore.release();

        while (true) {
            semaphore.acquireUninterruptibly();
            counter.incrementAndGet();
            semaphore.release();
        }
    });

Ron, let me describe our use case. I hope it helps inform future directions
for this feature.

Trino (https://trino.io) is a distributed SQL engine. Queries are
decomposed into tasks that run in a cluster of workers. Each task performs
a series of transformations (filtering, computing new columns,
aggregations, etc). In an ideal world, we would model this as a series of
nested loops, similar to how you'd implement the equivalent of a Java
Stream pipeline using traditional imperative code. The problem with this
approach is that these tasks can take a long time to complete. We need to
be able to handle more tasks than there are available processors and share
time among them.
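
To make the "nested loops" idea concrete, here is a minimal sketch (names
and operators are made up for illustration, not actual Trino code) of a task
written as a plain imperative loop, run on a virtual thread so it can block
freely on IO:

    import java.util.Iterator;
    import java.util.List;

    public class NestedLoopTaskSketch
    {
        record Row(long orderKey, double price) {}

        interface RowSource
        {
            // May block on IO; returns null when the input is exhausted
            Row next() throws InterruptedException;
        }

        // One task: filter rows, compute a derived value, and aggregate it,
        // all as a single straight-line loop
        static double runTask(RowSource source) throws InterruptedException
        {
            double total = 0;
            for (Row row = source.next(); row != null; row = source.next()) {
                if (row.price() > 100) {                // filter
                    total += row.price() * 1.1;         // project + aggregate
                }
            }
            return total;
        }

        public static void main(String[] args) throws InterruptedException
        {
            List<Row> rows = List.of(new Row(1, 50), new Row(2, 200), new Row(3, 300));
            Iterator<Row> iterator = rows.iterator();
            RowSource source = () -> iterator.hasNext() ? iterator.next() : null;

            Thread task = Thread.ofVirtual().start(() -> {
                try {
                    System.out.println(runTask(source));
                }
                catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            task.join();
        }
    }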

To share time among tasks, we implemented a cooperative multitasking
framework, where each of the tasks does a bit of work and then relinquishes
control. A scheduler within each of the workers then decides which task to
run next based on a prioritization scheme. This is all very unnatural and
complex, and it prevents certain optimizations by forcing the actions within
the tasks to have explicit boundaries, materialize intermediate data
structures before giving up control, etc.
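
To give a rough idea of the shape this takes, here is a simplified sketch
(hypothetical names, not the real Trino interfaces): each task exposes a
process() step that does a bounded amount of work and then returns control,
and the worker's scheduler picks the next task by priority:

    import java.util.List;
    import java.util.PriorityQueue;

    public class CooperativeSchedulerSketch
    {
        interface Task
        {
            // Do one bounded quantum of work; return true when the task is
            // finished. Intermediate state must be materialized before returning.
            boolean process();

            // Scheduling key, e.g. accumulated scheduled time; lower runs first
            long priority();
        }

        static void runAll(List<Task> tasks)
        {
            PriorityQueue<Task> queue = new PriorityQueue<>(
                    (a, b) -> Long.compare(a.priority(), b.priority()));
            queue.addAll(tasks);

            while (!queue.isEmpty()) {
                Task task = queue.poll();        // pick the next task by priority
                boolean done = task.process();   // must return after a small quantum
                if (!done) {
                    queue.add(task);             // re-queue unfinished tasks
                }
            }
        }
    }

Every operator has to be written so that process() can stop at an arbitrary
boundary and resume later, which is where the awkward structure and forced
materialization come from.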

We're hoping that virtual threads will allow us to simplify all of this.
We're also hoping that someday we'll be able to control the scheduling
policy so that we can implement our own prioritization scheme -- although we
have some ideas on how to work around this limitation for now.

- Martin



On Sun, Apr 16, 2023 at 7:55 AM Robert Engels <rengels at ix.netcom.com> wrote:

> Hi Martin,
>
> Why would you expect or want this to be balanced? If the threads do IO it
> should naturally balance. If they are all cpu bound, balancing is needed
> for fairness of request handling.
>
> This has been brought up a few times. Small tasks can be blocked for a
> long time behind long cpu bound tasks. The only solution is to periodically
> yield() those tasks.
>
> On Apr 16, 2023, at 9:40 AM, Ron Pressler <ron.pressler at oracle.com> wrote:
>
>  Hi.
>
> What you’re seeing is the result of the virtual thread scheduler not
> employing time sharing. That is because we have yet to identify workloads,
> especially those that are best served by virtual threads — namely, servers
> — that can benefit from it. Once we find such workloads we’ll be able to
> utilise time sharing.
>
> In your example, the scheduler is able to keep all threads busy with work
> without blocking on the semaphore by just running some threads.
>
> — Ron
>
> On 16 Apr 2023, at 06:30, Martin Traverso <mtraverso at gmail.com> wrote:
>
> Hi,
>
> First of all, I'd like to thank you for this feature! We've been eagerly
> awaiting it in the Trino project and we believe it will help us
> dramatically simplify many parts of the codebase.
>
> I've been playing around with virtual threads and I've noticed some odd
> behaviors. Given the following code:
>
>     import java.util.ArrayList;
>     import java.util.List;
>     import java.util.concurrent.Semaphore;
>     import java.util.concurrent.atomic.AtomicLong;
>
>     public class Test
>     {
>         public static void main(String[] args)
>                 throws InterruptedException
>         {
>             int processors = Runtime.getRuntime().availableProcessors();
>
>             Semaphore semaphore = new Semaphore(processors, true);
>             List<AtomicLong> counters = new ArrayList<>();
>             for (int i = 0; i < 2 * processors; i++) {
>                 AtomicLong counter = new AtomicLong();
>                 counters.add(counter);
>                 Thread.ofVirtual().start(() -> {
>                     while (true) {
>                         semaphore.acquireUninterruptibly();
>                         counter.incrementAndGet();
>                         semaphore.release();
>                     }
>                 });
>             }
>
>             Thread.sleep(10_000);
>
>             counters.stream()
>                     .map(AtomicLong::get)
>                     .sorted()
>                     .forEach(System.out::println);
>         }
>     }
>
> I would expect the counts to be approximately equal, but I'm getting the
> following result:
>
>     0
>     0
>     0
>     0
>     0
>     0
>     0
>     0
>     0
>     0
>     2435341
>     2448274
>     2466202
>     2497258
>     2539030
>     2572744
>     2592871
>     2611658
>     2651392
>     2657913
>
> If I change the number of permits for the semaphore to a value smaller
> than the number of processors, then the results come out as expected. It
> also works as expected if I change the core loop to make a call to
> Thread.yield() on the first iteration:
>
>     while (true) {
>         semaphore.acquireUninterruptibly();
>         if (counter.incrementAndGet() == 1) {
>             Thread.yield();
>         }
>         semaphore.release();
>     }
>
>
> If I place a call to Thread.yield() after the semaphore.release() call,
> then all the threads make some progress, but the values are still
> unbalanced:
>
>     while (true) {
>         semaphore.acquireUninterruptibly();
>         counter.incrementAndGet();
>         semaphore.release();
>         Thread.yield();
>     }
>
>     196257
>     196257
>     196258
>     196260
>     196260
>     196260
>     196261
>     196261
>     401737
>     401740
>     401744
>     401757
>     1644985
>     1651301
>     1677466
>     1683009
>     1694577
>     1702710
>     1710970
>     1843037
>
> I'm running the following version of the JDK on a MacBook Pro with an M1
> Max CPU:
>
> openjdk version "20" 2023-03-21
> OpenJDK Runtime Environment Zulu20.28+85-CA (build 20+36)
> OpenJDK 64-Bit Server VM Zulu20.28+85-CA (build 20+36, mixed mode, sharing)
>
> I'm not sure if this is a bug or if I'm misunderstanding how virtual
> threads are supposed to work. Any help or clarification would be greatly
> appreciated!
>
> Thanks!
> - Martin
>
> ----
> Martin Traverso
> Co-founder @ Trino Software Foundation, Co-creator of Presto and Trino (
> https://trino.io)
>
>
>

