Virtual Threads performance
Attila Kelemen
attila.kelemen85 at gmail.com
Mon Apr 3 08:48:44 UTC 2023
Hi,
I have originally sent this to jdk-dev (see quoted part at the end),
but Alan said loom-dev is a better place.
Anyway, adding to the emails below: What I was mostly looking at is
the case where there is a 10s wait, and 100k threads, because that is
where Loom performs the worst compared to Kotlin coroutines.
I did some additional checks beyond the original author's. For one, I
changed the thread submission to be parallel, as below (this of course
slightly increases the number of threads involved):
```
private fun loomParallelSubmit(delayCount: Int, blackhole: Blackhole) {
    if (delayCount < PARALLELIZATION_SPLIT_LIMIT) {
        loomSequentialSubmit(delayCount, blackhole)
    } else {
        val delayCountPart1 = delayCount / 2
        val delayCountPart2 = delayCount - delayCountPart1
        StructuredTaskScope.ShutdownOnFailure().use { scope ->
            scope.fork { loomParallelSubmit(delayCountPart1, blackhole) }
            scope.fork { loomParallelSubmit(delayCountPart2, blackhole) }
            scope.join()
            scope.throwIfFailed()
        }
    }
}
```
where `loomSequentialSubmit` is the submission as it was before. I have
also tried it without `StructuredTaskScope`, which does not make a
noticeable difference.
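For reference, here is a minimal sketch of what the sequential submission might look like. This is my own Java rendering, not the benchmark's actual Kotlin code: the class name, the parameters, and the `AtomicInteger` sink standing in for JMH's `Blackhole` are all placeholders (the real run uses ~100k threads and a 10 s sleep).

```java
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

class SequentialSubmit {
    // Fork `delayCount` virtual threads, each doing one timed sleep,
    // then wait for all of them to finish.
    static void run(int delayCount, long sleepMillis, AtomicInteger sink) {
        // try-with-resources: close() blocks until all submitted tasks finish
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < delayCount; i++) {
                executor.submit(() -> {
                    try {
                        Thread.sleep(sleepMillis); // the timed wait per thread
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    sink.incrementAndGet(); // stand-in for JMH's Blackhole
                });
            }
        }
    }
}
```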
Here are some other things I have tried:
1. Instead of `sleep`, I made the tasks wait on the same
`CountDownLatch`, releasing it from a single thread after waiting 10 ms
on that thread (so only a single thread is in a timed wait). Lacking an
equivalent in Kotlin, I used a pre-acquired `Semaphore` in Kotlin to
achieve the same. This does not change the behaviour significantly: it
reduces the coroutines' advantage somewhat, but the difference remains
huge.
2. If, instead of only waiting, I add some `consumeCPU(1000)` before
the wait, it changes everything, and Loom significantly outperforms
coroutines (by a factor of about 1.5).
3. If, instead of waiting, I let the tasks do some `consumeCPU(10000)`
(and no waiting at all), then, for me, Loom outperformed coroutines
by a factor of 4.
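Variant 1 above can be sketched as follows. This is an assumed shape, not the exact benchmark code: the names and parameters are placeholders, and an `AtomicInteger` again stands in for the JMH `Blackhole`.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

class LatchVariant {
    // All tasks block untimed on one shared latch; a single extra task
    // performs the only timed wait before releasing them.
    static int run(int taskCount, long releaseAfterMillis) {
        var latch = new CountDownLatch(1);
        var done = new AtomicInteger();
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < taskCount; i++) {
                executor.submit(() -> {
                    try {
                        latch.await(); // untimed wait: no per-thread timer
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    done.incrementAndGet();
                });
            }
            // The single timed wait, releasing everyone at once.
            executor.submit(() -> {
                try {
                    Thread.sleep(releaseAfterMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                latch.countDown();
            });
        } // close() blocks until every task has been released and finished
        return done.get();
    }
}
```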
An additional interesting observation is that in each case Loom has a
much higher variance in performance than alternative approaches. Even
when Loom is faster, the variance is 3-5 times larger.
Alan Bateman <Alan.Bateman at oracle.com> wrote (on Mon, 3 Apr 2023, 9:38):
>
> On 01/04/2023 21:00, Attila Kelemen wrote:
> > Hi,
> >
> > I bumped into this article (actually a while ago, but only now did I
> > start to experiment with the performance myself):
> > https://apiumhub.com/tech-blog-barcelona/project-loom-and-kotlin-some-experiments
> > comparing Kotlin coroutines vs. virtual threads.
> >
> > Though I don't think the article itself is a fair measure of Loom's
> > performance (given that most applications do not involve 100k threads
> > sleeping as their main purpose). Especially because when I did a
> > comparison with tasks actually doing some CPU work, then the results
> > were way in favor of Loom.
> >
> > Still, I find it strange that Kotlin coroutines can heavily outperform
> > Loom in the case when there is almost nothing but waiting (especially
> > in the case of timed waiting). Can anyone shed a light on why this can
> > happen? Is this because of a particular trade-off choice?
>
> From a quick read, this seems to be more about Thread.sleep vs.
> coroutine delay. Thread.sleep(0) reschedules the current thread whereas
> the documentation for delay(0) suggests it's a no-op. For the >0 case,
> there is further work needed to improve virtual Thread.sleep. Right now
> it is based on a STPE and there can be contention on its work queue. We
> have done some early experiments with hierarchical timing wheels that
> show promise. There have also been suggestions to use the OS timer
> facility. So improvements should come in time, it's just that other
> areas have been higher priority to date.
>
> Best to follow-up on loom-dev rather than here, esp. if you have your
> own benchmarks to discuss.
>
> -Alan.