Virtual Threads performance

Sergey Kuksenko sergey.kuksenko at oracle.com
Wed Apr 5 01:08:29 UTC 2023


Hi All,

I've analyzed the benchmarks shown in the article, and I want to say that the observed performance difference has nothing to do with how Thread.sleep() is implemented in Loom.

As a result, I'd like to say that the quote "...virtual thread mechanism begins to perform notably slower than Kotlin's coroutines ..." should be expanded by adding "on this particular microbenchmark".

I took the heaviest case, with the largest Kotlin vs Loom difference: 100 ms delay + 100000 tasks.
I was able to reproduce the results locally on my machine. Here they are:

Benchmark                    (delay)  (repeat)  Mode  Cnt    Score    Error  Units
LoomVsCoroutines.coroutines      100    100000  avgt    5  222.202 ±  1.638  ms/op
LoomVsCoroutines.hybrid          100    100000  avgt    5  317.583 ± 18.985  ms/op
LoomVsCoroutines.loom            100    100000  avgt    5  361.026 ± 85.797  ms/op

Loom is slower than the hybrid approach and much slower than pure Kotlin coroutines.
Looking inside, I found two reasons for this difference.

1. The main reason is how Kotlin schedules task execution with the default dispatcher.

I have to admit I don't know Kotlin's internals; I got all the details from observing executions. The fact is that there is no concurrent execution when "LoomVsCoroutines.coroutines" is running. All actions run inside a single benchmark platform thread (with "kotlinx.coroutines.BlockingCoroutine.joinBlocking()" as the call stack root). At the same time, "loom" and "hybrid" truly schedule tasks using "virtualThreadPerTaskExecutor".
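For reference, the "loom" side boils down to a shape like the following. This is a minimal sketch, assuming JDK 21+; the class and method names are mine, not the benchmark's. Because Thread.sleep() unmounts a virtual thread from its carrier, all the sleeps overlap rather than serialize:

```java
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class VirtualSleepDemo {

    /** Submits {@code tasks} sleeping tasks to a virtual-thread-per-task
     *  executor and returns the total wall-clock time in milliseconds. */
    static long runSleeps(int tasks, long delayMs) {
        long start = System.nanoTime();
        // Each task gets its own virtual thread; Thread.sleep() unmounts
        // the virtual thread from its carrier, so the sleeps overlap.
        try (ExecutorService exec = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < tasks; i++) {
                exec.submit(() -> {
                    try {
                        Thread.sleep(Duration.ofMillis(delayMs));
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
        } // close() waits for all submitted tasks to complete
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // 1000 tasks sleeping 100 ms each would take 100 s serially,
        // but with overlapping sleeps the total stays near one delay.
        System.out.println("elapsed ms: " + runSleeps(1_000, 100));
    }
}
```

The "hybrid" variant has the same shape with a coroutine body inside each virtual thread, while the "coroutines" variant, as observed above, never leaves the single blocking thread.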

Let's make a simple modification to the benchmark. I added a bit of CPU work: "Blackhole.consumeCPU(1000)" inside the tasks of all 3 benchmarks. consumeCPU(1000) is a small operation that takes ~3800 nanoseconds on my machine. That is 0.0038% of the delay time (100 ms).
Here are the results:

Benchmark                     (delay)  (repeat)  Mode  Cnt    Score    Error  Units
LoomVsCoroutines2.coroutines      100    100000  avgt    5  610.809 ±  1.156  ms/op
LoomVsCoroutines2.hybrid          100    100000  avgt    5  337.983 ± 10.645  ms/op
LoomVsCoroutines2.loom            100    100000  avgt    5  375.073 ± 19.476  ms/op

Now "loom" and "hybrid" are almost twice as fast as Kotlin's default dispatcher. Having something like a "sameThreadExecutor" as the default dispatcher makes sense when we are doing nothing, but even a small piece of actual work is better executed in parallel.
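For readers without JMH at hand, the consumeCPU(1000) payload can be approximated with a sketch like this. consumeCpuLike is a hypothetical stand-in, not the JMH implementation; the point is just a fixed, non-optimizable amount of CPU work per call:

```java
public class CpuBurn {
    static volatile long sink; // volatile sink so the JIT cannot drop the loop

    /** Hypothetical stand-in for JMH's Blackhole.consumeCPU(tokens):
     *  burns a fixed amount of CPU work per call. Not the JMH code. */
    static void consumeCpuLike(long tokens) {
        long t = System.nanoTime(); // unpredictable seed
        for (long i = 0; i < tokens; i++) {
            t = t * 0x5DEECE66DL + 0xBL; // cheap LCG step per token
        }
        sink = t; // publish the result so the loop has an observable effect
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        consumeCpuLike(1_000);
        System.out.println("rough ns per call: " + (System.nanoTime() - start));
    }
}
```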

I did a heavier test, where CPU activity takes 4% of the delay time. The theoretical minimum for my 8-core machine is ~51 seconds per op; "loom" and "hybrid" took ~52 seconds each, and "coroutines" took ~390 seconds.
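The theoretical minimum above is simple arithmetic: 4% of the 100 ms delay is 4 ms of CPU per task, the CPU work divides across the cores, and the delays all overlap with it. A sketch of the calculation (the method is mine, for illustration):

```java
public class TheoreticalMin {

    /** Lower bound in seconds for {@code tasks} tasks with the given
     *  per-task CPU cost spread over {@code cores} cores, plus one
     *  delay that overlaps the rest of the work. */
    static double minSeconds(int tasks, double cpuPerTaskMs, int cores, double delayMs) {
        return (tasks * cpuPerTaskMs / cores + delayMs) / 1000.0;
    }

    public static void main(String[] args) {
        // 100000 tasks * 4 ms CPU / 8 cores + one 100 ms delay:
        double s = minSeconds(100_000, 4.0, 8, 100.0);
        System.out.println(s + " s"); // ~50.1 s, matching the ~51 s quoted above
    }
}
```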

2. The second reason has a smaller impact on the benchmark discussed in the article than the first one, but from my point of view it's more important in general especially when we are trying to compare Loom's performance vs something else.

It's a fundamental difference between the Kotlin (or C#) and Loom (or Go) approaches: how suspend/resume or mount/unmount is done, and how we detect the "cut point" where a coroutine (or virtual thread) should be removed from execution.
Kotlin (and C#) use a compile-time (static) approach: all such cut-points are created by developers and known at compile time.
Loom (and Go) use a run-time (dynamic) approach: cut-points are detected at runtime when actual blocking/contention happens.

Compile-time approach pros: some work can be performed at compile time; the "continuation" structure is known at compile time and generated for each cut-point, so the execution state can be easily saved/restored.
Compile-time approach cons: annoying function coloring; blocking may happen if we don't properly wrap a blocking function into a suspendable one.
Run-time approach pros: simple code (as before Loom); no need to worry about cut-points.
Run-time approach cons: complex mount/unmount operation - we need to be able to save/restore an almost arbitrary stack.
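The run-time approach's "no function coloring" property can be seen in a toy sketch (assuming JDK 21+; class and method names are mine): the very same blocking method runs unchanged on a platform thread and on a virtual thread, and only the runtime differs in how the sleep is handled.

```java
public class NoColoringDemo {

    // Plain blocking code: no suspend keyword, no special return type.
    static void blockingWork() {
        try {
            Thread.sleep(10); // the runtime decides at this point whether to unmount
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Runs the same task on a platform thread and on a virtual thread. */
    static boolean runOnBoth() throws InterruptedException {
        Thread platform = Thread.ofPlatform().start(NoColoringDemo::blockingWork); // parks an OS thread
        Thread virtual  = Thread.ofVirtual().start(NoColoringDemo::blockingWork);  // unmounts during sleep
        platform.join();
        virtual.join();
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("both finished: " + runOnBoth());
    }
}
```

In the static approach the equivalent code would need two differently "colored" variants, one suspendable and one blocking.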

Such a difference opens the way to "goal-aware" benchmarking.
If our goal is to show that the compile-time approach is faster, we can write a benchmark where all cut-points are perfectly described and actually happen at run-time. In that case the static way could be faster due to cheaper continuations.
If our goal is to show that the run-time approach is faster, we can create a benchmark with many false cut-points, which have a performance cost in the static approach but are successfully bypassed in the dynamic one.

That explains the performance difference between the "hybrid" and "loom" benchmarks, i.e. the particular difference between Kotlin's "delay" and Loom's "Thread.sleep".
On the heavy case (100 ms delay, 100000 tasks), Loom's mount/unmount takes ~15% of CPU, while Kotlin's continuation save/restore takes ~3% of CPU.


________________________________________
From: jdk-dev <jdk-dev-retn at openjdk.org> on behalf of Alan Bateman <Alan.Bateman at oracle.com>
Sent: Monday, April 3, 2023 2:46 AM
To: Volker Simonis
Cc: Attila Kelemen; jdk-dev
Subject: Re: Virtual Threads performance

On 03/04/2023 10:40, Volker Simonis wrote:
>
> What is "STPE"?

Apologies, I should have made that clear. I mean
j.u.concurrent.ScheduledThreadPoolExecutor.

-Alan

