Understanding Virtual thread performance for quick API calls

Rob Bygrave robin.bygrave at gmail.com
Thu Jul 25 10:06:42 UTC 2024


Some thoughts in case it's useful (I've done a similar test quite a long
time ago).

To me, it looks like the VT test will try to issue the 10K requests *almost*
concurrently, whereas with platform threads the number of concurrent requests
is limited to the number of available processors (so there are far fewer
concurrent requests in the platform thread case, and there is a good chance
the server is much happier at that lower concurrency).
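
Roughly, the shape of the submission side I'm picturing is something like this
(just a sketch for illustration, not the actual benchmark code; the apiCall
body is a placeholder for the blocking HTTP request):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SubmissionShape {
    public static void main(String[] args) {
        Runnable apiCall = () -> { /* placeholder: blocking HTTP GET, ~4ms server side */ };

        // Virtual threads: every submitted task gets a thread immediately,
        // so close to 10K requests can be in flight at once.
        try (ExecutorService vt = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 10_000; i++) {
                vt.submit(apiCall);
            }
        } // close() waits for the submitted tasks to finish

        // Fixed platform pool: in-flight requests are capped at the pool size
        // (availableProcessors), so the server only ever sees that many at once.
        int nCpu = Runtime.getRuntime().availableProcessors();
        try (ExecutorService pool = Executors.newFixedThreadPool(nCpu)) {
            for (int i = 0; i < 10_000; i++) {
                pool.submit(apiCall);
            }
        }
    }
}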

The first note is that the VT test can look like a *denial of
service attack*. As I see it, for this test to be useful the server needs
to be able to handle the level of concurrency we want to throw at it. When
I ran this test I needed my http server to run VT in order to handle the
high level of concurrent requests (as otherwise the http server hits its
max concurrency and then effectively queues requests). If you know the max
concurrency that the http server can handle, then you can run interesting
tests around that max. The question for this test is whether
http://192.168.1.17:8080/v1/crawl/delay/ is happy with the level of
concurrency being thrown at it in the VT case.
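
One way to explore behaviour around that max (a sketch only; the
maxConcurrency value here is hypothetical and would come from whatever the
server can actually handle) is to keep virtual threads but gate the number of
in-flight requests with a Semaphore:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

public class BoundedVtSubmit {
    public static void main(String[] args) throws Exception {
        int maxConcurrency = 200;                 // hypothetical server limit to probe around
        Semaphore inFlight = new Semaphore(maxConcurrency);
        Runnable apiCall = () -> { /* placeholder: blocking HTTP GET to the test endpoint */ };

        try (ExecutorService vt = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 10_000; i++) {
                inFlight.acquire();               // blocks submission once the limit is reached
                vt.submit(() -> {
                    try {
                        apiCall.run();
                    } finally {
                        inFlight.release();
                    }
                });
            }
        }
    }
}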

When I ran a similar test (quite some time ago) I also found empirically
that I needed to slow down the submission of jobs in the VT case. That is,
things weren't happy when 10K requests were submitted in a really tight loop,
and I believe I added some LockSupport.parkNanos() calls inside that tight
10K loop in order to get what I thought was "better" behaviour on the test.
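
It was something along these lines (again just a sketch; the 50µs pause is an
illustrative value I would tune by experiment):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.locks.LockSupport;

public class ThrottledSubmit {
    public static void main(String[] args) {
        Runnable apiCall = () -> { /* placeholder: blocking HTTP GET */ };

        try (ExecutorService vt = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 10_000; i++) {
                vt.submit(apiCall);
                // Back off a little between submissions so the 10K requests
                // don't hit the server in one burst; tune the pause per target.
                LockSupport.parkNanos(50_000); // 50µs, illustrative only
            }
        }
    }
}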

Hopefully that gives you some ideas.

Cheers, Rob.

On Thu, 25 Jul 2024 at 21:12, David <david.vlijmincx at gmail.com> wrote:

> Hi,
>
> I'm looking for some insight into some benchmark performance [1]. I'm
> comparing the performance of Virtual threads and Platform threads by
> submitting 10_000 tasks that perform 1 or 2 API requests to either
> a newVirtualThreadPerTaskExecutor or an
> Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors()).
>
> I expected the Virtual threads to perform better because the tasks don't
> look CPU-bound (response times around 4ms), but platform threads
> outperformed the Virtual threads when a response returned within 9ms. The
> results using JDK 24-loom+1-17 (2024/6/22) were the following:
>
> Benchmark                                    (delay)  (numberOfCalls)  Mode  Cnt  Score    Error  Units
> Improv.LoomBenchmark.FixedPlatformPool             0                1  avgt   10  0.808 ±  0.001   s/op
> Improv.LoomBenchmark.FixedPlatformPool             0                2  avgt   10  1.614 ±  0.001   s/op
> Improv.LoomBenchmark.FixedPlatformPool             1                1  avgt   10  1.239 ±  0.002   s/op
> Improv.LoomBenchmark.FixedPlatformPool             1                2  avgt   10  2.473 ±  0.004   s/op
> Improv.LoomBenchmark.FixedPlatformPool             2                1  avgt   10  1.461 ±  0.003   s/op
> Improv.LoomBenchmark.FixedPlatformPool             2                2  avgt   10  2.918 ±  0.004   s/op
> Improv.LoomBenchmark.FixedPlatformPool             3                1  avgt   10  1.730 ±  0.002   s/op
> Improv.LoomBenchmark.FixedPlatformPool             3                2  avgt   10  3.461 ±  0.008   s/op
> Improv.LoomBenchmark.FixedPlatformPool             4                1  avgt   10  2.017 ±  0.003   s/op
> Improv.LoomBenchmark.FixedPlatformPool             4                2  avgt   10  4.032 ±  0.007   s/op
> Improv.LoomBenchmark.FixedPlatformPool             5                1  avgt   10  2.317 ±  0.003   s/op
> Improv.LoomBenchmark.FixedPlatformPool             5                2  avgt   10  4.633 ±  0.006   s/op
>
> Improv.LoomBenchmark.virtualThreadExecutor         0                1  avgt   10  2.976 ±  0.262   s/op
> Improv.LoomBenchmark.virtualThreadExecutor         0                2  avgt   10  2.909 ±  0.093   s/op
> Improv.LoomBenchmark.virtualThreadExecutor         1                1  avgt   10  2.853 ±  0.181   s/op
> Improv.LoomBenchmark.virtualThreadExecutor         1                2  avgt   10  2.913 ±  0.146   s/op
> Improv.LoomBenchmark.virtualThreadExecutor         2                1  avgt   10  2.875 ±  0.254   s/op
> Improv.LoomBenchmark.virtualThreadExecutor         2                2  avgt   10  2.876 ±  0.112   s/op
> Improv.LoomBenchmark.virtualThreadExecutor         3                1  avgt   10  2.772 ±  0.126   s/op
> Improv.LoomBenchmark.virtualThreadExecutor         3                2  avgt   10  2.856 ±  0.196   s/op
> Improv.LoomBenchmark.virtualThreadExecutor         4                1  avgt   10  2.731 ±  0.166   s/op
> Improv.LoomBenchmark.virtualThreadExecutor         4                2  avgt   10  2.888 ±  0.155   s/op
> Improv.LoomBenchmark.virtualThreadExecutor         5                1  avgt   10  2.855 ±  0.213   s/op
> Improv.LoomBenchmark.virtualThreadExecutor         5                2  avgt   10  2.906 ±  0.172   s/op
>
> The delay column shows how much delay in milliseconds was added by the
> end-point. Without any added delay the response time is about 4ms. I
> attached two visualizations of the results to this mail.
>
> I'd be grateful if someone could shed some light on potential reasons
> behind virtual threads underperforming for these short-lived API calls with
> response times below 9ms. Are there any best practices for benchmarking
> Virtual threads that I might be missing?
>
> Kind regards,
> David
>
> [1]:
> https://github.com/davidtos/VirtualThreadPerformanceShortLivedTasks/blob/master/src/main/java/Improv/LoomBenchmark.java
>