Experience using virtual threads in EA 23-loom+4-102

Viktor Klang viktor.klang at oracle.com
Mon Jul 8 19:42:48 UTC 2024


Hi Matt,

Thanks for the write-up, I appreciate it!

Cheers,
√


Viktor Klang
Software Architect, Java Platform Group
Oracle
________________________________
From: loom-dev <loom-dev-retn at openjdk.org> on behalf of Matthew Swift <matthew.swift at gmail.com>
Sent: Saturday, 6 July 2024 20:20
To: loom-dev at openjdk.org <loom-dev at openjdk.org>
Subject: Re: Experience using virtual threads in EA 23-loom+4-102

Thanks everyone for the helpful responses and continued discussion.

Returning to my original message about high tail latencies when using virtual threads compared with platform threads, Viktor's response, in particular, prompted me to question whether I was missing something obvious. I fully agree that the high tail latencies are due to oversaturation, but I still wanted to understand why they were an order of magnitude worse when using virtual threads.

I'm pleased to report (and slightly embarrassed!) that the cause of the larger tail latencies is rather more banal than any subtle scheduling characteristics of the FJ pool, etc.

As a reminder, the test involved a client sustaining 120[*] concurrent database transactions, each touching and locking over 400 DB records. The fixed-size platform thread pool had 40 threads, which limited concurrency to 40 simultaneous transactions. I naively swapped the platform thread pool for a virtual-thread-per-task executor, which removed that concurrency limit, allowing all 120 transactions to run concurrently, with a corresponding increase in lock thrashing / convoys. To test this I adjusted the platform thread pool to 120 threads and observed exactly the same high tail latencies that I saw with virtual threads. Conversely, throttling the writes with a Semaphore reduced the high tail latencies by an order of magnitude: with 20 permits I was able to reduce the P99.999 latency from over 10 seconds to 200ms, with no impact on P50.
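To make the swap concrete, here is a minimal sketch of the kind of change described above (class and variable names are hypothetical, not the actual server code):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class PoolSwap {
        public static void main(String[] args) {
            // Before: a fixed pool of 40 platform threads implicitly capped
            // write concurrency at 40 simultaneous transactions.
            ExecutorService platformPool = Executors.newFixedThreadPool(40);

            // After: one virtual thread per task removes that cap, so all 120
            // client transactions run at once and contend for the record locks.
            ExecutorService virtualThreads = Executors.newVirtualThreadPerTaskExecutor();

            platformPool.close();
            virtualThreads.close();
        }
    }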

What's the conclusion here? In hindsight, the guidance in the section "Do not pool virtual threads" of JEP 444 is pretty clear about using Semaphores to limit concurrent access to limited resources. I think that the platform thread pool we were using had a small number of threads (e.g. 2 x #CPUs), as it was optimized for a mix of read txns (CPU bound, minimal locking) and write txns (low latency SSD IO, short-lived record-level locks), rather than high latency network IO. Thus, coincidentally, we never really observed significant lock-thrashing-related tail latencies and, therefore, didn't think we needed to limit concurrent access for writes.
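For reference, a minimal sketch of the Semaphore throttle along the lines JEP 444 suggests (the permit count is the one from the test above; the class and method names are hypothetical):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Semaphore;

    public class ThrottledWrites {
        // 20 permits: the value that reduced P99.999 from >10s to ~200ms here.
        private static final Semaphore WRITE_PERMITS = new Semaphore(20);

        static void writeTransaction(Runnable txn) throws InterruptedException {
            WRITE_PERMITS.acquire();   // blocks the virtual thread, not a carrier
            try {
                txn.run();             // touches and locks 400+ DB records
            } finally {
                WRITE_PERMITS.release();
            }
        }

        public static void main(String[] args) {
            try (ExecutorService exec = Executors.newVirtualThreadPerTaskExecutor()) {
                for (int i = 0; i < 120; i++) {
                    exec.submit(() -> {
                        writeTransaction(() -> { /* perform the DB work */ });
                        return null;
                    });
                }
            } // close() waits for the submitted tasks to finish
        }
    }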

Cheers,
Matt
[*] I know virtual threads are intended for use cases involving thousands of concurrent requests. We see virtual threads as an opportunity for implementing features that may block threads for significant periods of time, such as fine-grained rate limiting, distributed transactions and offloading password hashing to external microservices. Naturally, there will be a correspondingly significant increase in the number of concurrent requests. However, core functionality still needs to perform well and degrade gracefully, hence the test described in this thread.


On Fri, 21 Jun 2024 at 19:05, Matthew Swift <matthew.swift at gmail.com<mailto:matthew.swift at gmail.com>> wrote:
Hello again,

As promised, here is my second (shorter, I hope!) email sharing feedback on the recent Loom EA build (23-loom+4-102). It follows up on my previous email https://mail.openjdk.org/pipermail/loom-dev/2024-June/006788.html.

I performed some experiments using the same application described in my previous email. However, in order to properly test the improvements to Object monitors (synchronized blocks and Object.wait()), I reverted all of the thread-pinning-related changes that I had made to support virtual threads on JDK 21. Specifically, I reverted the changes converting uses of monitors to ReentrantLock.
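To illustrate the kind of change that was reverted, here is a simplified, hypothetical example (not the actual server code):

    import java.util.concurrent.locks.ReentrantLock;

    public class CounterExample {
        private final ReentrantLock lock = new ReentrantLock();
        private long value;

        // JDK 21 workaround: use ReentrantLock, because blocking while inside
        // a synchronized block pinned the carrier thread.
        long incrementWithLock() {
            lock.lock();
            try {
                return ++value;
            } finally {
                lock.unlock();
            }
        }

        // Reverted for this test: plain synchronized, to exercise the improved
        // object monitor support in the Loom EA build.
        synchronized long incrementWithMonitor() {
            return ++value;
        }
    }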

I'm pleased to say that this EA build looks extremely promising! :-)

### Experiment #1: read stress test

* platform threads: 215K/s throughput, CPU 14% idle
* virtual threads: 235K/s throughput, CPU 5% idle.

Comment: there's a slight throughput improvement, but CPU utilization is slightly higher too. Presumably this is due to the number of carrier threads being closely matched to the number of CPUs (I noticed significantly less context switching with virtual threads).
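As a rough way to check that assumption, a minimal sketch (the system property is the documented override for the default scheduler's parallelism; by default it matches the number of available processors):

    public class SchedulerParallelism {
        public static void main(String[] args) {
            int cpus = Runtime.getRuntime().availableProcessors();
            // Can be overridden with -Djdk.virtualThreadScheduler.parallelism=N;
            // when unset, the default scheduler uses one carrier per processor.
            String configured = System.getProperty("jdk.virtualThreadScheduler.parallelism");
            System.out.println("available processors = " + cpus
                    + ", configured parallelism = "
                    + (configured == null ? "(default: " + cpus + ")" : configured));
        }
    }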

### Experiment #2: heavily indexed write stress test, with 40 clients

* platform threads: 9300/s throughput, CPU 27% idle
* virtual threads: 8800/s throughput, CPU 24% idle.

Comment: there is a ~5% performance degradation with virtual threads. This is smaller than the degradation I observed in my previous email after switching to ReentrantLock, though.

### Experiment #3: extreme heavy indexed write stress test, with 120 clients

* platform threads: 1450/s throughput
* virtual threads: 1450/s throughput (i.e. about the same).

Comment:

This test is intended to stress the internal locking mechanisms as much as possible and expose any pinning problems.
With JDK 21 virtual threads, the test would sometimes deadlock, and thread dumps would show 100+ fork-join carrier threads.
This is no longer the case with the EA build. It looks really solid.

This test does expose one important difference between platform threads and virtual threads though. Let's take a look at the response times:

Platform threads:

-------------------------------------------------------------------------------
|     Throughput    |                 Response Time                |          |
|    (ops/second)   |                (milliseconds)                |          |
|   recent  average |   recent  average    99.9%   99.99%  99.999% |  err/sec |
-------------------------------------------------------------------------------
...
|   1442.6   1606.6 |   83.097   74.683   448.79   599.79   721.42 |      0.0 |
|   1480.8   1594.0 |   81.125   75.282   442.50   599.79   721.42 |      0.0 |

Virtual threads:

-------------------------------------------------------------------------------
|     Throughput    |                 Response Time                |          |
|    (ops/second)   |                (milliseconds)                |          |
|   recent  average |   recent  average    99.9%   99.99%  99.999% |  err/sec |
-------------------------------------------------------------------------------
...
|   1445.4   1645.3 |   81.375   72.623  3170.89  4798.28  8925.48 |      0.0 |
|   1442.2   1625.0 |   81.047   73.371  3154.12  4798.28  6106.91 |      0.0 |

The outliers with virtual threads are much, much higher. Could this be due to potential starvation when rescheduling virtual threads in the fork-join pool?

Cheers,
Matt
