Experience using virtual threads in EA 23-loom+4-102
Matthew Swift
matthew.swift at gmail.com
Sat Jul 6 18:20:35 UTC 2024
Thanks, everyone, for the helpful responses and continued discussion.
Returning to my original message about high tail latencies when using
virtual threads compared with platform threads, Viktor's response, in
particular, prompted me to question whether I was missing something
obvious. I fully agree that the high tail latencies are due to
oversaturation, but I still wanted to understand why the tail latencies
were an order of magnitude worse when using virtual threads.
I'm pleased to report (and slightly embarrassed!) that the cause of the
higher tail latencies is slightly more banal than subtle scheduling
characteristics of the FJ pool, etc.
As a reminder, the test involved a client sustaining 120[*] concurrent
database transactions, each touching and locking over 400 DB records. The
fixed-size platform thread pool had 40 threads, which had the effect of
limiting concurrency to 40 simultaneous transactions. I naively swapped the
platform thread pool for a virtual thread-per-task executor, which removed
that concurrency limit, allowing all 120 transactions to run concurrently,
with a corresponding increase in lock thrashing / convoys.
To test this, I adjusted the platform thread pool to 120 threads and
observed exactly the same high tail latencies that I saw with virtual
threads. Conversely, the high tail latencies were reduced by an order of
magnitude by throttling the writes using a Semaphore. In fact, I was able
to reduce the P99.999 latency from over 10 seconds to 200ms with 20
permits, with no impact on P50.
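
For illustration, the throttle amounts to something like the following
sketch (class and method names are invented for this email; the real code
obviously lives in our write path rather than a standalone class):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Semaphore;

    public class ThrottledWrites {
        // 20 permits reduced P99.999 from >10s to ~200ms in our test.
        private static final Semaphore WRITE_PERMITS = new Semaphore(20);
        private static final ExecutorService EXECUTOR =
                Executors.newVirtualThreadPerTaskExecutor();

        static void submitWrite(Runnable writeTxn) {
            EXECUTOR.submit(() -> {
                // Blocking in acquire() parks the virtual thread and
                // frees its carrier, so waiting here is cheap.
                WRITE_PERMITS.acquireUninterruptibly();
                try {
                    writeTxn.run();
                } finally {
                    WRITE_PERMITS.release();
                }
            });
        }
    }
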
What's the conclusion here? In hindsight, the guidance in the section "Do
not pool virtual threads" of JEP 444 is pretty clear about using Semaphores
to limit concurrent access to limited resources. I think the platform
thread pool we were using had a small number of threads (e.g. 2 x #CPUs),
as it was sized for a mix of read txns (CPU-bound, minimal locking) and
write txns (low-latency SSD IO, short-lived record-level locks), rather
than high-latency network IO. Thus, coincidentally, we never really
observed significant lock-thrashing-related tail latencies and, therefore,
didn't think we needed to limit concurrent access for writes.
Cheers,
Matt
[*] I know virtual threads are intended for use cases involving thousands
of concurrent requests. We see virtual threads as an opportunity for
implementing features that may block threads for significant periods of
time, such as fine-grained rate limiting, distributed transactions, and
offloading password hashing to external microservices. Naturally, there
will be a correspondingly significant increase in the number of concurrent
requests. However, core functionality still needs to perform well and
degrade gracefully, hence the test described in this thread.
On Fri, 21 Jun 2024 at 19:05, Matthew Swift <matthew.swift at gmail.com> wrote:
> Hello again,
>
> As promised, here is my second (shorter, I hope!) email sharing feedback
> on the recent Loom EA build (23-loom+4-102). It follows up on my previous
> email: https://mail.openjdk.org/pipermail/loom-dev/2024-June/006788.html.
>
> I performed some experiments using the same application described in my
> previous email. However, in order to properly test the improvements to
> Object monitors (synchronized blocks and Object.wait()), I reverted all
> of the thread-pinning-related changes that I had made to support virtual
> threads on JDK 21. Specifically, I reverted the changes that converted
> uses of monitors to ReentrantLock.
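>
> For concreteness, the shape of the change I reverted was roughly this
> (an illustrative sketch, not our actual code; Counter is a made-up
> example):
>
>     import java.util.concurrent.locks.ReentrantLock;
>
>     class Counter {
>         private final ReentrantLock lock = new ReentrantLock();
>         private long count;
>
>         // JDK 21 workaround: ReentrantLock, so a virtual thread that
>         // blocks inside the critical section doesn't pin its carrier.
>         void incrementWithLock() {
>             lock.lock();
>             try {
>                 count++;
>             } finally {
>                 lock.unlock();
>             }
>         }
>
>         // Reverted form: a plain monitor, which this EA build can now
>         // unmount from the carrier thread instead of pinning it.
>         synchronized void increment() {
>             count++;
>         }
>     }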
>
> I'm pleased to say that this EA build looks extremely promising! :-)
>
> ### Experiment #1: read stress test
>
> * platform threads: 215K/s throughput, CPU 14% idle
> * virtual threads: 235K/s throughput, CPU 5% idle.
>
> Comment: there's a slight throughput improvement, but CPU utilization is
> slightly higher too. Presumably this is due to the number of carrier
> threads being closely matched to the number of CPUs (I noticed
> significantly less context switching with virtual threads).
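>
> For reference, the scheduler's parallelism defaults to the number of
> available processors; anyone wanting to experiment can override it with
> a system property, e.g.:
>
>     java -Djdk.virtualThreadScheduler.parallelism=8 ...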
>
> ### Experiment #2: heavily indexed write stress test, with 40 clients
>
> * platform threads: 9300/s throughput, CPU 27% idle
> * virtual threads: 8800/s throughput, CPU 24% idle.
>
> Comment: there is a ~5% performance degradation using virtual threads.
> This is better than the degradation I observed in my previous email
> after switching to ReentrantLock, though.
>
> ### Experiment #3: extreme, heavily indexed write stress test, with 120
> clients
>
> * platform threads: 1450/s throughput
> * virtual threads: 1450/s throughput (i.e. about the same).
>
> Comment:
>
> This test is intended to stress the internal locking mechanisms as much
> as possible and expose any pinning problems.
> With JDK 21 virtual threads, the test would sometimes deadlock, and
> thread dumps would show 100+ fork-join carrier threads.
> This is no longer the case with the EA build. It looks really solid.
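>
> (Tip for anyone reproducing this: the JSON thread dump added alongside
> virtual threads makes those carrier threads easy to spot, e.g.:
>
>     jcmd <pid> Thread.dump_to_file -format=json /tmp/threads.json
>
> where <pid> is the server's process id.)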
>
> This test does expose one important difference between platform threads
> and virtual threads, though. Let's take a look at the response times:
>
> Platform threads:
>
> ---------------------------------------------------------------------------
> |   Throughput    |                Response Time                |         |
> |  (ops/second)   |               (milliseconds)                |         |
> | recent  average | recent  average    99.9%   99.99%  99.999%  | err/sec |
> ---------------------------------------------------------------------------
> ...
> | 1442.6   1606.6 | 83.097   74.683   448.79   599.79   721.42  |   0.0   |
> | 1480.8   1594.0 | 81.125   75.282   442.50   599.79   721.42  |   0.0   |
>
> Virtual threads:
>
> ---------------------------------------------------------------------------
> |   Throughput    |                Response Time                |         |
> |  (ops/second)   |               (milliseconds)                |         |
> | recent  average | recent  average    99.9%   99.99%  99.999%  | err/sec |
> ---------------------------------------------------------------------------
> ...
> | 1445.4   1645.3 | 81.375   72.623  3170.89  4798.28  8925.48  |   0.0   |
> | 1442.2   1625.0 | 81.047   73.371  3154.12  4798.28  6106.91  |   0.0   |
>
> The outliers with virtual threads are much, much higher. Could this be
> due to potential starvation when rescheduling virtual threads in the
> fork-join pool?
>
> Cheers,
> Matt
>
>