Experience of adding JDK21 virtual thread support to a DB application

robert engels rengels at ix.netcom.com
Fri Jun 21 16:51:19 UTC 2024


I agree: I don’t think JFR is intelligent about j.u.c locks.

Maybe create your own sub-class of ReentrantLock and emit custom JFR events?

Otherwise there appears to be too much noise from the ForkJoin pool.
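To make the suggestion concrete, here is a minimal sketch of what such a subclass could look like, using the standard jdk.jfr event API. The class and event names are illustrative, not from any real codebase:

```java
import java.util.concurrent.locks.ReentrantLock;

import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;

// Hypothetical sketch: a ReentrantLock that records how long each
// acquisition took (including time spent blocked) as a custom JFR event.
class MonitoredReentrantLock extends ReentrantLock {

    @Name("com.example.LockAcquire")   // event name is an assumption
    @Label("Lock Acquire")
    static class LockAcquireEvent extends Event {
        @Label("Lock Id")
        String lockId;
    }

    private final String id;

    MonitoredReentrantLock(String id) {
        this.id = id;
    }

    @Override
    public void lock() {
        LockAcquireEvent event = new LockAcquireEvent();
        event.begin();               // start the event clock before blocking
        super.lock();
        event.end();                 // stop the clock once the lock is held
        if (event.shouldCommit()) {  // honours JFR enabled/threshold settings
            event.lockId = id;
            event.commit();
        }
    }
}
```

The same pattern could be applied to lockInterruptibly() and tryLock(long, TimeUnit); the duration between begin() and end() then shows up in JFR views grouped by stack trace, much like JavaMonitorEnter does for synchronized.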

> On Jun 21, 2024, at 10:57 AM, Matthew Swift <matthew.swift at gmail.com> wrote:
> 
> Hello,
> 
> This email is the first of a couple of emails that I'll send sharing our experiences of adding support for virtual threads in a high-performance distributed database. It gives some background, as well as focusing on the migration challenges, rather than the actual usage of virtual threads. The second email will focus on our experience when using virtual threads and, in particular, the recent JDK23 Loom EA build (https://mail.openjdk.org/pipermail/loom-dev/2024-May/006632.html). The third email will focus on a couple of enhancement requests, although some have already been discussed on this list before.
> 
> Before continuing, I'd like to say a big thank you to those of you who are working on the virtual thread support in Java. My colleagues and I feel that it is an absolute *game-changer* for writing complex scalable applications. Well done!
> 
> Summary of application architecture
> ===================================
> 
> * distributed high-scale Directory Service exposing LDAP and REST frontends for accessing 100s of millions of identities with sub-millisecond response times
> * NIO2 based network IO layer, including dedicated IO threads using AsynchronousChannelGroup
> * IO threads hand-off requests for processing to an executor service, originally implemented using fixed-sized ThreadPoolExecutor
> * requests are processed synchronously by the executor. Most query and/or update an embedded key-value DB (Oracle Berkeley DB Java Edition), but in some cases may call out to external services when proxying
> * extensive use of RxJava throughout, mostly due to the need to implement end-to-end backpressure when acting as a proxy.
> 
> Motivation to use virtual threads
> =================================
> 
> * all the reasons cited in JEP 444! :-) Although RxJava is an amazing asynchronous library, it still suffers from the problems associated with asynchronous programming, such as a steep learning curve and difficulty diagnosing problems
> * hard to tune for heterogeneous types of load. CPU intensive requests (e.g. reads) prefer small numbers of threads, otherwise throughput degrades significantly due to excessive context switching. IO intensive requests (e.g. writes) prefer more threads in order to maximize resource utilization
> * requirement to implement features that artificially delay requests, such as rate limiting. It would be so much easier if we could just use Semaphores and other blocking primitives.
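A hypothetical sketch of what the last point alludes to: with a virtual thread per request, a plain Semaphore can bound in-flight work with no asynchronous machinery. The class and method names here are illustrative only:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

// Illustrative only: cap concurrent requests at a fixed limit by
// blocking the (virtual) request thread until a permit is free.
class ConcurrencyLimiter {
    private final Semaphore permits;

    ConcurrencyLimiter(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    <T> T call(Callable<T> request) throws Exception {
        permits.acquire();           // cheap to block on a virtual thread
        try {
            return request.call();
        } finally {
            permits.release();
        }
    }
}
```

Blocking in acquire() parks only the virtual thread, so the carrier is free to run other requests; the RxJava equivalent would need an explicit backpressure pipeline.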
> 
> How did the migration go?
> =========================
> 
> * good so far :-)
> * we can now use virtual threads for core processing with JDK>=21 when a run-time feature flag is enabled (still uses ThreadPoolExecutor otherwise)
> * network listeners are still NIO2 + RxJava: will be migrated next to use blocking IO (incl. SSL), followed by removal of RxJava
> * as predicted, we initially encountered thread-pinning and deadlocks due to synchronization, as well as significant numbers of carrier threads due to use of Object.wait(). File IO was guarded by locking at a higher level, so I don't think it was directly causing expansion of the carrier thread pool
> * migration was particularly problematic in our third-party embedded key-value DB, which was using synchronization and Object.wait() extensively as part of its locking design. We were forced to fork the library in order to make the necessary changes. This was pretty straightforward though.
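For readers who haven't done such a migration: the change in the forked library is essentially the classic synchronized + Object.wait() to ReentrantLock + Condition rewrite. This is a generic sketch of that transformation, not the actual Berkeley DB JE code:

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative "after" state of the rewrite: ReentrantLock + Condition
// instead of synchronized + Object.wait()/notifyAll(), so waiting no
// longer pins the virtual thread to its carrier (pre-JDK24 semantics).
class LockTable {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition released = lock.newCondition();
    private boolean held;

    void acquire() throws InterruptedException {
        lock.lock();
        try {
            while (held) {
                released.await();    // parks the virtual thread, frees the carrier
            }
            held = true;
        } finally {
            lock.unlock();
        }
    }

    void release() {
        lock.lock();
        try {
            held = false;
            released.signalAll();
        } finally {
            lock.unlock();
        }
    }

    boolean isHeld() {
        lock.lock();
        try {
            return held;
        } finally {
            lock.unlock();
        }
    }
}
```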
> 
> ReentrantLock vs synchronized
> =============================
> 
> When NOT using virtual threads we have noticed an ~8% throughput reduction in one of our multi-index write stress tests, which appears to be due to contention on a single critical section responsible for serializing access to the embedded DB's transaction log.
> Profiling shows this critical section being entered approximately 500K times/s across 16 threads. When the critical section is protected with synchronized(), I get the following results once warmed up:
> 
> -------------------------------------------------------------------------------
> |     Throughput    |                 Response Time                |          | 
> |    (ops/second)   |                (milliseconds)                |          | 
> |   recent  average |   recent  average    99.9%   99.99%  99.999% |  err/sec | 
> -------------------------------------------------------------------------------
> ...
> |   9002.6   9002.8 |    4.433    4.432    40.63    94.37   114.29 |      0.0 | 
> |   9132.8   9014.6 |    4.370    4.426    40.37    87.56   114.29 |      0.0 | 
> 
> Yet when the synchronized block is replaced with a ReentrantLock:
> 
> -------------------------------------------------------------------------------
> |     Throughput    |                 Response Time                |          | 
> |    (ops/second)   |                (milliseconds)                |          | 
> |   recent  average |   recent  average    99.9%   99.99%  99.999% |  err/sec | 
> -------------------------------------------------------------------------------
> ...
> |   8431.2   8378.4 |    4.733    4.763    51.12   114.82   122.16 |      0.0 | 
> |   8200.4   8342.8 |    4.866    4.784    46.66   114.29   122.16 |      0.0 | 
> 
> These results are on my laptop (Intel i9-13900H with 10 cores, Linux 6.5, JDK 21.0.3), but we see similar regressions on larger older machines in our labs. Also worthy of note is that the context switching rate increases from around 180K/s when using synchronized() to 300K/s when using ReentrantLock. I know that ReentrantLock is supposed to be a little bit less efficient in highly contended situations, but I was surprised by an 8-10% impact. Is that expected?
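For what it's worth, the two variants being compared reduce to something like this (a stripped-down sketch; the real code serializes access to the transaction log, not a counter):

```java
import java.util.concurrent.locks.ReentrantLock;

// Stripped-down sketch of the two critical-section styles compared above.
// The protected work is deliberately tiny so that lock overhead dominates,
// as in the contended transaction-log case.
class CriticalSection {
    private final Object monitor = new Object();
    private final ReentrantLock lock = new ReentrantLock();
    long counter;

    void enterSynchronized() {
        synchronized (monitor) {
            counter++;
        }
    }

    void enterWithLock() {
        lock.lock();                 // parks via LockSupport when contended
        try {
            counter++;
        } finally {
            lock.unlock();
        }
    }
}
```

One commonly cited difference, which may explain part of the context-switch gap, is that HotSpot monitors can spin adaptively before parking, whereas contended ReentrantLock acquisition goes through AQS and LockSupport.park() fairly quickly; I can't say for sure that this accounts for the full 8-10%, though.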
> 
> Analyzing contention using JFR.view
> ===================================
> 
> On a related note, I attempted to use the new JFR.view command to analyze the lock contention in both situations:
> 
>     jcmd <pid> JFR.view contention-by-site verbose=true
> 
> Sadly, it only analyzes JavaMonitorEnter JFR events and completely ignores Thread park events, so I was unable to analyze contention due to ReentrantLock using this tool. Here's the query JFR.view uses:
> 
> COLUMN 'StackTrace', 'Count', 'Avg.', 'Max.' SELECT stackTrace AS S, COUNT(*), AVERAGE(duration), MAXIMUM(duration) AS M FROM JavaMonitorEnter GROUP BY S ORDER BY M
> 
> Do you think this is a bug/RFE? If so, what's the best way to proceed in your opinion? It would be great if tools like JFR.view had better support for j.u.c primitives given that virtual threads will encourage increased usage of them.
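Until JFR.view learns about j.u.c locks, one workaround is to dump the recording and inspect the park events directly with the jfr tool; contended ReentrantLock acquisition shows up as jdk.ThreadPark events whose parkedClass is the lock's sync object. Paths and the pid placeholder are illustrative:

```shell
# Dump the in-flight recording, then print the park events.
jcmd <pid> JFR.dump filename=/tmp/rec.jfr
jfr print --events jdk.ThreadPark /tmp/rec.jfr
```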
> 
> Kind regards,
> Matt
