Virtual Threads support in Quarkus - current integration and ideal

Ron Pressler ron.pressler at oracle.com
Tue Jul 26 15:10:41 UTC 2022


Hi and thank you very much for your report!

It has been our experience as well that trying to marry an asynchronous engine with virtual threads is cumbersome and often wasteful. Writing the entire pipeline with simple blocking in mind gave us not only superior performance, but a much smaller and simpler codebase, and that would be the approach I’d recommend. I expect that there will soon be HTTP servers demonstrating that simple approach. However, if you wish to use an existing async engine, I think the approach you’ve taken — spawning/unblocking a virtual thread running in the virtual thread scheduler — is probably the best one.

Integrating explicit scheduler loops with virtual threads via custom schedulers is on the roadmap, but, encouraged by the performance of servers that take the “full simple” approach, this might not be a top priority and might take some time [1]. The API was removed for the simple reason that it’s just not ready, as you noticed.

As for memory footprint, although this might not be the cause of your issue, it might interest you to know that we’re now working on dramatically reducing the footprint of virtual thread stacks. That work also wasn’t ready for 19, but is a higher priority than custom schedulers. So I’m interested to know how much of that excess footprint is due to virtual thread stacks (those would appear as jdk.internal.vm.StackChunk objects in your heap).

What I’d like to hear more about is pinning, and what common causes of it you see. I would also be interested to hear your thoughts about how much of it is due to ecosystem readiness (e.g. some JDBC drivers don’t pin while others still do, although that’s expected to change).

— Ron

[1]: The “mechanical sympathy” effects you alluded to are real but too small in comparison to the throughput increase of thread-per-request code for them to be an immediate focus, especially as a work-stealing scheduler has pretty decent mechanical sympathy already. On the other hand, there are other reasons to support custom schedulers (e.g. UI event threads) that might shift the priority balance.


On 26 Jul 2022, at 15:13, Clement Escoffier <clement.escoffier at redhat.com> wrote:

Hello,

This email reports our observations around Loom, mainly in the context of Quarkus. It discusses the current approach and our plans. We are sharing this information on our current success and challenges with Loom. Please let us know your thoughts and questions on our approach(es).

Context

Since the early days of the Loom project, we have been looking at various approaches to integrate Loom (mostly virtual threads) into Quarkus. Our goal was (and still is) to dispatch processing (HTTP requests, Kafka messages, gRPC calls) on virtual threads. Thus, the user would not have to think about blocking or not blocking (more on that later as it relates to the Quarkus architecture) and can write synchronous code without limiting the application's concurrency.

To achieve this, we need to dispatch the processing on virtual threads but also have compliant clients to invoke remote services (HTTP, gRPC…), send messages (Kafka, AMQP), or interact with a data store (SQL or NoSQL).

Quarkus Architecture

Before going further, we need to explain how Quarkus is structured. Quarkus is based on a reactive engine (Netty + Eclipse Vert.x), so under the hood, Quarkus uses event loops to schedule the workloads and non-blocking I/O. There is also the possibility of using Netty Native Transport (epoll, kqueue, io_uring).

The processing can be dispatched either directly to the event loop or to a worker thread (OS thread). In the first case, the code must be written in an asynchronous and non-blocking manner. Quarkus proposes a programming model and safety guards to write such code. In the latter case, the code can be blocking.

Quarkus decides which dispatching strategy it uses for each processing job. The decision is based on the method signatures and annotations (for example, the user can force it to be called on an event loop or a worker thread).

When using a worker thread, the request is received on an event loop and dispatched to the worker thread, and when the response is ready to be written (when it fits in memory), Quarkus switches back to the event loop.

The current approach

The integration of Loom's virtual threads is currently based[1] on a new annotation (@RunOnVirtualThread). It introduces a third dispatching strategy, and methods annotated with this annotation are called on a virtual thread. So, we now have three possibilities:


  * Execute the processing on an event loop thread - the code must be non-blocking
  * Execute the processing on an OS (worker) thread - with the thread cost and concurrency limit
  * Execute the processing on a virtual thread

The following snippet shows an elementary example:

@GET
@Path("/loom")
@RunOnVirtualThread
Fortune example() {
     var list = repo.findAll();
     return pickOne(list);
}


This support is already experimentally available in Quarkus 2.10.
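
As a rough illustration of the three dispatching strategies listed above, here is a minimal sketch using only JDK executors (it assumes JDK 21 or later). The class DispatchSketch, the Strategy enum, and the pool size of 4 are made up for this example. This is not how Quarkus implements dispatching; it only shows the three kinds of threads a handler can end up on.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for the three dispatching strategies; not Quarkus code.
class DispatchSketch {
    enum Strategy { EVENT_LOOP, WORKER, VIRTUAL_THREAD }

    // Runs a trivial task under the chosen strategy and reports whether it ran on a virtual thread.
    static boolean ranOnVirtualThread(Strategy strategy) {
        ExecutorService executor = switch (strategy) {
            // One platform thread shared by all tasks: handlers must not block.
            case EVENT_LOOP -> Executors.newSingleThreadExecutor();
            // Bounded pool of OS threads: blocking is fine, concurrency is capped (4 is arbitrary).
            case WORKER -> Executors.newFixedThreadPool(4);
            // One new virtual thread per task: blocking is fine, no fixed cap.
            case VIRTUAL_THREAD -> Executors.newVirtualThreadPerTaskExecutor();
        };
        try {
            return CompletableFuture
                    .supplyAsync(() -> Thread.currentThread().isVirtual(), executor)
                    .join();
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        for (Strategy strategy : Strategy.values()) {
            System.out.println(strategy + " -> virtual thread: " + ranOnVirtualThread(strategy));
        }
    }
}
```

In this sketch only the VIRTUAL_THREAD strategy reports a virtual thread; the other two run the task on platform threads.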

Previous attempts

The current approach is not our first attempt. We discarded two earlier approaches, although the second one is something we want to reconsider.

First Approach - All workers are virtual threads

The first approach was straightforward. The idea was to replace the worker (OS) threads with virtual threads. However, we quickly realized some limitations. Long-running (purely CPU-bound) processing would block the carrier thread, as there is no preemption. While the user should be aware that long-running processing should not be executed on virtual threads, in this model, it was not explicit. We also started observing carrier thread pinning situations (our current approach still has this issue; we will explain our band-aid later).

Second Approach - Marry event loops and carrier threads

Quarkus is designed to reduce the memory usage of the application. We are obsessed with RSS usage, especially when running in a container where resources are scarce. It has driven lots of our architecture choices, including the dimensioning strategies (number of event loops, number of worker threads…).

Thus, we investigated the possibility of avoiding having a second carrier thread pool and reducing the number of switches between the event loops and the carrier threads. We tried to use Netty event loops as carrier threads to achieve this. We had to use private APIs (which used to be public at some point in early access builds) to implement such an approach [3].

Unfortunately, we quickly ran into issues (explaining why our method is not part of the public API). Typically we had deadlock situations when a carrier thread shared locks with virtual threads. This made it impossible to use event-loops as carriers considering the probability of lock sharing.

That custom scheduling strategy also prevents work stealing (Netty event loops do not handle work stealing) and must keep a strict ordering between I/O tasks.

Pros and Cons of the current approach

Our current approach (based on @RunOnVirtualThread) integrates smoothly with the rest of Quarkus (even if the integration is limited to the HTTP part at the moment, as the integration with Kafka and gRPC is slightly more complicated but not impossible).
The user's code is written synchronously, and the users are aware of the dispatching strategy. Given the limitations mentioned before, we still believe it's a good trade-off, even if not ideal.

However, the chances of pinning the carrier threads are still very high (caused by pervasive usage in the ecosystem of certain common JDK features - synchronized, JNI, etc.). Because we would like to reduce the number of carrier threads to the bare minimum (to limit the RSS usage), we can end up with an underperforming application, which would have a concurrency level lower than the classic worker thread approach with pretty lousy response times.
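
To make the pinning concern concrete, here is a minimal sketch (assuming JDK 21 or later) of a guarded section that blocks while holding a lock. On the JDK builds discussed in this thread, doing the same inside a synchronized block would pin the carrier thread, whereas a java.util.concurrent lock lets the virtual thread unmount. The class PinningSketch and its methods are hypothetical and only serve the illustration.

```java
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical example: a critical section that blocks without pinning the carrier thread.
class PinningSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private int processed;

    // Replaces what would otherwise be a "synchronized" method. Blocking (here, a sleep)
    // while holding a ReentrantLock parks only the virtual thread.
    void process() throws InterruptedException {
        lock.lock();
        try {
            Thread.sleep(Duration.ofMillis(1)); // stands in for blocking I/O
            processed++;
        } finally {
            lock.unlock();
        }
    }

    // Reads the counter under the same lock so the value is safely visible.
    int processed() {
        lock.lock();
        try {
            return processed;
        } finally {
            lock.unlock();
        }
    }

    // Runs the given number of tasks, each on its own virtual thread.
    static int run(int tasks) {
        PinningSketch sketch = new PinningSketch();
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < tasks; i++) {
                executor.submit(() -> {
                    sketch.process();
                    return null; // Callable form, so the checked exception is allowed
                });
            }
        } // close() waits for all submitted tasks to finish
        return sketch.processed();
    }

    public static void main(String[] args) {
        System.out.println("processed: " + run(20));
    }
}
```

The catch is that this only helps for code we control; synchronized blocks inside third-party libraries and drivers are exactly the ecosystem problem described above.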

The Netty / Loom dance

We implemented a band-aid to reduce the chance of pinning while not limiting the users to a small set of Quarkus APIs. Remember, Quarkus is based on a reactive core, and most of our APIs are available in two forms:

  * An imperative form blocking the caller thread when dealing with I/O
  * A reactive form that is non-blocking (reactive programming)

To avoid thread pinning when running on a virtual thread, we offer the possibility to use the reactive form of our APIs but block the virtual thread while the result is still being computed. These awaits do not block the carrier thread and can be used with APIs returning zero or one result, but also with streams (like Kafka topics), where you receive an iterator.

As said above, this is a band-aid until we have non-pinning clients/APIs.

Under the hood, there is a dance between the virtual thread and the Netty event loop (used by the reactive API). It introduces a few unfortunate switches but works around the pinning issue.
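
The shape of that dance can be sketched with plain JDK types (assuming JDK 21 or later). In this sketch, a single-thread executor named "event-loop" plays the Netty event loop and CompletableFuture plays the reactive result type. Both are stand-ins chosen for illustration, and AwaitSketch, findFortuneAsync, and handleRequest are hypothetical names, not Quarkus or Vert.x APIs.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-ins for the event loop and the reactive API; not Quarkus or Vert.x code.
class AwaitSketch {
    // "Reactive" form: returns immediately, the value is produced on the event-loop thread.
    static CompletableFuture<String> findFortuneAsync(ExecutorService eventLoop) {
        return CompletableFuture.supplyAsync(
                () -> "fortune from " + Thread.currentThread().getName(), eventLoop);
    }

    // Handles one "request" on a virtual thread that awaits the reactive result.
    static String handleRequest() {
        ExecutorService eventLoop =
                Executors.newSingleThreadExecutor(r -> new Thread(r, "event-loop"));
        try (ExecutorService virtualThreads = Executors.newVirtualThreadPerTaskExecutor()) {
            return CompletableFuture.supplyAsync(() -> {
                // First switch: the work goes to the event loop.
                // join() parks only the virtual thread until the event loop completes the future,
                // which is the second switch back to the virtual thread.
                String fortune = findFortuneAsync(eventLoop).join();
                return fortune + ", awaited on virtual thread: "
                        + Thread.currentThread().isVirtual();
            }, virtualThreads).join();
        } finally {
            eventLoop.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(handleRequest());
    }
}
```

The two hand-offs per call in this sketch correspond to the "unfortunate switches" mentioned above.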

Observations

Over the past year, we ran many tests to design and implement our integration. The current approach is far from ideal, but it works fine.
We get excellent results when comparing it with a fully reactive approach and a worker approach (Quarkus can have the three variants in the same app).
The response time under load is close enough to the reactive approach. It is far better than the classic worker thread approach [1][2].

However (remember we are obsessed with RSS), the RSS usage is very high, even higher than with the worker thread approach. At the moment, we are investigating which objects account for it and where they come from. We hope to have a better understanding after the summer. Our observations show that the performance penalty is likely due to memory consumption (and GC cycles). However, as said, we are still investigating.

Ideally

For us (Quarkus) and probably several other Java frameworks based on Netty, it would be terrific if we could find a way to reconcile the two scheduling strategies (in a sense, we would use the event loops as carrier thread). Of course, there will be trade-offs and limitations. Our initial attempt didn't end well, but that does not mean it's a dead end.

An event-loop carrier thread would greatly benefit the underlying reactive engine (Netty/Vert.x in the case of Quarkus). It retains some event-loop execution semantics: code is multithreaded (in the virtual thread sense) yet executed with a single carrier thread that respects the event-loop principles and should have decent mechanical sympathy. In addition, it should enable using classic blocking constructs (e.g., java.util.concurrent.locks.Lock), whereas currently, a virtual thread can only block on Vert.x constructs (e.g., Vert.x futures but not java.util.concurrent.locks.Lock), as Vert.x needs to be aware of the thread suspension to schedule event dispatching in a race-free / deadlock-free manner.

With such integration, virtual threads would be executed on the event loop. When they "block", they would be unmounted, and I/O or another virtual thread would be processed. That would reduce the number of switches between threads, reduce RSS usage, and allow lots of Java frameworks to leverage Loom virtual threads quickly. Of course, this approach can only be validated empirically. Typically, it adds latency to every virtual thread dispatch. In addition, watchdogs would need to be implemented to prevent (or at least warn the user about) the execution of long CPU-intensive actions that do not yield in an acceptable time.
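
A watchdog of the "warn the user" kind could look roughly like the sketch below. It only uses JDK classes; WatchdogSketch, the 10 ms check period, and the thresholds are assumptions made for the example, and a real implementation inside an event loop would certainly differ.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical watchdog: warns when a task has been running without yielding for too long.
class WatchdogSketch {
    private final long thresholdNanos;
    private final List<String> warnings = new CopyOnWriteArrayList<>();
    private volatile Long taskStartNanos; // null while no task is running

    WatchdogSketch(long thresholdMillis) {
        this.thresholdNanos = TimeUnit.MILLISECONDS.toNanos(thresholdMillis);
    }

    // Runs a task on the calling thread and records its start time for the watchdog.
    void runWatched(Runnable task) {
        taskStartNanos = System.nanoTime();
        try {
            task.run();
        } finally {
            taskStartNanos = null;
        }
    }

    // Called periodically from a separate thread; it warns but does not interrupt the task.
    void check() {
        Long start = taskStartNanos;
        if (start != null && System.nanoTime() - start > thresholdNanos) {
            warnings.add("task running for more than "
                    + TimeUnit.NANOSECONDS.toMillis(thresholdNanos) + " ms without yielding");
        }
    }

    // Simulates CPU-bound work that never blocks, so it would never unmount from its carrier.
    static void spin(long millis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(millis);
        while (deadline - System.nanoTime() > 0) {
            Thread.onSpinWait();
        }
    }

    // Runs one CPU-bound task under the watchdog and returns how many warnings were raised.
    static int countWarnings(long taskMillis, long thresholdMillis) {
        WatchdogSketch watchdog = new WatchdogSketch(thresholdMillis);
        ScheduledExecutorService checker = Executors.newSingleThreadScheduledExecutor();
        checker.scheduleAtFixedRate(watchdog::check, 10, 10, TimeUnit.MILLISECONDS);
        try {
            watchdog.runWatched(() -> spin(taskMillis));
        } finally {
            checker.shutdownNow();
        }
        return watchdog.warnings.size();
    }

    public static void main(String[] args) {
        System.out.println("warnings for a 300 ms task, 20 ms threshold: " + countWarnings(300, 20));
    }
}
```

Note that this sketch can only report the problem; actually preventing a non-yielding task from monopolizing the event loop would need support that plain Java code does not have.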

Conclusion

Our integration of Loom virtual threads in Quarkus is already available to our users, and we will be collecting feedback.

As explained in this email, we have thus identified two issues.

The first one is purely about performance, and we were able to measure it empirically: the interaction between Loom and the Netty/Vert.x reactive stack seems to create an abundance of data structures that put pressure on the GC and degrade the overall performance of the application. As said above, we are investigating.

The second one is more general and also impacts programming with Quarkus/Vert.x and Loom. The goal is to reconcile the scheduling strategies of Loom and Netty/Vert.x. This could improve performance by decreasing the number of context switches (the Loom-Netty dance) and the RSS of an application. Moreover, it would enable the use of classic blocking constructs in Vert.x directly (i.e., without wrapping them in Vert.x's own abstractions). We could not yet validate or characterize the performance improvement of such a model. The result is unclear, as we don’t know if the decrease in context switches would be outweighed by the additional latency in virtual thread dispatch.

We are sharing this information on our current success and challenges with Loom.

Please let us know your thoughts and concerns on our approach(es). Thanks!

Clement


[1] - https://developers.redhat.com/devnation/tech-talks/integrate-loom-quarkus
[2] - https://github.com/anavarr/fortunes_benchmark
[3] - https://github.com/openjdk/loom/commit/cad26ce74c98e28854f02106117fe03741f69ba0

