Virtual Threads support in Quarkus - current integration and ideal

Clement Escoffier clement.escoffier at redhat.com
Tue Jul 26 14:13:27 UTC 2022


Hello,

This email reports our observations around Loom, mainly in the context of
Quarkus. It discusses our current approach and our plans, and shares what
has worked well so far and where we are still struggling. Please let us
know your thoughts and questions on our approach(es).

Context

Since the early days of the Loom project, we have been looking at various
approaches to integrate Loom (mostly virtual threads) into Quarkus. Our
goal was (and still is) to dispatch processing (HTTP requests, Kafka
messages, gRPC calls) on virtual threads. Thus, the user would not have to
think about blocking versus non-blocking (more on that later, as it relates
to the Quarkus architecture) and could write synchronous code without
limiting the application's concurrency.

To achieve this, we need to dispatch the processing on virtual threads but
also have compliant clients to invoke remote services (HTTP, gRPC…), send
messages (Kafka, AMQP), or interact with a data store (SQL or NoSQL).

Quarkus Architecture

Before going further, we need to explain how Quarkus is structured. Quarkus
is based on a reactive engine (Netty + Eclipse Vert.x), so under the hood,
Quarkus uses event loops to schedule the workloads and non-blocking I/O.
There is also the possibility of using Netty Native Transport (epoll,
kqueue, io_uring).

The processing can be dispatched either directly on the event loop or on a
worker thread (an OS thread). In the first case, the code must be written
in an asynchronous, non-blocking manner; Quarkus provides a programming
model and safety guards to write such code. In the second case, the code
can be blocking.

Quarkus decides which dispatching strategy to use for each processing job.
The decision is based on method signatures and annotations (for example,
the user can force a method to be called on an event loop or a worker
thread).

When using a worker thread, the request is received on an event loop and
dispatched to the worker thread, and when the response is ready to be
written (when it fits in memory), Quarkus switches back to the event loop.
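
For illustration, here is a minimal sketch (not taken from Quarkus itself)
of how these dispatching hints can look in a JAX-RS resource. @Blocking and
@NonBlocking are the usual annotations for forcing a strategy; the
FortuneService bean and the Fortune type are hypothetical and only used for
the example.

import io.smallrye.common.annotation.Blocking;
import io.smallrye.common.annotation.NonBlocking;
import io.smallrye.mutiny.Uni;
import javax.inject.Inject;
import javax.ws.rs.GET;
import javax.ws.rs.Path;

@Path("/fortunes")
public class FortuneResource {

    @Inject
    FortuneService service; // hypothetical bean exposing both API forms

    @GET
    @Path("/reactive")
    @NonBlocking                // stays on the event loop - must not block
    public Uni<Fortune> reactive() {
        return service.findOneAsync();
    }

    @GET
    @Path("/worker")
    @Blocking                   // dispatched to a worker (OS) thread - may block
    public Fortune blocking() {
        return service.findOne();
    }
}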

The current approach

The integration of Loom's virtual threads is currently based[1] on a new
annotation (@RunOnVirtualThread). It introduces a third dispatching
strategy: methods carrying this annotation are invoked on a virtual thread.
So, we now have three possibilities:


   - Execute the processing on an event loop thread - the code must be
     non-blocking

   - Execute the processing on an OS (worker) thread - with the thread cost
     and concurrency limit

   - Execute the processing on a virtual thread


The following snippet shows an elementary example:

@GET
@Path("/loom")
@RunOnVirtualThread
Fortune example() {
    var list = repo.findAll();
    return pickOne(list);
}


This support is already experimentally available in Quarkus 2.10.

Previous attempts

The current approach is not our first attempt. We tried two other approaches
that we discarded, although the second one is something we want to reconsider.

First Approach - All workers are virtual threads

The first approach was straightforward: replace the worker (OS) threads
with virtual threads. However, we quickly ran into limitations. Long-running
(purely CPU-bound) processing would block the carrier thread, as there is
no preemption. While the user should be aware that long-running processing
should not be executed on virtual threads, in this model it was not
explicit. We also started observing carrier thread pinning situations (our
current approach still has this issue; we will explain our band-aid later).
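
Conceptually, this first attempt amounts to swapping the bounded pool of OS
worker threads for a virtual-thread-per-task executor, along the lines of
the sketch below (the class and method names are ours, not Quarkus
internals):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class WorkerPools {

    // Classic model: a bounded pool of OS threads caps the concurrency and
    // carries the usual per-thread memory cost.
    static ExecutorService platformWorkers(int size) {
        return Executors.newFixedThreadPool(size);
    }

    // First attempt: every "worker" task runs on its own virtual thread.
    // Blocking I/O unmounts the virtual thread, but a purely CPU-bound task
    // monopolizes its carrier (no preemption), and synchronized or JNI
    // sections can pin the carrier thread.
    static ExecutorService virtualWorkers() {
        return Executors.newVirtualThreadPerTaskExecutor();
    }
}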

Second Approach - Marry event loops and carrier threads

Quarkus is designed to reduce the memory usage of the application. We are
obsessed with RSS usage, especially when running in a container where
resources are scarce. It has driven lots of our architecture choices,
including the dimensioning strategies (number of event loops, number of
worker threads…).

Thus, we investigated the possibility of avoiding a second thread pool for
carriers and reducing the number of switches between the event loops and
the carrier threads. To achieve this, we tried to use Netty event loops as
carrier threads. We had to use private APIs (which used to be public at
some point in early-access builds) to implement such an approach [3].
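
To give an idea of the shape of that experiment, the sketch below only uses
public Netty APIs: a Netty event loop is a single-threaded
ScheduledExecutorService, which is what made it a candidate carrier. The
actual wiring relied on the virtual-thread scheduler hook that existed in
early-access builds (see [3]); it is not shown here since it is no longer
publicly available.

import io.netty.channel.EventLoop;
import io.netty.channel.nio.NioEventLoopGroup;

public class EventLoopCarrierSketch {

    public static void main(String[] args) {
        // A single-threaded Netty event loop: an executor we hoped to reuse
        // as the carrier of virtual threads.
        NioEventLoopGroup group = new NioEventLoopGroup(1);
        EventLoop loop = group.next();

        // Tasks submitted here run on the event-loop thread; the experiment
        // consisted of handing this executor to the (now private) virtual
        // thread scheduler so that virtual threads would mount on it.
        loop.execute(() -> System.out.println("running on " + Thread.currentThread()));

        group.shutdownGracefully();
    }
}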

Unfortunately, we quickly ran into issues (which explains why this
mechanism is not part of the public API). Typically, we hit deadlock
situations when a carrier thread shared locks with virtual threads. Given
the probability of lock sharing, this made it impractical to use event
loops as carriers.

That custom scheduling strategy also prevents work stealing (Netty event
loops do not implement work stealing) and must preserve a strict ordering
of I/O tasks.

Pros and Cons of the current approach

Our current approach (based on @RunOnVirtualThread) integrates smoothly
with the rest of Quarkus (even if the integration is limited to the HTTP
part at the moment; the integration with Kafka and gRPC is slightly more
complicated but not impossible).

The user's code is written synchronously, and the users are aware of the
dispatching strategy. Given the limitations mentioned before, we still
believe it's a good trade-off, even if not ideal.

However, the chances of pinning the carrier threads are still very high,
caused by the pervasive use in the ecosystem of certain common JDK features
(synchronized, JNI, etc.). Because we would like to reduce the number of
carrier threads to the bare minimum (to limit RSS usage), we can end up
with an underperforming application whose concurrency level is lower than
with the classic worker thread approach, together with pretty poor response
times.
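
As a reminder of where pinning comes from, the sketch below contrasts the
two lock styles: blocking inside a synchronized block pins the carrier,
while a java.util.concurrent lock lets the virtual thread unmount. Running
with -Djdk.tracePinnedThreads=full prints the stacks involved, which helps
diagnose such cases.

import java.util.concurrent.locks.ReentrantLock;

public class PinningExample {

    private static final Object MONITOR = new Object();
    private static final ReentrantLock LOCK = new ReentrantLock();

    // Blocking while holding a monitor pins the virtual thread: the carrier
    // cannot be released until the synchronized block exits.
    static void pinning() throws InterruptedException {
        synchronized (MONITOR) {
            Thread.sleep(100);
        }
    }

    // The same logic with a j.u.c lock lets the virtual thread unmount while
    // sleeping, freeing the carrier for other virtual threads.
    static void notPinning() throws InterruptedException {
        LOCK.lock();
        try {
            Thread.sleep(100);
        } finally {
            LOCK.unlock();
        }
    }

    public static void main(String[] args) throws Exception {
        Thread.startVirtualThread(() -> {
            try {
                pinning();
                notPinning();
            } catch (InterruptedException ignored) {
            }
        }).join();
    }
}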

The Netty / Loom dance

We implemented a band-aid to reduce the chance of pinning while not limiting
the users to a small set of Quarkus APIs. Remember, Quarkus is based on a
reactive core, and most of our APIs are available in two forms:

   - An imperative form, blocking the caller thread when dealing with I/O

   - A reactive form that is non-blocking (reactive programming)


To avoid thread pinning when running on a virtual thread, we offer the
possibility to use the reactive form of our APIs but block the virtual
thread while the result is being computed. These awaits do not block the
carrier thread and can be used with APIs returning zero or one result, but
also with streams (like Kafka topics), where you receive an iterator.
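
As a sketch of what this looks like in user code (the repository and the
Fortune type are hypothetical, and the import locations follow current
Quarkus versions), the reactive form returns a Mutiny Uni, and awaiting it
from the @RunOnVirtualThread method parks the virtual thread, not the
carrier:

import io.smallrye.common.annotation.RunOnVirtualThread;
import io.smallrye.mutiny.Uni;
import javax.inject.Inject;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import java.time.Duration;
import java.util.List;

@Path("/fortunes")
public class FortuneOnVirtualThreadResource {

    @Inject
    ReactiveFortuneRepository repo; // hypothetical reactive (Uni-based) client

    @GET
    @RunOnVirtualThread
    public Fortune get() {
        // The reactive call is executed by the event loop; awaiting the Uni
        // parks the virtual thread until the result is emitted.
        List<Fortune> fortunes = repo.findAll().await().atMost(Duration.ofSeconds(2));
        return fortunes.get(0);
    }
}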

As said above, this is a band-aid until we have non-pinning clients/APIs.

Under the hood, there is a dance between the virtual thread and the Netty
event loop (used by the reactive API). It introduces a few unfortunate
switches but works around the pinning issue.

Observations

Over the past year, we ran many tests to design and implement our
integration. The current approach is far from ideal, but it works fine.

We obtained excellent results when comparing with a fully reactive approach
and a worker approach (Quarkus can host the three variants in the same
application).

The response time under load is close enough to the reactive approach, and
it is far better than with the classic worker thread approach [1][2].

However (remember, we are obsessed with RSS), the RSS usage is very high,
even higher than with the worker thread approach. At the moment, we are
investigating where these extra objects come from and hope to have a better
understanding after the summer. Our observations show that the performance
penalty is likely due to memory consumption (and GC cycles). However, as
said, we are still investigating.

Ideally

For us (Quarkus) and probably several other Java frameworks based on Netty,
it would be terrific if we could find a way to reconcile the two scheduling
strategies (in a sense, we would use the event loops as carrier threads). Of
course, there will be trade-offs and limitations. Our initial attempt
didn't end well, but that does not mean it's a dead end.

An event-loop carrier thread would greatly benefit the underlying reactive
engine (Netty/Vert.x in the case of Quarkus). It retains some event-loop
execution semantics: code is multithreaded (in the virtual-thread sense)
yet executed on a single carrier thread that respects the event-loop
principles and should have decent mechanical sympathy. In addition, it
should enable using classic blocking constructs (e.g.,
java.util.concurrent.locks.Lock), whereas currently, code can only block on
Vert.x constructs (e.g., Vert.x futures but not
java.util.concurrent.locks.Lock), as Vert.x needs to be aware of the thread
suspension to schedule event dispatching in a race-free / deadlock-free
manner.

With such integration, virtual threads would be executed on the event loop.
When they "block", they would be unmounted, and I/O or another virtual
thread would be processed. That would reduce the number of switches between
threads, reduce RSS usage, and allow lots of Java frameworks to leverage
Loom virtual threads quickly. Of course, this approach can only be
validated empirically. Typically, it adds latency to every virtual thread
dispatch. In addition, watchdogs would need to be implemented to prevent
(or at least warn the user about) the execution of long CPU-intensive
actions that do not yield in an acceptable time.
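
As an illustration only, the following minimal watchdog (all names are ours,
in the spirit of Vert.x's blocked-thread checker) warns when the carrier has
not picked up a new task within a given delay, which is typically the sign
of a long CPU-bound section that never yields:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class CarrierWatchdog {

    private final AtomicLong lastProgress = new AtomicLong(System.nanoTime());
    private final long maxStallNanos;
    private final ScheduledExecutorService checker =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "carrier-watchdog");
                t.setDaemon(true);
                return t;
            });

    public CarrierWatchdog(long maxStallMillis) {
        this.maxStallNanos = TimeUnit.MILLISECONDS.toNanos(maxStallMillis);
        checker.scheduleAtFixedRate(this::check, maxStallMillis, maxStallMillis,
                TimeUnit.MILLISECONDS);
    }

    // Called by the carrier/event loop each time it picks up a new task.
    public void tick() {
        lastProgress.set(System.nanoTime());
    }

    private void check() {
        long stalled = System.nanoTime() - lastProgress.get();
        if (stalled > maxStallNanos) {
            System.err.println("Warning: carrier thread has not yielded for "
                    + TimeUnit.NANOSECONDS.toMillis(stalled) + " ms; a CPU-bound"
                    + " task may be monopolizing the event loop");
        }
    }
}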

Conclusion

Our integration of Loom virtual threads in Quarkus is already available to
our users, and we will be collecting feedback.

As explained in this email, we have identified two issues.

The first one is purely about performance, and we were able to measure it
empirically: the interaction between Loom and the Netty/Vert.x reactive
stack seems to create an abundance of data structures that put pressure on
the GC and degrade the overall performance of the application. As said
above, we are investigating.

The second one is more general and also impacts programming with Loom on
Quarkus/Vert.x. The goal is to reconcile the scheduling strategies of Loom
and Netty/Vert.x. This could improve performance by decreasing the number
of context switches (the Loom-Netty dance) and the RSS of an application.
Moreover, it would enable the use of classic blocking constructs in Vert.x
directly (i.e., without wrapping them in Vert.x's own abstractions). We
could not validate and/or characterize the performance improvement of such
a model yet. The result is unclear, as we don't know whether the decrease
in context switches would be outweighed by the additional latency of
virtual thread dispatch.

We are sharing our current successes and challenges with Loom. Please let
us know your thoughts and concerns about our approach(es). Thanks!


Clement



[1] -
https://developers.redhat.com/devnation/tech-talks/integrate-loom-quarkus

[2] - https://github.com/anavarr/fortunes_benchmark

[3] -
https://github.com/openjdk/loom/commit/cad26ce74c98e28854f02106117fe03741f69ba0