Performance Issues with Virtual Threads + ThreadLocal Caching in Third-Party Libraries (JDK 25)

Jianbin Chen jianbin at apache.org
Sat Jan 24 06:21:30 UTC 2026


Hi Francesco and Robert,

First, thank you both for your patient answers and replies; I will look
into the libraries you provided and study them. Now let me return to the
subject of this email.

When I first encountered the bottleneck with non-pooled virtual threads, it
was while designing a REST-API style proxy that translates incoming HTTP
requests into corresponding Aerospike requests (using the Aerospike
client). The relevant flow is: Netty's I/O thread parses and constructs the
HTTP request and hands it to my business handler, which submits the work to
a business executor created with Executors.newVirtualThreadPerTaskExecutor()
(i.e., a non-pooled virtual-thread executor). That is where I discovered
the problem: every HTTP request ran on a fresh virtual thread, so the
Aerospike client's 8KB byte[] buffer was allocated anew each time and never
reused. The resulting GC and CPU pressure was very high, and the overall
behavior was sometimes worse than with platform threads.
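
To make that flow concrete, here is a minimal sketch of the handoff (the
class name and method bodies are illustrative, not my actual code):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.SimpleChannelInboundHandler;
import io.netty.handler.codec.http.FullHttpRequest;

// Netty's I/O thread parses the HTTP request and hands it to this handler,
// which runs the business logic on a brand-new virtual thread per request.
// Any ThreadLocal the Aerospike client initializes on that thread (such as
// its 8KB byte[] buffer) is therefore allocated again for every request and
// discarded when the virtual thread exits.
public class ProxyHandler extends SimpleChannelInboundHandler<FullHttpRequest> {

    private final ExecutorService business = Executors.newVirtualThreadPerTaskExecutor();

    @Override
    protected void channelRead0(ChannelHandlerContext ctx, FullHttpRequest request) {
        request.retain(); // keep the request alive while the business task runs
        business.execute(() -> {
            try {
                // translate the HTTP request into the corresponding Aerospike
                // call and write the response back via ctx
            } finally {
                request.release();
            }
        });
    }
}
```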

So, I tried pooling virtual threads experimentally and found that
throughput increased noticeably. Even when concurrency exceeded my core
thread count, the use of a SynchronousQueue allowed the system to handle
traffic spikes reasonably well (although virtual threads created beyond the
core still produced unreusable 8KB byte[] allocations). Compared to a
platform thread pool, latency (RT) was smoother and CPU usage was more
stable, because when core threads were insufficient we did not pay the high
cost of creating additional platform threads that would otherwise spike CPU
and memory. In that scenario pooled virtual threads > non‑pooled virtual
threads.

After that I wrote the benchmark from my earlier email, and the test still
shows pooled virtual threads outperforming non‑pooled virtual threads.

So I have two questions:

1) When an application is developed against JDK 21+ and many of its
dependent third-party libraries use ThreadLocal as a cache/pool, is pooling
virtual threads recommended, or at least acceptable, even though it goes
against the JEP 444 recommendation?

2) Why do pooled virtual threads still perform better in my examples? If
that is the case, it looks like pooling virtual threads only violates the
principle of least surprise but otherwise has no harmful effects. Moreover,
as long as the pooled virtual threads are elastically expandable rather
than fixed, for example new ThreadPoolExecutor(200, Integer.MAX_VALUE, 0L,
TimeUnit.SECONDS, new SynchronousQueue<>(), Thread.ofVirtual().factory()),
i.e., keeping a core pool of 200 virtual threads while allowing additional
threads to be created on demand (see the snippet below), it seems to be all
upside and no downside.
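
For clarity, here is that configuration written out as a snippet; it is the
same executor2 configuration from my earlier mail in this thread, nothing new:

```java
// Elastic "pooled" virtual-thread executor: 200 core virtual threads are kept
// and reused; beyond that, extra virtual threads are created on demand
// (SynchronousQueue hand-off + unbounded maximum), so spikes are never queued.
Executor executor = new ThreadPoolExecutor(
        200,                        // core pool of reusable virtual threads
        Integer.MAX_VALUE,          // no upper bound on extra threads
        0L, TimeUnit.SECONDS,       // extra threads exit as soon as they go idle
        new SynchronousQueue<>(),   // direct hand-off, no task queue
        Thread.ofVirtual().factory());
```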

Please correct me if any of my assumptions are wrong. Thanks.

Best regards,
Jianbin Chen

On Sat, Jan 24, 2026 at 00:17, Robert Engels <rengels at ix.netcom.com> wrote:

> You might want to look at https://github.com/robaho/httpserver.
> There's some discussion on why designing systems specifically for VT is the
> way to go.
>
> Most of the high performance http servers are being rewritten specifically
> for the thread per request model since it’s far simpler, and often superior
> in performance.
>
> On Jan 23, 2026, at 9:49 AM, Jianbin Chen <jianbin at apache.org> wrote:
>
>
> Hi Peter,
>
> Thank you. I completely agree with you about the principle of least
> surprise, and I also agree that making local optimizations inside a library
> can make it hard for consumers to tune performance; it's like a library
> shipping its own pool with an imposed maximum number of virtual threads,
> which would confuse users. So I fully accept the example you gave.
>
> What I need to emphasize is that in the scenario I'm describing I am
> actually the entry point for all requests. By that I mean the thread pool I
> choose at the entry—e.g., the business thread pool that Netty hands work to
> after the I/O threads, or the request-handling pool in Tomcat—is the true
> starting point for processing every request. Given that entry-point
> threading model, wouldn't the advantages of pooling virtual threads be even
> greater, as I argued?
>
> Best Regards.
> Jianbin Chen, github-id: funky-eyes
>
> On Fri, Jan 23, 2026 at 22:33, Peter Eastham <petereastham at gmail.com> wrote:
>
>> Hey Jianbin,
>>
>> A Java library should leverage the expected defaults for executors. This
>> would be a traditional thread pool for platform threads, or a virtual thread
>> per task. This follows the principle of least surprise for the library's
>> consumers.
>> Performance-conscious libraries should allow the Executor to be provided by
>> the calling code. One example of that approach is the CompletableFuture API.
>> Alternatively, a configuration object with an executor field is another
>> approach.
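>>
>> For example, a library might accept the executor from the caller, in the
>> same spirit as CompletableFuture.supplyAsync(Supplier, Executor). A minimal,
>> illustrative sketch (the class and method names are made up):
>>
>> ```java
>> import java.util.concurrent.CompletableFuture;
>> import java.util.concurrent.Executor;
>>
>> // Illustrative only: the library never creates or pools threads itself;
>> // the calling code supplies the Executor (directly or via a config object).
>> public final class WidgetClient {
>>     private final Executor executor;
>>
>>     public WidgetClient(Executor executor) {
>>         this.executor = executor;
>>     }
>>
>>     public CompletableFuture<String> fetchWidget(String id) {
>>         // The caller decides the threading model: a platform pool, a
>>         // virtual thread per task, or anything else.
>>         return CompletableFuture.supplyAsync(() -> loadWidget(id), executor);
>>     }
>>
>>     private String loadWidget(String id) {
>>         return "widget-" + id; // stand-in for real I/O
>>     }
>> }
>> ```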
>>
>> If you locally optimize your library like this, that'll make it harder
>> for your library consumer to holistically optimize their code.
>>
>> Does that make sense?
>> -Peter
>>
>> On Fri, Jan 23, 2026 at 7:10 AM Jianbin Chen <jianbin at apache.org> wrote:
>>
>>> I'm sorry, Robert—perhaps I didn't explain my example clearly enough.
>>> Here's the code in question:
>>>
>>> ```java
>>> Executor executor2 = new ThreadPoolExecutor(
>>>     200,
>>>     Integer.MAX_VALUE,
>>>     0L,
>>>     java.util.concurrent.TimeUnit.SECONDS,
>>>     new SynchronousQueue<>(),
>>>     Thread.ofVirtual().name("test-threadpool-", 1).factory()
>>> );
>>> ```
>>>
>>> In this example, the pooled virtual threads don't implement any
>>> backpressure mechanism; they simply maintain a core pool of 200 virtual
>>> threads. Given that the queue is a `SynchronousQueue` and the maximum pool
>>> size is set to `Integer.MAX_VALUE`, once the concurrent tasks exceed 200,
>>> its behavior becomes identical to that of non-pooled virtual threads.
>>>
>>> From my perspective, this example demonstrates that the benefits of
>>> pooling virtual threads outweigh those of creating a new virtual thread for
>>> every single task. In IO-bound scenarios, the virtual threads are directly
>>> reused rather than being recreated each time, and the memory footprint of
>>> virtual threads is far smaller than that of platform threads (whose stack
>>> size is controlled by the `-Xss` flag). Additionally, with pooled virtual threads,
>>> the 8KB `byte[]` cache I mentioned earlier (stored in `ThreadLocal`) can
>>> also be reused, which further reduces overall memory usage—wouldn't you
>>> agree?
>>>
>>> Best Regards.
>>> Jianbin Chen, github-id: funky-eyes
>>>
>>> On Fri, Jan 23, 2026 at 21:52, Robert Engels <rengels at ix.netcom.com> wrote:
>>>
>>>> Because VTs are so efficient to create, without any back pressure they
>>>> will all be created and running at essentially the same time (dramatically
>>>> raising the amount of memory in use), versus with a pool of size N you
>>>> will have at most N running at once. In a REAL WORLD application there are
>>>> often external limiters (like the number of TCP connections) that provide a
>>>> limit.
>>>>
>>>> If your tasks are purely cpu bound you should probably be using a
>>>> capped thread pool of platform threads as it makes no sense to have more
>>>> threads than available cores.
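>>>>
>>>> For example (one way to cap it, purely illustrative):
>>>>
>>>> ```java
>>>> // Platform-thread pool sized to the available cores for CPU-bound work.
>>>> ExecutorService cpuPool =
>>>>     Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
>>>> ```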
>>>>
>>>>
>>>>
>>>> On Jan 23, 2026, at 7:42 AM, Jianbin Chen <jianbin at apache.org> wrote:
>>>>
>>>>
>>>> The question is why I need to use a semaphore to control the number of
>>>> concurrently running tasks. In my particular example, the goal is simply to
>>>> keep the concurrency level the same across different thread pool
>>>> implementations so I can fairly compare which one completes all the tasks
>>>> faster. This isn't solely about memory consumption—purely from a
>>>> **performance** perspective (e.g., total throughput or wall-clock time to
>>>> finish the workload), the same number of concurrent tasks completes
>>>> noticeably faster when using pooled virtual threads.
>>>>
>>>> My email probably didn't explain this clearly enough. In reality, I
>>>> have two main questions:
>>>>
>>>> 1. When a third-party library uses `ThreadLocal` as a cache/pool (e.g.,
>>>> to hold expensive reusable objects like connections, formatters, or
>>>> parsers), is switching to a **pooled virtual thread executor** the only
>>>> viable solution—assuming we cannot modify the third-party library code?
>>>>
>>>> 2. Why, when running the exact same number of concurrent tasks, do pooled
>>>> virtual threads deliver better performance?
>>>>
>>>> Both questions point toward the same conclusion: for an application
>>>> originally built around a traditional platform thread pool, after upgrading
>>>> to JDK 21/25, moving to a **pooled virtual threads** approach is generally
>>>> superior to simply using non-pooled (unbounded) virtual threads.
>>>>
>>>> If any part of this reasoning or conclusion is mistaken, I would really
>>>> appreciate being corrected — thank you very much in advance for any
>>>> feedback or different experiences you can share!
>>>>
>>>> Best Regards.
>>>> Jianbin Chen, github-id: funky-eyes
>>>>
>>>> On Fri, Jan 23, 2026 at 20:58, robert engels <robaho at me.com> wrote:
>>>>
>>>>> Exactly, this is your problem. In the thread-per-task model, all of the
>>>>> tasks will be running at once.
>>>>>
>>>>> On Jan 23, 2026, at 6:49 AM, Jianbin Chen <jianbin at apache.org> wrote:
>>>>>
>>>>>
>>>>> Hi Robert,
>>>>>
>>>>> Thank you, but I'm a bit confused. In the example above, I only set
>>>>> the core pool size to 200 virtual threads, but for the specific test case
>>>>> we’re talking about, the concurrency isn’t actually being limited by the
>>>>> pool size at all. Since the maximum thread count is Integer.MAX_VALUE and
>>>>> it’s using a SynchronousQueue, tasks are handed off immediately and a new
>>>>> thread gets created to run them right away anyway.
>>>>>
>>>>> Best Regards.
>>>>> Jianbin Chen, github-id: funky-eyes
>>>>>
>>>>> On Fri, Jan 23, 2026 at 20:28, robert engels <robaho at me.com> wrote:
>>>>>
>>>>>> Try using a semaphore to limit the maximum number of tasks in
>>>>>> progress at any one time - that is what is causing your memory spike. Think
>>>>>> of it this way: since VT threads are so cheap to create, you are
>>>>>> essentially creating them all at once, making the working set size equal
>>>>>> to the maximum. So you have N * WSS, whereas in the other you have
>>>>>> POOLSIZE * WSS.
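>>>>>>
>>>>>> Something like this, roughly (the 200 is illustrative, matching your core
>>>>>> size; it is not anyone's production code):
>>>>>>
>>>>>> ```java
>>>>>> import java.util.concurrent.ExecutorService;
>>>>>> import java.util.concurrent.Executors;
>>>>>> import java.util.concurrent.Semaphore;
>>>>>>
>>>>>> // Bound the number of in-flight tasks to 200 while still using one
>>>>>> // (unpooled) virtual thread per task.
>>>>>> public class SemaphoreLimitDemo {
>>>>>>     public static void main(String[] args) throws InterruptedException {
>>>>>>         Semaphore inFlight = new Semaphore(200);
>>>>>>         try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
>>>>>>             for (int i = 0; i < 50_000; i++) {
>>>>>>                 inFlight.acquire();  // blocks submission once 200 tasks are running
>>>>>>                 executor.execute(() -> {
>>>>>>                     try {
>>>>>>                         Thread.sleep(100);  // simulated I/O wait
>>>>>>                     } catch (InterruptedException e) {
>>>>>>                         Thread.currentThread().interrupt();
>>>>>>                     } finally {
>>>>>>                         inFlight.release();
>>>>>>                     }
>>>>>>                 });
>>>>>>             }
>>>>>>         } // close() waits for the remaining tasks to finish
>>>>>>     }
>>>>>> }
>>>>>> ```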
>>>>>>
>>>>>> On Jan 23, 2026, at 4:14 AM, Jianbin Chen <jianbin at apache.org> wrote:
>>>>>>
>>>>>>
>>>>>> Hi Alan,
>>>>>>
>>>>>> Thanks for your reply and for mentioning JEP 444.
>>>>>> I’ve gone through the guidance in JEP 444 and have some understanding
>>>>>> of it — which is exactly why I’m feeling a bit puzzled in practice and
>>>>>> would really like to hear your thoughts.
>>>>>>
>>>>>> Background — ThreadLocal example (Aerospike)
>>>>>> ```java
>>>>>> private static final ThreadLocal<byte[]> BufferThreadLocal = new
>>>>>> ThreadLocal<byte[]>() {
>>>>>>     @Override
>>>>>>     protected byte[] initialValue() {
>>>>>>         return new byte[DefaultBufferSize];
>>>>>>     }
>>>>>> };
>>>>>> ```
>>>>>> This Aerospike code allocates a default 8KB byte[] the first time each
>>>>>> thread uses the buffer and stores it in a ThreadLocal for per-thread caching.
>>>>>>
>>>>>> My concern
>>>>>> - With a traditional platform-thread pool, those ThreadLocal byte[]
>>>>>> instances are effectively reused because threads are long-lived and pooled.
>>>>>> - If we switch to creating a brand-new virtual thread per task (no
>>>>>> pooling), each virtual thread gets its own fresh ThreadLocal byte[], which
>>>>>> leads to many short-lived 8KB allocations (see the small demo below).
>>>>>> - That raises allocation rate and GC pressure (despite collectors
>>>>>> like ZGC), because ThreadLocal caches aren’t reused when threads are
>>>>>> ephemeral.
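>>>>>>
>>>>>> Here is a tiny demo of that middle point (illustrative code using the same
>>>>>> ThreadLocal pattern, not Aerospike's code):
>>>>>>
>>>>>> ```java
>>>>>> import java.util.concurrent.ExecutorService;
>>>>>> import java.util.concurrent.Executors;
>>>>>>
>>>>>> // With one virtual thread per task, initialValue() runs for every task,
>>>>>> // so each task gets a brand-new 8KB buffer that is thrown away afterwards.
>>>>>> public class ThreadLocalReuseDemo {
>>>>>>     private static final ThreadLocal<byte[]> BUFFER =
>>>>>>             ThreadLocal.withInitial(() -> new byte[8192]);
>>>>>>
>>>>>>     public static void main(String[] args) {
>>>>>>         try (ExecutorService perTask = Executors.newVirtualThreadPerTaskExecutor()) {
>>>>>>             for (int i = 0; i < 5; i++) {
>>>>>>                 // Typically prints 5 different identity hash codes:
>>>>>>                 // no buffer is ever reused across tasks.
>>>>>>                 perTask.execute(() ->
>>>>>>                         System.out.println(System.identityHashCode(BUFFER.get())));
>>>>>>             }
>>>>>>         }
>>>>>>         // A pooled executor with N core threads would show at most N distinct
>>>>>>         // values, because the buffers stay attached to long-lived threads.
>>>>>>     }
>>>>>> }
>>>>>> ```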
>>>>>>
>>>>>> So my question is: for applications originally designed around
>>>>>> platform-thread pools, wouldn’t partially pooling virtual threads be
>>>>>> beneficial? For example, Tomcat’s default max threads is 200 — if I keep a
>>>>>> pool of 200 virtual threads, then when load exceeds that core size, a
>>>>>> SynchronousQueue will naturally cause new virtual threads to be created on
>>>>>> demand. This seems to preserve the behavior that ThreadLocal-based
>>>>>> libraries expect, without losing the ability to expand under spikes. Since
>>>>>> virtual threads are very lightweight, pooling a reasonable number (e.g.,
>>>>>> 200) seems to have negligible memory downside while retaining ThreadLocal
>>>>>> cache effectiveness.
>>>>>>
>>>>>> Empirical test I ran
>>>>>> (I ran a microbenchmark comparing an unpooled per-task virtual-thread
>>>>>> executor and a ThreadPoolExecutor that keeps 200 core virtual threads.)
>>>>>>
>>>>>> ```java
>>>>>> public static void main(String[] args) throws InterruptedException {
>>>>>>     Executor executor =
>>>>>> Executors.newThreadPerTaskExecutor(Thread.ofVirtual().name("test-",
>>>>>> 1).factory());
>>>>>>     Executor executor2 = new ThreadPoolExecutor(
>>>>>>         200,
>>>>>>         Integer.MAX_VALUE,
>>>>>>         0L,
>>>>>>         java.util.concurrent.TimeUnit.SECONDS,
>>>>>>         new SynchronousQueue<>(),
>>>>>>         Thread.ofVirtual().name("test-threadpool-", 1).factory()
>>>>>>     );
>>>>>>
>>>>>>     // Warm-up
>>>>>>     for (int i = 0; i < 10100; i++) {
>>>>>>         executor.execute(() -> {
>>>>>>             // simulate I/O wait
>>>>>>             try { Thread.sleep(100); } catch (InterruptedException e)
>>>>>> { throw new RuntimeException(e); }
>>>>>>         });
>>>>>>         executor2.execute(() -> {
>>>>>>             // simulate I/O wait
>>>>>>             try { Thread.sleep(100); } catch (InterruptedException e)
>>>>>> { throw new RuntimeException(e); }
>>>>>>         });
>>>>>>     }
>>>>>>
>>>>>>     // Ensure JIT + warm-up complete
>>>>>>     Thread.sleep(5000);
>>>>>>
>>>>>>     long start = System.currentTimeMillis();
>>>>>>     CountDownLatch countDownLatch = new CountDownLatch(50000);
>>>>>>     for (int i = 0; i < 50000; i++) {
>>>>>>         executor.execute(() -> {
>>>>>>             try { Thread.sleep(100); countDownLatch.countDown(); }
>>>>>> catch (InterruptedException e) { throw new RuntimeException(e); }
>>>>>>         });
>>>>>>     }
>>>>>>     countDownLatch.await();
>>>>>>     System.out.println("thread time: " + (System.currentTimeMillis()
>>>>>> - start) + " ms");
>>>>>>
>>>>>>     start = System.currentTimeMillis();
>>>>>>     CountDownLatch countDownLatch2 = new CountDownLatch(50000);
>>>>>>     for (int i = 0; i < 50000; i++) {
>>>>>>         executor2.execute(() -> {
>>>>>>             try { Thread.sleep(100); countDownLatch2.countDown(); }
>>>>>> catch (InterruptedException e) { throw new RuntimeException(e); }
>>>>>>         });
>>>>>>     }
>>>>>>     countDownLatch2.await();
>>>>>>     System.out.println("thread pool time: " +
>>>>>> (System.currentTimeMillis() - start) + " ms");
>>>>>> }
>>>>>> ```
>>>>>>
>>>>>> Result summary
>>>>>> - In my runs, the pooled virtual-thread executor (executor2)
>>>>>> performed better than the unpooled per-task virtual-thread executor.
>>>>>> - Even when I increased load by 10x or 100x, the pooled
>>>>>> virtual-thread executor still showed better performance.
>>>>>> - In realistic workloads, it seems pooling some virtual threads
>>>>>> reduces allocation/GC overhead and improves throughput compared to strictly
>>>>>> unpooled virtual threads.
>>>>>>
>>>>>> Final thought / request for feedback
>>>>>> - From my perspective, for systems originally tuned for
>>>>>> platform-thread pools, partially pooling virtual threads seems to have no
>>>>>> obvious downside and can restore ThreadLocal cache effectiveness used by
>>>>>> many third-party libraries.
>>>>>> - If I’ve misunderstood JEP 444 recommendations, virtual-thread
>>>>>> semantics, or ThreadLocal behavior, please point out what I’m missing. I’d
>>>>>> appreciate your guidance.
>>>>>>
>>>>>> Best Regards.
>>>>>> Jianbin Chen, github-id: funky-eyes
>>>>>>
>>>>>> On Fri, Jan 23, 2026 at 17:27, Alan Bateman <alan.bateman at oracle.com> wrote:
>>>>>>
>>>>>>> On 23/01/2026 07:30, Jianbin Chen wrote:
>>>>>>> > :
>>>>>>> >
>>>>>>> > So my question is:
>>>>>>> >
>>>>>>> > **In scenarios where third-party libraries heavily rely on ThreadLocal
>>>>>>> > for caching / buffering (and we cannot change those libraries to use
>>>>>>> > object pools instead), is explicitly pooling virtual threads (using a
>>>>>>> > ThreadPoolExecutor with virtual thread factory) considered a
>>>>>>> > recommended / acceptable workaround?**
>>>>>>> >
>>>>>>> > Or are there better / more idiomatic ways to handle this kind of
>>>>>>> > compatibility issue with legacy ThreadLocal-based libraries when
>>>>>>> > migrating to virtual threads?
>>>>>>> >
>>>>>>> > I have already opened a related discussion in the Dubbo project (since
>>>>>>> > Dubbo is one of the libraries affected in our stack):
>>>>>>> >
>>>>>>> > https://github.com/apache/dubbo/issues/16042
>>>>>>> >
>>>>>>> > Would love to hear your thoughts — especially from people who have
>>>>>>> > experience running large-scale virtual-thread-based services with
>>>>>>> > mixed third-party dependencies.
>>>>>>> >
>>>>>>>
>>>>>>> The guidance that we put in JEP 444 [1] is to not pool virtual threads
>>>>>>> and to avoid caching costly resources in thread locals. Virtual threads
>>>>>>> support thread locals of course, but that is not useful when some library
>>>>>>> is looking to share a costly resource between tasks that run on the same
>>>>>>> thread in a thread pool.
>>>>>>>
>>>>>>> I don't know anything about Aerospike but working with the maintainers
>>>>>>> of that library to re-work its buffer management seems like the right
>>>>>>> course of action here. Your mail says "byte buffers". If this is
>>>>>>> ByteBuffer it might be that they are caching direct buffers as they are
>>>>>>> expensive to create (and managed by the GC). Maybe they could look at
>>>>>>> using MemorySegment (it's easy to get a ByteBuffer view of a memory
>>>>>>> segment) and allocate from an arena that better matches the lifecycle.
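>>>>>>>
>>>>>>> A rough sketch of that direction (illustrative, not Aerospike-specific):
>>>>>>>
>>>>>>> ```java
>>>>>>> import java.lang.foreign.Arena;
>>>>>>> import java.lang.foreign.MemorySegment;
>>>>>>> import java.nio.ByteBuffer;
>>>>>>>
>>>>>>> // Allocate the 8KB buffer from a confined arena scoped to the request and
>>>>>>> // give code that expects a ByteBuffer a view of the segment. The memory is
>>>>>>> // released deterministically when the arena closes, instead of living in a
>>>>>>> // ThreadLocal tied to the thread's lifetime.
>>>>>>> try (Arena arena = Arena.ofConfined()) {
>>>>>>>     MemorySegment segment = arena.allocate(8192);
>>>>>>>     ByteBuffer buffer = segment.asByteBuffer();
>>>>>>>     // ... use buffer for this request ...
>>>>>>> } // memory freed here
>>>>>>> ```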
>>>>>>>
>>>>>>> Hopefully others will share their experiences with migration as it is
>>>>>>> indeed challenging to migrate code developed for thread pools to work
>>>>>>> efficiently on virtual threads, where there is a 1-1 relationship between
>>>>>>> the task to execute and the thread.
>>>>>>>
>>>>>>> -Alan
>>>>>>>
>>>>>>> [1] https://openjdk.org/jeps/444#Thread-local-variables
>>>>>>>
>>>>>>