Performance Issues with Virtual Threads + ThreadLocal Caching in Third-Party Libraries (JDK 25)

Fri Jan 23 13:42:20 UTC 2026

The question is why I need to use a semaphore to control the number of
concurrently running tasks. In my particular example, the goal is simply to
keep the concurrency level the same across different thread pool
implementations so I can fairly compare which one completes all the tasks
faster. This isn't solely about memory consumption—purely from a
**performance** perspective (e.g., total throughput or wall-clock time to
finish the workload), the same number of concurrent tasks completes
noticeably faster when using pooled virtual threads.
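
For readers following along, here is a minimal sketch of the semaphore approach under discussion: one new virtual thread per task, with a `Semaphore` capping how many task bodies run at once. The class name `BoundedVirtualThreads`, the method `runBounded`, and the numbers are hypothetical and only for illustration; this is not the code from my benchmark.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; names and numbers are illustrative only.
class BoundedVirtualThreads {

    /** Runs taskCount sleeping tasks, at most maxConcurrent at a time, and returns the peak concurrency observed. */
    static int runBounded(int taskCount, int maxConcurrent, long sleepMillis) {
        Semaphore permits = new Semaphore(maxConcurrent);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();

        // One new virtual thread per task; close() waits for all submitted tasks to finish.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < taskCount; i++) {
                executor.execute(() -> {
                    try {
                        permits.acquire(); // parks this virtual thread until a permit is free
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    try {
                        int now = inFlight.incrementAndGet();
                        peak.accumulateAndGet(now, Math::max); // record the highest concurrency seen
                        Thread.sleep(sleepMillis);             // simulate I/O wait
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        inFlight.decrementAndGet();
                        permits.release();
                    }
                });
            }
        }
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("peak concurrency: " + runBounded(2_000, 200, 10));
    }
}
```

Note that this bounds how many tasks are in progress at once, but every task still runs on its own fresh virtual thread, so a `ThreadLocal` cache touched inside the task is still initialized once per task.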

My email probably didn't explain this clearly enough. In reality, I have
two main questions:

1. When a third-party library uses `ThreadLocal` as a cache/pool (e.g., to
hold expensive reusable objects like connections, formatters, or parsers),
is switching to a **pooled virtual thread executor** the only viable
solution—assuming we cannot modify the third-party library code?

2. When running the exact same number of concurrent tasks, pooled virtual
threads deliver better performance in my tests. Is that the expected result?
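
To make the first question concrete, the sketch below counts how often a `ThreadLocal` initializer runs under each model. It is an illustration under stated assumptions, not the third-party library's code: the class `ThreadLocalInitCount`, the method `countInitializations`, and the use of `Executors.newFixedThreadPool` with a virtual-thread factory are all choices made for this example.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; it only mimics a library that caches a buffer per thread.
class ThreadLocalInitCount {
    static final int BUFFER_SIZE = 8 * 1024; // mirrors the 8KB default mentioned in this thread

    /** Submits taskCount tasks that each touch a ThreadLocal buffer; returns how many buffers were allocated. */
    static int countInitializations(ExecutorService executor, int taskCount) {
        AtomicInteger initializations = new AtomicInteger();
        ThreadLocal<byte[]> buffer = ThreadLocal.withInitial(() -> {
            initializations.incrementAndGet(); // runs once per thread, on that thread's first get()
            return new byte[BUFFER_SIZE];
        });
        // close() waits for all submitted tasks to complete before returning.
        try (executor) {
            for (int i = 0; i < taskCount; i++) {
                executor.execute(() -> buffer.get()[0] = 1);
            }
        }
        return initializations.get();
    }

    public static void main(String[] args) {
        int tasks = 10_000;
        int perTask = countInitializations(Executors.newVirtualThreadPerTaskExecutor(), tasks);
        int pooled = countInitializations(
                Executors.newFixedThreadPool(200, Thread.ofVirtual().factory()), tasks);
        System.out.println("per-task virtual threads: " + perTask + " buffers allocated");
        System.out.println("200 pooled virtual threads: " + pooled + " buffers allocated");
    }
}
```

With one thread per task, the number of allocations equals the number of tasks; with a pool it is bounded by the pool size. That difference is the allocation cost I am asking about.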

Both questions point toward the same conclusion: for an application
originally built around a traditional platform thread pool, after upgrading
to JDK 21/25, moving to a **pooled virtual threads** approach is generally
superior to simply using non-pooled (unbounded) virtual threads.

If any part of this reasoning or conclusion is mistaken, I would really
appreciate being corrected — thank you very much in advance for any
feedback or different experiences you can share!

Best Regards.
Jianbin Chen, github-id: funky-eyes

robert engels <robaho at me.com> wrote on Fri, Jan 23, 2026 at 20:58:

> Exactly, this is your problem. The total number of tasks will all be
> running at once in the thread per task model.
>
> On Jan 23, 2026, at 6:49 AM, Jianbin Chen <jianbin at apache.org> wrote:
>
> 
> Hi Robert,
>
> Thank you, but I'm a bit confused. In the example above, I only set the
> core pool size to 200 virtual threads, but for the specific test case we’re
> talking about, the concurrency isn’t actually being limited by the pool
> size at all. Since the maximum thread count is Integer.MAX_VALUE and it’s
> using a SynchronousQueue, tasks are handed off immediately and a new thread
> gets created to run them right away anyway.
>
> Best Regards.
> Jianbin Chen, github-id: funky-eyes
>
> robert engels <robaho at me.com> wrote on Fri, Jan 23, 2026 at 20:28:
>
>> Try using a semaphore to limit the maximum number of tasks in progress at
>> any one time - that is what is causing your memory spike. Think of it this
>> way: since virtual threads are so cheap to create, you are essentially
>> creating them all at once, making the working set size equal to the
>> maximum. So you have N * WSS (N tasks, each with working set size WSS),
>> whereas with the pool you have POOLSIZE * WSS.
>>
>> On Jan 23, 2026, at 4:14 AM, Jianbin Chen <jianbin at apache.org> wrote:
>>
>> 
>> Hi Alan,
>>
>> Thanks for your reply and for mentioning JEP 444.
>> I’ve gone through the guidance in JEP 444 and have some understanding of
>> it — which is exactly why I’m feeling a bit puzzled in practice and would
>> really like to hear your thoughts.
>>
>> Background — ThreadLocal example (Aerospike)
>> ```java
>> private static final ThreadLocal<byte[]> BufferThreadLocal = new ThreadLocal<byte[]>() {
>>     @Override
>>     protected byte[] initialValue() {
>>         return new byte[DefaultBufferSize];
>>     }
>> };
>> ```
>> This Aerospike code allocates a default 8KB byte[] the first time each
>> thread reads the ThreadLocal and keeps it there for per-thread caching.
>>
>> My concern
>> - With a traditional platform-thread pool, those ThreadLocal byte[]
>> instances are effectively reused because threads are long-lived and pooled.
>> - If we switch to creating a brand-new virtual thread per task (no
>> pooling), each virtual thread gets its own fresh ThreadLocal byte[], which
>> leads to many short-lived 8KB allocations.
>> - That raises allocation rate and GC pressure (despite collectors like
>> ZGC), because ThreadLocal caches aren’t reused when threads are ephemeral.
>>
>> So my question is: for applications originally designed around
>> platform-thread pools, wouldn’t partially pooling virtual threads be
>> beneficial? For example, Tomcat’s default max threads is 200 — if I keep a
>> pool of 200 virtual threads, then when load exceeds that core size, a
>> SynchronousQueue will naturally cause new virtual threads to be created on
>> demand. This seems to preserve the behavior that ThreadLocal-based
>> libraries expect, without losing the ability to expand under spikes. Since
>> virtual threads are very lightweight, pooling a reasonable number (e.g.,
>> 200) seems to have negligible memory downside while retaining ThreadLocal
>> cache effectiveness.
>>
>> Empirical test I ran
>> (I ran a microbenchmark comparing an unpooled per-task virtual-thread
>> executor and a ThreadPoolExecutor that keeps 200 core virtual threads.)
>>
>> ```java
>> import java.util.concurrent.CountDownLatch;
>> import java.util.concurrent.Executor;
>> import java.util.concurrent.Executors;
>> import java.util.concurrent.SynchronousQueue;
>> import java.util.concurrent.ThreadPoolExecutor;
>> import java.util.concurrent.TimeUnit;
>>
>> public static void main(String[] args) throws InterruptedException {
>>     // Unpooled: one new virtual thread per task
>>     Executor executor = Executors.newThreadPerTaskExecutor(
>>         Thread.ofVirtual().name("test-", 1).factory());
>>     // Pooled: 200 core virtual threads; the SynchronousQueue hands each
>>     // extra task to a newly created thread
>>     Executor executor2 = new ThreadPoolExecutor(
>>         200,
>>         Integer.MAX_VALUE,
>>         0L,
>>         TimeUnit.SECONDS,
>>         new SynchronousQueue<>(),
>>         Thread.ofVirtual().name("test-threadpool-", 1).factory()
>>     );
>>
>>     // Warm-up
>>     for (int i = 0; i < 10100; i++) {
>>         executor.execute(() -> {
>>             // simulate I/O wait
>>             try {
>>                 Thread.sleep(100);
>>             } catch (InterruptedException e) {
>>                 throw new RuntimeException(e);
>>             }
>>         });
>>         executor2.execute(() -> {
>>             // simulate I/O wait
>>             try {
>>                 Thread.sleep(100);
>>             } catch (InterruptedException e) {
>>                 throw new RuntimeException(e);
>>             }
>>         });
>>     }
>>
>>     // Ensure JIT + warm-up complete
>>     Thread.sleep(5000);
>>
>>     // Timed run 1: unpooled per-task virtual threads
>>     long start = System.currentTimeMillis();
>>     CountDownLatch countDownLatch = new CountDownLatch(50000);
>>     for (int i = 0; i < 50000; i++) {
>>         executor.execute(() -> {
>>             try {
>>                 Thread.sleep(100);
>>                 countDownLatch.countDown();
>>             } catch (InterruptedException e) {
>>                 throw new RuntimeException(e);
>>             }
>>         });
>>     }
>>     countDownLatch.await();
>>     System.out.println("thread time: "
>>         + (System.currentTimeMillis() - start) + " ms");
>>
>>     // Timed run 2: pooled virtual threads
>>     start = System.currentTimeMillis();
>>     CountDownLatch countDownLatch2 = new CountDownLatch(50000);
>>     for (int i = 0; i < 50000; i++) {
>>         executor2.execute(() -> {
>>             try {
>>                 Thread.sleep(100);
>>                 countDownLatch2.countDown();
>>             } catch (InterruptedException e) {
>>                 throw new RuntimeException(e);
>>             }
>>         });
>>     }
>>     // Must await countDownLatch2: the first latch is already at zero, so
>>     // awaiting it returns immediately and times only the submission loop
>>     countDownLatch2.await();
>>     System.out.println("thread pool time: "
>>         + (System.currentTimeMillis() - start) + " ms");
>> }
>> ```
>>
>> Result summary
>> - In my runs, the pooled virtual-thread executor (executor2) performed
>> better than the unpooled per-task virtual-thread executor.
>> - Even when I increased load by 10x or 100x, the pooled virtual-thread
>> executor still showed better performance.
>> - In realistic workloads, it seems pooling some virtual threads reduces
>> allocation/GC overhead and improves throughput compared to strictly
>> unpooled virtual threads.
>>
>> Final thought / request for feedback
>> - From my perspective, for systems originally tuned for platform-thread
>> pools, partially pooling virtual threads seems to have no obvious downside
>> and can restore ThreadLocal cache effectiveness used by many third-party
>> libraries.
>> - If I’ve misunderstood JEP 444 recommendations, virtual-thread
>> semantics, or ThreadLocal behavior, please point out what I’m missing. I’d
>> appreciate your guidance.
>>
>> Best Regards.
>> Jianbin Chen, github-id: funky-eyes
>>
>> Alan Bateman <alan.bateman at oracle.com> wrote on Fri, Jan 23, 2026 at 17:27:
>>
>>> On 23/01/2026 07:30, Jianbin Chen wrote:
>>> > :
>>> >
>>> > So my question is:
>>> >
>>> > **In scenarios where third-party libraries heavily rely on ThreadLocal
>>> > for caching / buffering (and we cannot change those libraries to use
>>> > object pools instead), is explicitly pooling virtual threads (using a
>>> > ThreadPoolExecutor with virtual thread factory) considered a
>>> > recommended / acceptable workaround?**
>>> >
>>> > Or are there better / more idiomatic ways to handle this kind of
>>> > compatibility issue with legacy ThreadLocal-based libraries when
>>> > migrating to virtual threads?
>>> >
>>> > I have already opened a related discussion in the Dubbo project (since
>>> > Dubbo is one of the libraries affected in our stack):
>>> >
>>> > https://github.com/apache/dubbo/issues/16042
>>> >
>>> > Would love to hear your thoughts — especially from people who have
>>> > experience running large-scale virtual-thread-based services with
>>> > mixed third-party dependencies.
>>> >
>>>
>>> The guidelines that we put in JEP 444 [1] are to not pool virtual threads
>>> and to avoid caching costly resources in thread locals. Virtual threads
>>> support thread locals of course but that is not useful when some library
>>> is looking to share a costly resource between tasks that run on the same
>>> thread in a thread pool.
>>>
>>> I don't know anything about Aerospike but working with the maintainers
>>> of that library to re-work its buffer management seems like the right
>>> course of action here. Your mail says "byte buffers". If this is
>>> ByteBuffer it might be that they are caching direct buffers as they are
>>> expensive to create (and managed by the GC). Maybe they could look at
>>> using MemorySegment (it's easy to get a ByteBuffer view of a memory
>>> segment) and allocate from an arena that better matches the lifecycle.
>>>
>>> Hopefully others will share their experiences with migration as it is
>>> indeed challenging to migrate code developed for thread pools to work
>>> efficiently on virtual threads where there is a 1-1 relationship between
>>> the task to execute and the thread.
>>>
>>> -Alan
>>>
>>> [1] https://openjdk.org/jeps/444#Thread-local-variables
>>>
>>
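
Alan's suggestion in the quoted reply above (allocate from an arena whose lifetime matches the task, and take a ByteBuffer view of the MemorySegment) might look roughly like the sketch below. It assumes the finalized `java.lang.foreign` API (JDK 22 or later); the class `ArenaScopedBuffer`, the method `handleRequest`, and the 8KB size are hypothetical, and this is not how Aerospike manages its buffers.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.ByteBuffer;

// Hypothetical demo class; a sketch of task-scoped buffer allocation, not a library API.
class ArenaScopedBuffer {
    static final int BUFFER_SIZE = 8 * 1024; // illustrative size, matching the 8KB figure in this thread

    /** Copies payload into a buffer that lives only for this call; returns the number of bytes staged. */
    static int handleRequest(byte[] payload) {
        // A confined arena is owned by the current thread and frees its memory when closed.
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment segment = arena.allocate(BUFFER_SIZE); // off-heap, zero-initialized
            ByteBuffer buffer = segment.asByteBuffer();          // ByteBuffer view over the same memory
            buffer.put(payload);                                 // throws if payload exceeds BUFFER_SIZE
            buffer.flip();
            return buffer.remaining();
        } // the segment and its ByteBuffer view become inaccessible here
    }

    public static void main(String[] args) {
        System.out.println("bytes staged: " + handleRequest(new byte[] {1, 2, 3}));
    }
}
```

The buffer's lifetime is tied to the task rather than to the thread, so it does not depend on threads being reused.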