Performance Issues with Virtual Threads + ThreadLocal Caching in Third-Party Libraries (JDK 25)

Fri Jan 23 15:29:48 UTC 2026

Hi Francesco,

Is there a similar issue in JDK 21 as well?

Best Regards.
Jianbin Chen, github-id: funky-eyes

On Fri, Jan 23, 2026 at 23:14, Francesco Nigro <nigro.fra at gmail.com> wrote:

> In the original code snippet I see virtual threads named with a counter, so
> be aware of https://bugs.openjdk.org/browse/JDK-8372410
>
> On Fri, Jan 23, 2026 at 15:52, Jianbin Chen <jianbin at apache.org> wrote:
>
>> I'm sorry — I forgot to mention the machine I used for the load test. My
>> server is 2 cores and 4 GB RAM, and the JVM heap was set to 2880m. Under my
>> test load (about 20,000 QPS), with non‑pooled virtual threads you generate
>> at least 20,000 × 8 KB = ~156 MB of byte[] allocations per second just from
>> that 8 KB buffer; that doesn't include other object allocations. With a
>> 2880 MB heap this allocation rate already forces very frequent GC, and
>> frequent GC raises CPU usage, which in turn significantly increases average
>> response time and p99 / p999 latency.
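The allocation-rate arithmetic above can be checked with a short sketch. The inputs (20,000 requests per second, one fresh 8 KB buffer per request) are the figures from the message; the class and method names are hypothetical and exist only for this illustration.

```java
// Back-of-the-envelope check of the allocation rate quoted above.
// AllocationRate and its methods are illustrative names, not part of any library.
final class AllocationRate {

    /** Bytes allocated per second if every task allocates one fresh buffer. */
    static long bytesPerSecond(long tasksPerSecond, long bufferBytes) {
        return tasksPerSecond * bufferBytes;
    }

    /** Converts a byte count to MiB (1 MiB = 1024 * 1024 bytes). */
    static double toMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long rate = bytesPerSecond(20_000, 8 * 1024);
        System.out.println(toMiB(rate) + " MiB/s from the 8 KB buffers alone");
    }
}
```

The same helper applied to the 2M simultaneous tasks mentioned further down the thread gives about 16.4 GB (roughly 15.3 GiB), which matches the "16gb" figure quoted there.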
>>
>> Pooling is usually introduced to solve performance issues — object pools
>> and connection pools exist to quickly reuse cached resources and improve
>> performance. So pooling virtual threads also yields obvious benefits,
>> especially for memory‑constrained, I/O‑bound applications (gateways,
>> proxies, etc.) that are sensitive to latency.
>>
>> Best Regards.
>> Jianbin Chen, github-id: funky-eyes
>>
>> On Fri, Jan 23, 2026 at 22:20, Robert Engels <rengels at ix.netcom.com> wrote:
>>
>>> I understand. I was trying to explain how you can avoid thread locals and
>>> still maintain performance. Allocating an 8 KB buffer is unlikely to be a
>>> performance bottleneck in a real program if the task is not CPU-bound
>>> (depending on how granular you make your tasks), but 2M tasks running
>>> simultaneously would require 16 GB of memory, not including the stacks.
>>>
>>> You cannot simply use the thread per task model without an understanding
>>> of the cpu, IO, and memory footprints of your tasks and then configure
>>> appropriately.
>>>
>>> On Jan 23, 2026, at 8:10 AM, Jianbin Chen <jianbin at apache.org> wrote:
>>>
>>> 
>>> I'm sorry, Robert—perhaps I didn't explain my example clearly enough.
>>> Here's the code in question:
>>>
>>> ```java
>>> Executor executor2 = new ThreadPoolExecutor(
>>>     200,
>>>     Integer.MAX_VALUE,
>>>     0L,
>>>     java.util.concurrent.TimeUnit.SECONDS,
>>>     new SynchronousQueue<>(),
>>>     Thread.ofVirtual().name("test-threadpool-", 1).factory()
>>> );
>>> ```
>>>
>>> In this example, the pooled virtual threads don't implement any
>>> backpressure mechanism; they simply maintain a core pool of 200 virtual
>>> threads. Given that the queue is a `SynchronousQueue` and the maximum pool
>>> size is set to `Integer.MAX_VALUE`, once the concurrent tasks exceed 200,
>>> its behavior becomes identical to that of non-pooled virtual threads.
>>>
>>> From my perspective, this example demonstrates that the benefits of
>>> pooling virtual threads outweigh those of creating a new virtual thread for
>>> every single task. In IO-bound scenarios, the virtual threads are directly
>>> reused rather than being recreated each time, and the memory footprint of
>>> virtual threads is far smaller than that of platform threads (which are
>>> controlled by the `-Xss` flag). Additionally, with pooled virtual threads,
>>> the 8KB `byte[]` cache I mentioned earlier (stored in `ThreadLocal`) can
>>> also be reused, which further reduces overall memory usage—wouldn't you
>>> agree?
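The growth behavior described above (a core pool plus a `SynchronousQueue` and an effectively unlimited maximum) can be observed directly. The sketch below is only an illustration under assumptions: it uses a small core size instead of 200, unnamed virtual threads, and tasks that block on a latch so that no worker is idle; `PoolGrowthDemo` and `poolSizeUnderLoad` are hypothetical names.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Illustrative only: shows that the pool grows past its core size when every
// worker is busy and the queue is a SynchronousQueue.
final class PoolGrowthDemo {

    /** Submits `tasks` blocking tasks and returns the pool size while all of them are blocked. */
    static int poolSizeUnderLoad(int corePoolSize, int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            corePoolSize, Integer.MAX_VALUE, 0L, TimeUnit.SECONDS,
            new SynchronousQueue<>(), Thread.ofVirtual().factory());
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> {
                    try {
                        release.await(); // keep the worker busy
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // No worker is idle, so every hand-off failed and a new thread was added.
            return pool.getPoolSize();
        } finally {
            release.countDown(); // let the tasks finish
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("pool size with core=2, tasks=5: " + poolSizeUnderLoad(2, 5));
    }
}
```

With a keep-alive of 0, threads above the core size exit as soon as they find no work, so only the core threads (and whatever they hold in `ThreadLocal`) survive between bursts.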
>>>
>>> Best Regards.
>>> Jianbin Chen, github-id: funky-eyes
>>>
>>> On Fri, Jan 23, 2026 at 21:52, Robert Engels <rengels at ix.netcom.com> wrote:
>>>
>>>> Because VT are so efficient to create, without any back pressure they
>>>> will all be created and running at essentially the same time (dramatically
>>>> raising the amount of memory in use) - versus with a pool of size N you
>>>> will have at most N running at once. In a REAL WORLD application there are
>>>> often external limiters (like number of tcp connections) that provide a
>>>> limit.
>>>>
>>>> If your tasks are purely cpu bound you should probably be using a
>>>> capped thread pool of platform threads as it makes no sense to have more
>>>> threads than available cores.
>>>>
>>>>
>>>>
>>>> On Jan 23, 2026, at 7:42 AM, Jianbin Chen <jianbin at apache.org> wrote:
>>>>
>>>> 
>>>> The question is why I need to use a semaphore to control the number of
>>>> concurrently running tasks. In my particular example, the goal is simply to
>>>> keep the concurrency level the same across different thread pool
>>>> implementations so I can fairly compare which one completes all the tasks
>>>> faster. This isn't solely about memory consumption—purely from a
>>>> **performance** perspective (e.g., total throughput or wall-clock time to
>>>> finish the workload), the same number of concurrent tasks completes
>>>> noticeably faster when using pooled virtual threads.
>>>>
>>>> My email probably didn't explain this clearly enough. In reality, I
>>>> have two main questions:
>>>>
>>>> 1. When a third-party library uses `ThreadLocal` as a cache/pool (e.g.,
>>>> to hold expensive reusable objects like connections, formatters, or
>>>> parsers), is switching to a **pooled virtual thread executor** the only
>>>> viable solution—assuming we cannot modify the third-party library code?
>>>>
>>>> 2. When running the exact same number of concurrent tasks, pooled
>>>> virtual threads deliver better performance.
>>>>
>>>> Both questions point toward the same conclusion: for an application
>>>> originally built around a traditional platform thread pool, after upgrading
>>>> to JDK 21/25, moving to a **pooled virtual threads** approach is generally
>>>> superior to simply using non-pooled (unbounded) virtual threads.
>>>>
>>>> If any part of this reasoning or conclusion is mistaken, I would really
>>>> appreciate being corrected — thank you very much in advance for any
>>>> feedback or different experiences you can share!
>>>>
>>>> Best Regards.
>>>> Jianbin Chen, github-id: funky-eyes
>>>>
>>>> On Fri, Jan 23, 2026 at 20:58, robert engels <robaho at me.com> wrote:
>>>>
>>>>> Exactly, this is your problem. The total number of tasks will all be
>>>>> running at once in the thread per task model.
>>>>>
>>>>> On Jan 23, 2026, at 6:49 AM, Jianbin Chen <jianbin at apache.org> wrote:
>>>>>
>>>>> 
>>>>> Hi Robert,
>>>>>
>>>>> Thank you, but I'm a bit confused. In the example above, I only set
>>>>> the core pool size to 200 virtual threads, but for the specific test case
>>>>> we’re talking about, the concurrency isn’t actually being limited by the
>>>>> pool size at all. Since the maximum thread count is Integer.MAX_VALUE and
>>>>> it’s using a SynchronousQueue, tasks are handed off immediately and a new
>>>>> thread gets created to run them right away anyway.
>>>>>
>>>>> Best Regards.
>>>>> Jianbin Chen, github-id: funky-eyes
>>>>>
>>>>> On Fri, Jan 23, 2026 at 20:28, robert engels <robaho at me.com> wrote:
>>>>>
>>>>>> Try using a semaphore to limit the maximum number of tasks in
>>>>>> progress at any one time; the unbounded number of in-flight tasks is what
>>>>>> is causing your memory spike. Think of it this way: since virtual threads
>>>>>> are so cheap to create, you are essentially creating them all at once,
>>>>>> making the working set size equal to the maximum. So you have N * WSS,
>>>>>> whereas with the pool you have POOLSIZE * WSS.
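A minimal sketch of the semaphore suggestion, assuming one virtual thread per task and a simulated 10 ms of I/O; `BoundedVirtualThreads`, `runWithLimit`, and the chosen numbers are hypothetical. It also records the peak number of tasks in progress so the bound can be verified.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: caps the number of tasks in progress without pooling threads.
final class BoundedVirtualThreads {

    /** Runs `tasks` sleeping tasks, at most `limit` at a time; returns the peak observed concurrency. */
    static int runWithLimit(int tasks, int limit) {
        Semaphore permits = new Semaphore(limit);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < tasks; i++) {
                executor.execute(() -> {
                    try {
                        permits.acquire(); // wait here until a slot is free
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    try {
                        int now = inFlight.incrementAndGet();
                        peak.accumulateAndGet(now, Math::max);
                        Thread.sleep(10); // simulated I/O; buffers would be allocated here
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        inFlight.decrementAndGet();
                        permits.release();
                    }
                });
            }
        } // close() waits for all submitted tasks to finish
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("peak tasks in progress: " + runWithLimit(1_000, 200));
    }
}
```

Here all virtual threads are still created up front, but only `limit` of them hold a working set at a time, so memory scales with `limit * WSS`. Acquiring the permit in the submitting thread before calling `execute` would additionally bound the number of threads created.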
>>>>>>
>>>>>> On Jan 23, 2026, at 4:14 AM, Jianbin Chen <jianbin at apache.org> wrote:
>>>>>>
>>>>>> 
>>>>>> Hi Alan,
>>>>>>
>>>>>> Thanks for your reply and for mentioning JEP 444.
>>>>>> I’ve gone through the guidance in JEP 444 and have some understanding
>>>>>> of it — which is exactly why I’m feeling a bit puzzled in practice and
>>>>>> would really like to hear your thoughts.
>>>>>>
>>>>>> Background — ThreadLocal example (Aerospike)
>>>>>> ```java
>>>>>> private static final ThreadLocal<byte[]> BufferThreadLocal = new ThreadLocal<byte[]>() {
>>>>>>     @Override
>>>>>>     protected byte[] initialValue() {
>>>>>>         return new byte[DefaultBufferSize];
>>>>>>     }
>>>>>> };
>>>>>> ```
>>>>>> This Aerospike code allocates a default 8 KB byte[] the first time each
>>>>>> thread reads the ThreadLocal, then caches it for the lifetime of that thread.
>>>>>>
>>>>>> My concern
>>>>>> - With a traditional platform-thread pool, those ThreadLocal byte[]
>>>>>> instances are effectively reused because threads are long-lived and pooled.
>>>>>> - If we switch to creating a brand-new virtual thread per task (no
>>>>>> pooling), each virtual thread gets its own fresh ThreadLocal byte[], which
>>>>>> leads to many short-lived 8KB allocations.
>>>>>> - That raises allocation rate and GC pressure (despite collectors
>>>>>> like ZGC), because ThreadLocal caches aren’t reused when threads are
>>>>>> ephemeral.
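The concern above can be made concrete by counting how often the `ThreadLocal` initializer runs. This is a sketch under assumptions, not the Aerospike code: it uses `ThreadLocal.withInitial` with an 8 KB array, and `ThreadLocalReuseDemo` and `buffersAllocated` are hypothetical names.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: counts ThreadLocal buffer allocations per executor type.
final class ThreadLocalReuseDemo {

    /** Runs `tasks` tasks on the executor and returns how many 8 KB buffers were allocated. */
    static int buffersAllocated(ExecutorService executor, int tasks) {
        AtomicInteger allocations = new AtomicInteger();
        ThreadLocal<byte[]> buffer = ThreadLocal.withInitial(() -> {
            allocations.incrementAndGet(); // runs once per thread that calls get()
            return new byte[8 * 1024];
        });
        try (executor) { // close() waits for all submitted tasks to finish
            for (int i = 0; i < tasks; i++) {
                executor.execute(() -> buffer.get()[0] = 1);
            }
        }
        return allocations.get();
    }

    public static void main(String[] args) {
        System.out.println("virtual thread per task: "
            + buffersAllocated(Executors.newVirtualThreadPerTaskExecutor(), 1_000));
        System.out.println("fixed pool of 4 threads: "
            + buffersAllocated(Executors.newFixedThreadPool(4), 1_000));
    }
}
```

With one thread per task the initializer runs once per task, while a pool of N threads runs it at most N times regardless of the task count.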
>>>>>>
>>>>>> So my question is: for applications originally designed around
>>>>>> platform-thread pools, wouldn’t partially pooling virtual threads be
>>>>>> beneficial? For example, Tomcat’s default max threads is 200 — if I keep a
>>>>>> pool of 200 virtual threads, then when load exceeds that core size, a
>>>>>> SynchronousQueue will naturally cause new virtual threads to be created on
>>>>>> demand. This seems to preserve the behavior that ThreadLocal-based
>>>>>> libraries expect, without losing the ability to expand under spikes. Since
>>>>>> virtual threads are very lightweight, pooling a reasonable number (e.g.,
>>>>>> 200) seems to have negligible memory downside while retaining ThreadLocal
>>>>>> cache effectiveness.
>>>>>>
>>>>>> Empirical test I ran
>>>>>> (I ran a microbenchmark comparing an unpooled per-task virtual-thread
>>>>>> executor and a ThreadPoolExecutor that keeps 200 core virtual threads.)
>>>>>>
>>>>>> ```java
>>>>>> public static void main(String[] args) throws InterruptedException {
>>>>>>     // Unpooled: one new virtual thread per task.
>>>>>>     Executor executor =
>>>>>>         Executors.newThreadPerTaskExecutor(Thread.ofVirtual().name("test-", 1).factory());
>>>>>>     // Pooled: 200 core virtual threads, growing on demand beyond that.
>>>>>>     Executor executor2 = new ThreadPoolExecutor(
>>>>>>         200,
>>>>>>         Integer.MAX_VALUE,
>>>>>>         0L,
>>>>>>         java.util.concurrent.TimeUnit.SECONDS,
>>>>>>         new SynchronousQueue<>(),
>>>>>>         Thread.ofVirtual().name("test-threadpool-", 1).factory()
>>>>>>     );
>>>>>>
>>>>>>     // Warm-up
>>>>>>     for (int i = 0; i < 10100; i++) {
>>>>>>         executor.execute(() -> {
>>>>>>             // simulate I/O wait
>>>>>>             try { Thread.sleep(100); } catch (InterruptedException e) { throw new RuntimeException(e); }
>>>>>>         });
>>>>>>         executor2.execute(() -> {
>>>>>>             // simulate I/O wait
>>>>>>             try { Thread.sleep(100); } catch (InterruptedException e) { throw new RuntimeException(e); }
>>>>>>         });
>>>>>>     }
>>>>>>
>>>>>>     // Ensure JIT + warm-up complete
>>>>>>     Thread.sleep(5000);
>>>>>>
>>>>>>     // Timed run 1: unpooled virtual threads
>>>>>>     long start = System.currentTimeMillis();
>>>>>>     CountDownLatch countDownLatch = new CountDownLatch(50000);
>>>>>>     for (int i = 0; i < 50000; i++) {
>>>>>>         executor.execute(() -> {
>>>>>>             try { Thread.sleep(100); countDownLatch.countDown(); }
>>>>>>             catch (InterruptedException e) { throw new RuntimeException(e); }
>>>>>>         });
>>>>>>     }
>>>>>>     countDownLatch.await();
>>>>>>     System.out.println("thread time: " + (System.currentTimeMillis() - start) + " ms");
>>>>>>
>>>>>>     // Timed run 2: pooled virtual threads
>>>>>>     start = System.currentTimeMillis();
>>>>>>     CountDownLatch countDownLatch2 = new CountDownLatch(50000);
>>>>>>     for (int i = 0; i < 50000; i++) {
>>>>>>         executor2.execute(() -> {
>>>>>>             try { Thread.sleep(100); countDownLatch2.countDown(); }
>>>>>>             catch (InterruptedException e) { throw new RuntimeException(e); }
>>>>>>         });
>>>>>>     }
>>>>>>     // Wait on the second latch; countDownLatch is already at zero, so
>>>>>>     // awaiting it here would return immediately and time only the submission.
>>>>>>     countDownLatch2.await();
>>>>>>     System.out.println("thread pool time: " + (System.currentTimeMillis() - start) + " ms");
>>>>>> }
>>>>>> ```
>>>>>>
>>>>>> Result summary
>>>>>> - In my runs, the pooled virtual-thread executor (executor2)
>>>>>> performed better than the unpooled per-task virtual-thread executor.
>>>>>> - Even when I increased load by 10x or 100x, the pooled
>>>>>> virtual-thread executor still showed better performance.
>>>>>> - In realistic workloads, it seems pooling some virtual threads
>>>>>> reduces allocation/GC overhead and improves throughput compared to strictly
>>>>>> unpooled virtual threads.
>>>>>>
>>>>>> Final thought / request for feedback
>>>>>> - From my perspective, for systems originally tuned for
>>>>>> platform-thread pools, partially pooling virtual threads seems to have no
>>>>>> obvious downside and can restore ThreadLocal cache effectiveness used by
>>>>>> many third-party libraries.
>>>>>> - If I’ve misunderstood JEP 444 recommendations, virtual-thread
>>>>>> semantics, or ThreadLocal behavior, please point out what I’m missing. I’d
>>>>>> appreciate your guidance.
>>>>>>
>>>>>> Best Regards.
>>>>>> Jianbin Chen, github-id: funky-eyes
>>>>>>
>>>>>> On Fri, Jan 23, 2026 at 17:27, Alan Bateman <alan.bateman at oracle.com> wrote:
>>>>>>
>>>>>>> On 23/01/2026 07:30, Jianbin Chen wrote:
>>>>>>> > :
>>>>>>> >
>>>>>>> > So my question is:
>>>>>>> >
>>>>>>> > **In scenarios where third-party libraries heavily rely on ThreadLocal
>>>>>>> > for caching / buffering (and we cannot change those libraries to use
>>>>>>> > object pools instead), is explicitly pooling virtual threads (using a
>>>>>>> > ThreadPoolExecutor with virtual thread factory) considered a
>>>>>>> > recommended / acceptable workaround?**
>>>>>>> >
>>>>>>> > Or are there better / more idiomatic ways to handle this kind of
>>>>>>> > compatibility issue with legacy ThreadLocal-based libraries when
>>>>>>> > migrating to virtual threads?
>>>>>>> >
>>>>>>> > I have already opened a related discussion in the Dubbo project (since
>>>>>>> > Dubbo is one of the libraries affected in our stack):
>>>>>>> >
>>>>>>> > https://github.com/apache/dubbo/issues/16042
>>>>>>> >
>>>>>>> > Would love to hear your thoughts, especially from people who have
>>>>>>> > experience running large-scale virtual-thread-based services with
>>>>>>> > mixed third-party dependencies.
>>>>>>> >
>>>>>>>
>>>>>>> The guidelines that we put in JEP 444 [1] are to not pool virtual threads
>>>>>>> and to avoid caching costly resources in thread locals. Virtual threads
>>>>>>> support thread locals of course but that is not useful when some library
>>>>>>> is looking to share a costly resource between tasks that run on the same
>>>>>>> thread in a thread pool.
>>>>>>>
>>>>>>> I don't know anything about Aerospike but working with the maintainers
>>>>>>> of that library to re-work its buffer management seems like the right
>>>>>>> course of action here. Your mail says "byte buffers". If this is
>>>>>>> ByteBuffer it might be that they are caching direct buffers as they are
>>>>>>> expensive to create (and managed by the GC). Maybe they could look at
>>>>>>> using MemorySegment (it's easy to get a ByteBuffer view of a memory
>>>>>>> segment) and allocate from an arena that better matches the lifecycle.
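A minimal sketch of the `MemorySegment` idea, assuming JDK 22 or later (where `java.lang.foreign` is a final API) and assuming, purely for illustration, that one confined arena per task is the lifecycle that fits. `ArenaBufferSketch` and `roundTrip` are hypothetical names; this is not how Aerospike manages buffers.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.ByteBuffer;

// Illustrative only: a task-scoped off-heap buffer with a ByteBuffer view.
final class ArenaBufferSketch {

    /** Writes and reads back one int through a ByteBuffer view of an arena-allocated segment. */
    static int roundTrip(int value) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment segment = arena.allocate(8 * 1024); // 8 KB, owned by the arena
            ByteBuffer view = segment.asByteBuffer();         // ByteBuffer view of the same memory
            view.putInt(0, value);
            return view.getInt(0);
        } // the arena closes here and the memory is released deterministically
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(42));
    }
}
```

The buffer's lifetime is tied to the `try` block rather than to a thread, so nothing depends on thread reuse. Whether this is cheaper than a heap `byte[]` for a given workload would need to be measured.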
>>>>>>>
>>>>>>> Hopefully others will share their experiences with migration as it is
>>>>>>> indeed challenging to migrate code developed for thread pools to work
>>>>>>> efficiently on virtual threads where there is a 1-1 relationship between
>>>>>>> the task to execute and the thread.
>>>>>>>
>>>>>>> -Alan
>>>>>>>
>>>>>>> [1] https://openjdk.org/jeps/444#Thread-local-variables
>>>>>>>
>>>>>>