Performance Issues with Virtual Threads + ThreadLocal Caching in Third-Party Libraries (JDK 25)

Jianbin Chen jianbin at apache.org
Sat Jan 24 05:55:39 UTC 2026


Hi Francesco,

I modified my example as follows:

```java
public static void main(String[] args) throws InterruptedException {
    Executor executor = Executors.newVirtualThreadPerTaskExecutor();
    Executor executor2 = new ThreadPoolExecutor(200, Integer.MAX_VALUE, 0L,
        java.util.concurrent.TimeUnit.SECONDS,
        new SynchronousQueue<>(), Thread.ofVirtual().factory());

    // Warm-up: exercise both executors before measuring
    for (int i = 0; i < 10100; i++) {
        executor.execute(() -> {
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        });
        executor2.execute(() -> {
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        });
    }
    Thread.sleep(5000);

    // Phase 1: non-pooled virtual threads
    long start = System.currentTimeMillis();
    CountDownLatch countDownLatch = new CountDownLatch(5000000);
    for (int i = 0; i < 5000000; i++) {
        executor.execute(() -> {
            try {
                Thread.sleep(100);
                countDownLatch.countDown();
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        });
    }
    countDownLatch.await();
    System.out.println("thread time: " + (System.currentTimeMillis() - start) + " ms");

    // Phase 2: pooled virtual threads (ThreadPoolExecutor with a virtual-thread factory)
    start = System.currentTimeMillis();
    CountDownLatch countDownLatch2 = new CountDownLatch(5000000);
    for (int i = 0; i < 5000000; i++) {
        executor2.execute(() -> {
            try {
                Thread.sleep(100);
                countDownLatch2.countDown();
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        });
    }
    countDownLatch2.await();
    System.out.println("thread pool time: " + (System.currentTimeMillis() - start) + " ms");
}
```

This time I constructed the Executor directly with
Executors.newVirtualThreadPerTaskExecutor(), so the virtual threads are no
longer named; however, the results still show the pooled virtual threads
outperforming the non-pooled virtual threads.


On Fri, Jan 23, 2026 at 23:39, Francesco Nigro <nigro.fra at gmail.com> wrote:

> I would say, yes:
>
> https://github.com/openjdk/jdk21/blob/890adb6410dab4606a4f26a942aed02fb2f55387/src/java.base/share/classes/java/lang/ThreadBuilders.java#L317
> unless the fix will be backported - surely @Andrew Haley
> <aph-open at littlepinkcloud.com> or @Alan Bateman <alan.bateman at oracle.com>
>  knows
>
> On Fri, Jan 23, 2026 at 16:32, Jianbin Chen <jianbin at apache.org> wrote:
>
> > Hi Francesco,
> >
> > I'd like to know if there's a similar issue in JDK 21?
> >
> > Best Regards.
> > Jianbin Chen, github-id: funky-eyes
> >
> > On Fri, Jan 23, 2026 at 23:14, Francesco Nigro <nigro.fra at gmail.com> wrote:
> >
> >> In the original code snippet I see named VThreads (with a counter), so
> >> be aware of https://bugs.openjdk.org/browse/JDK-8372410
> >>
> >> On Fri, Jan 23, 2026 at 15:52, Jianbin Chen <jianbin at apache.org> wrote:
> >>
> >>> I'm sorry, I forgot to mention the machine I used for the load test. My
> >>> server has 2 cores and 4 GB RAM, and the JVM heap was set to 2880m.
> >>> Under my test load (about 20,000 QPS), non-pooled virtual threads
> >>> generate at least 20,000 × 8 KB = ~156 MB of byte[] allocations per
> >>> second just from that 8 KB buffer; that doesn't include other object
> >>> allocations. With a 2880 MB heap this allocation rate already forces
> >>> very frequent GC, and frequent GC raises CPU usage, which in turn
> >>> significantly increases average response time and p99 / p999 latency.
> >>>
> >>> Pooling is usually introduced to solve performance issues: object pools
> >>> and connection pools exist to quickly reuse cached resources and improve
> >>> performance. So pooling virtual threads also yields obvious benefits,
> >>> especially for memory-constrained, I/O-bound applications (gateways,
> >>> proxies, etc.) that are sensitive to latency.
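> >>>
> >>> (For reference, a minimal sketch of the explicit object-pool idea
> >>> mentioned above, i.e. a buffer pool that is reused across tasks even
> >>> when each task runs on a fresh virtual thread; the class name and
> >>> sizing are illustrative, not from any existing library:)
> >>>
> >>> ```java
> >>> import java.util.concurrent.ConcurrentLinkedQueue;
> >>>
> >>> // Hypothetical explicit buffer pool shared by all tasks/threads.
> >>> class BufferPool {
> >>>     private final ConcurrentLinkedQueue<byte[]> pool = new ConcurrentLinkedQueue<>();
> >>>     private final int bufferSize;
> >>>
> >>>     BufferPool(int bufferSize) {
> >>>         this.bufferSize = bufferSize;
> >>>     }
> >>>
> >>>     byte[] acquire() {
> >>>         byte[] buf = pool.poll();
> >>>         return (buf != null) ? buf : new byte[bufferSize];  // allocate only on a miss
> >>>     }
> >>>
> >>>     void release(byte[] buf) {
> >>>         pool.offer(buf);  // hand the buffer back for the next task to reuse
> >>>     }
> >>> }
> >>> ```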
> >>>
> >>> Best Regards.
> >>> Jianbin Chen, github-id: funky-eyes
> >>>
> >>> On Fri, Jan 23, 2026 at 22:20, Robert Engels <rengels at ix.netcom.com> wrote:
> >>>
> >>>> I understand. I was trying to explain how you can avoid thread locals
> >>>> and still maintain the performance. It's unlikely that allocating an 8k
> >>>> buffer is a performance bottleneck in a real program if the task is not
> >>>> CPU bound (depending on the granularity you make your tasks) - but 2M
> >>>> tasks running simultaneously would require 16 GB of memory, not
> >>>> including the stacks.
> >>>>
> >>>> You cannot simply use the thread-per-task model without understanding
> >>>> the CPU, I/O, and memory footprints of your tasks and then configuring
> >>>> appropriately.
> >>>>
> >>>> On Jan 23, 2026, at 8:10 AM, Jianbin Chen <jianbin at apache.org> wrote:
> >>>>
> >>>> I'm sorry, Robert—perhaps I didn't explain my example clearly enough.
> >>>> Here's the code in question:
> >>>>
> >>>> ```java
> >>>> Executor executor2 = new ThreadPoolExecutor(
> >>>>     200,
> >>>>     Integer.MAX_VALUE,
> >>>>     0L,
> >>>>     java.util.concurrent.TimeUnit.SECONDS,
> >>>>     new SynchronousQueue<>(),
> >>>>     Thread.ofVirtual().name("test-threadpool-", 1).factory()
> >>>> );
> >>>> ```
> >>>>
> >>>> In this example, the pooled virtual threads don't implement any
> >>>> backpressure mechanism; they simply maintain a core pool of 200 virtual
> >>>> threads. Given that the queue is a `SynchronousQueue` and the maximum
> >>>> pool size is set to `Integer.MAX_VALUE`, once the concurrent tasks
> >>>> exceed 200, its behavior becomes identical to that of non-pooled
> >>>> virtual threads.
> >>>>
> >>>> From my perspective, this example demonstrates that the benefits of
> >>>> pooling virtual threads outweigh those of creating a new virtual thread
> >>>> for every single task. In IO-bound scenarios, the virtual threads are
> >>>> directly reused rather than being recreated each time, and the memory
> >>>> footprint of virtual threads is far smaller than that of platform
> >>>> threads (whose stack size is controlled by the `-Xss` flag).
> >>>> Additionally, with pooled virtual threads, the 8KB `byte[]` cache I
> >>>> mentioned earlier (stored in a `ThreadLocal`) can also be reused, which
> >>>> further reduces overall memory usage. Wouldn't you agree?
> >>>>
> >>>> Best Regards.
> >>>> Jianbin Chen, github-id: funky-eyes
> >>>>
> >>>> On Fri, Jan 23, 2026 at 21:52, Robert Engels <rengels at ix.netcom.com> wrote:
> >>>>
> >>>>> Because VTs are so efficient to create, without any back pressure they
> >>>>> will all be created and running at essentially the same time
> >>>>> (dramatically raising the amount of memory in use), whereas with a pool
> >>>>> of size N you will have at most N running at once. In a REAL WORLD
> >>>>> application there are often external limiters (like the number of TCP
> >>>>> connections) that provide a limit.
> >>>>>
> >>>>> If your tasks are purely CPU bound, you should probably be using a
> >>>>> capped thread pool of platform threads, as it makes no sense to have
> >>>>> more threads than available cores.
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Jan 23, 2026, at 7:42 AM, Jianbin Chen <jianbin at apache.org> wrote:
> >>>>>
> >>>>> The question is why I need to use a semaphore to control the number of
> >>>>> concurrently running tasks. In my particular example, the goal is
> >>>>> simply to keep the concurrency level the same across different thread
> >>>>> pool implementations so I can fairly compare which one completes all
> >>>>> the tasks faster. This isn't solely about memory consumption: purely
> >>>>> from a **performance** perspective (e.g., total throughput or
> >>>>> wall-clock time to finish the workload), the same number of concurrent
> >>>>> tasks completes noticeably faster when using pooled virtual threads.
> >>>>>
> >>>>> My email probably didn't explain this clearly enough. In reality, I
> >>>>> have two main questions:
> >>>>>
> >>>>> 1. When a third-party library uses `ThreadLocal` as a cache/pool
> >>>>> (e.g., to hold expensive reusable objects like connections, formatters,
> >>>>> or parsers), is switching to a **pooled virtual thread executor** the
> >>>>> only viable solution, assuming we cannot modify the third-party
> >>>>> library code?
> >>>>>
> >>>>> 2. When running the exact same number of concurrent tasks, pooled
> >>>>> virtual threads deliver better performance.
> >>>>>
> >>>>> Both questions point toward the same conclusion: for an application
> >>>>> originally built around a traditional platform thread pool, after
> >>>>> upgrading to JDK 21/25, moving to a **pooled virtual threads** approach
> >>>>> is generally superior to simply using non-pooled (unbounded) virtual
> >>>>> threads.
> >>>>>
> >>>>> If any part of this reasoning or conclusion is mistaken, I would really
> >>>>> appreciate being corrected. Thank you very much in advance for any
> >>>>> feedback or different experiences you can share!
> >>>>>
> >>>>> Best Regards.
> >>>>> Jianbin Chen, github-id: funky-eyes
> >>>>>
> >>>>> On Fri, Jan 23, 2026 at 20:58, robert engels <robaho at me.com> wrote:
> >>>>>
> >>>>>> Exactly, this is your problem. The total number of tasks will all be
> >>>>>> running at once in the thread per task model.
> >>>>>>
> >>>>>> On Jan 23, 2026, at 6:49 AM, Jianbin Chen <jianbin at apache.org> wrote:
> >>>>>>
> >>>>>> Hi Robert,
> >>>>>>
> >>>>>> Thank you, but I'm a bit confused. In the example above, I only set
> >>>>>> the core pool size to 200 virtual threads, but for the specific test
> >>>>>> case we're talking about, the concurrency isn't actually being limited
> >>>>>> by the pool size at all. Since the maximum thread count is
> >>>>>> Integer.MAX_VALUE and it's using a SynchronousQueue, tasks are handed
> >>>>>> off immediately and a new thread gets created to run them right away
> >>>>>> anyway.
> >>>>>>
> >>>>>> Best Regards.
> >>>>>> Jianbin Chen, github-id: funky-eyes
> >>>>>>
> >>>>>> On Fri, Jan 23, 2026 at 20:28, robert engels <robaho at me.com> wrote:
> >>>>>>
> >>>>>>> Try using a semaphore to limit the maximum number of tasks in
> >>>>>>> progress at any one time - that is what is causing your memory spike.
> >>>>>>> Think of it this way: since VT threads are so cheap to create, you
> >>>>>>> are essentially creating them all at once, making the working set
> >>>>>>> size equal to the maximum. So you have N * WSS, whereas in the other
> >>>>>>> you have POOLSIZE * WSS.
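> >>>>>>>
> >>>>>>> (A minimal sketch of that semaphore approach, assuming a per-task
> >>>>>>> virtual-thread executor; the wrapper class name and the permit count
> >>>>>>> are illustrative:)
> >>>>>>>
> >>>>>>> ```java
> >>>>>>> import java.util.concurrent.Executor;
> >>>>>>> import java.util.concurrent.Executors;
> >>>>>>> import java.util.concurrent.Semaphore;
> >>>>>>>
> >>>>>>> // Hypothetical wrapper: still one virtual thread per task, but at
> >>>>>>> // most maxInFlight tasks are allowed to be in progress at once.
> >>>>>>> class BoundedVirtualExecutor implements Executor {
> >>>>>>>     private final Executor delegate = Executors.newVirtualThreadPerTaskExecutor();
> >>>>>>>     private final Semaphore permits;
> >>>>>>>
> >>>>>>>     BoundedVirtualExecutor(int maxInFlight) {
> >>>>>>>         this.permits = new Semaphore(maxInFlight);
> >>>>>>>     }
> >>>>>>>
> >>>>>>>     @Override
> >>>>>>>     public void execute(Runnable task) {
> >>>>>>>         permits.acquireUninterruptibly();  // back pressure: submitter waits for a slot
> >>>>>>>         delegate.execute(() -> {
> >>>>>>>             try {
> >>>>>>>                 task.run();
> >>>>>>>             } finally {
> >>>>>>>                 permits.release();         // free the slot when the task completes
> >>>>>>>             }
> >>>>>>>         });
> >>>>>>>     }
> >>>>>>> }
> >>>>>>> ```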
> >>>>>>>
> >>>>>>> On Jan 23, 2026, at 4:14 AM, Jianbin Chen <jianbin at apache.org>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>> Hi Alan,
> >>>>>>>
> >>>>>>> Thanks for your reply and for mentioning JEP 444.
> >>>>>>> I've gone through the guidance in JEP 444 and have some understanding
> >>>>>>> of it, which is exactly why I'm feeling a bit puzzled in practice and
> >>>>>>> would really like to hear your thoughts.
> >>>>>>>
> >>>>>>> Background — ThreadLocal example (Aerospike)
> >>>>>>> ```java
> >>>>>>> private static final ThreadLocal<byte[]> BufferThreadLocal =
> >>>>>>>     new ThreadLocal<byte[]>() {
> >>>>>>>         @Override
> >>>>>>>         protected byte[] initialValue() {
> >>>>>>>             return new byte[DefaultBufferSize];
> >>>>>>>         }
> >>>>>>>     };
> >>>>>>> ```
> >>>>>>> This Aerospike code lazily allocates a default 8KB byte[] the first
> >>>>>>> time each thread accesses the buffer and stores it in a ThreadLocal
> >>>>>>> for per-thread caching.
> >>>>>>>
> >>>>>>> My concern
> >>>>>>> - With a traditional platform-thread pool, those ThreadLocal byte[]
> >>>>>>> instances are effectively reused because threads are long-lived and
> >>>>>>> pooled.
> >>>>>>> - If we switch to creating a brand-new virtual thread per task (no
> >>>>>>> pooling), each virtual thread gets its own fresh ThreadLocal byte[],
> >>>>>>> which leads to many short-lived 8KB allocations.
> >>>>>>> - That raises the allocation rate and GC pressure (despite collectors
> >>>>>>> like ZGC), because ThreadLocal caches aren't reused when threads are
> >>>>>>> ephemeral.
> >>>>>>>
> >>>>>>> So my question is: for applications originally designed around
> >>>>>>> platform-thread pools, wouldn't partially pooling virtual threads be
> >>>>>>> beneficial? For example, Tomcat's default max threads is 200; if I
> >>>>>>> keep a pool of 200 virtual threads, then when load exceeds that core
> >>>>>>> size, a SynchronousQueue will naturally cause new virtual threads to
> >>>>>>> be created on demand. This seems to preserve the behavior that
> >>>>>>> ThreadLocal-based libraries expect, without losing the ability to
> >>>>>>> expand under spikes. Since virtual threads are very lightweight,
> >>>>>>> pooling a reasonable number (e.g., 200) seems to have negligible
> >>>>>>> memory downside while retaining ThreadLocal cache effectiveness.
> >>>>>>>
> >>>>>>> Empirical test I ran
> >>>>>>> (I ran a microbenchmark comparing an unpooled per-task virtual-thread
> >>>>>>> executor and a ThreadPoolExecutor that keeps 200 core virtual
> >>>>>>> threads.)
> >>>>>>>
> >>>>>>> ```java
> >>>>>>> public static void main(String[] args) throws InterruptedException {
> >>>>>>>     Executor executor = Executors.newThreadPerTaskExecutor(
> >>>>>>>         Thread.ofVirtual().name("test-", 1).factory());
> >>>>>>>     Executor executor2 = new ThreadPoolExecutor(
> >>>>>>>         200,
> >>>>>>>         Integer.MAX_VALUE,
> >>>>>>>         0L,
> >>>>>>>         java.util.concurrent.TimeUnit.SECONDS,
> >>>>>>>         new SynchronousQueue<>(),
> >>>>>>>         Thread.ofVirtual().name("test-threadpool-", 1).factory()
> >>>>>>>     );
> >>>>>>>
> >>>>>>>     // Warm-up
> >>>>>>>     for (int i = 0; i < 10100; i++) {
> >>>>>>>         executor.execute(() -> {
> >>>>>>>             // simulate I/O wait
> >>>>>>>             try { Thread.sleep(100); } catch (InterruptedException e) { throw new RuntimeException(e); }
> >>>>>>>         });
> >>>>>>>         executor2.execute(() -> {
> >>>>>>>             // simulate I/O wait
> >>>>>>>             try { Thread.sleep(100); } catch (InterruptedException e) { throw new RuntimeException(e); }
> >>>>>>>         });
> >>>>>>>     }
> >>>>>>>
> >>>>>>>     // Ensure JIT + warm-up complete
> >>>>>>>     Thread.sleep(5000);
> >>>>>>>
> >>>>>>>     long start = System.currentTimeMillis();
> >>>>>>>     CountDownLatch countDownLatch = new CountDownLatch(50000);
> >>>>>>>     for (int i = 0; i < 50000; i++) {
> >>>>>>>         executor.execute(() -> {
> >>>>>>>             try { Thread.sleep(100); countDownLatch.countDown(); }
> >>>>>>>             catch (InterruptedException e) { throw new RuntimeException(e); }
> >>>>>>>         });
> >>>>>>>     }
> >>>>>>>     countDownLatch.await();
> >>>>>>>     System.out.println("thread time: " + (System.currentTimeMillis() - start) + " ms");
> >>>>>>>
> >>>>>>>     start = System.currentTimeMillis();
> >>>>>>>     CountDownLatch countDownLatch2 = new CountDownLatch(50000);
> >>>>>>>     for (int i = 0; i < 50000; i++) {
> >>>>>>>         executor2.execute(() -> {
> >>>>>>>             try { Thread.sleep(100); countDownLatch2.countDown(); }
> >>>>>>>             catch (InterruptedException e) { throw new RuntimeException(e); }
> >>>>>>>         });
> >>>>>>>     }
> >>>>>>>     countDownLatch2.await();
> >>>>>>>     System.out.println("thread pool time: " + (System.currentTimeMillis() - start) + " ms");
> >>>>>>> }
> >>>>>>> ```
> >>>>>>>
> >>>>>>> Result summary
> >>>>>>> - In my runs, the pooled virtual-thread executor (executor2)
> >>>>>>> performed better than the unpooled per-task virtual-thread executor.
> >>>>>>> - Even when I increased the load by 10x or 100x, the pooled
> >>>>>>> virtual-thread executor still showed better performance.
> >>>>>>> - In realistic workloads, it seems pooling some virtual threads
> >>>>>>> reduces allocation/GC overhead and improves throughput compared to
> >>>>>>> strictly unpooled virtual threads.
> >>>>>>>
> >>>>>>> Final thought / request for feedback
> >>>>>>> - From my perspective, for systems originally tuned for
> >>>>>>> platform-thread pools, partially pooling virtual threads seems to
> >>>>>>> have no obvious downside and can restore the ThreadLocal cache
> >>>>>>> effectiveness that many third-party libraries rely on.
> >>>>>>> - If I've misunderstood the JEP 444 recommendations, virtual-thread
> >>>>>>> semantics, or ThreadLocal behavior, please point out what I'm
> >>>>>>> missing. I'd appreciate your guidance.
> >>>>>>>
> >>>>>>> Best Regards.
> >>>>>>> Jianbin Chen, github-id: funky-eyes
> >>>>>>>
> >>>>>>> On Fri, Jan 23, 2026 at 17:27, Alan Bateman <alan.bateman at oracle.com> wrote:
> >>>>>>>
> >>>>>>>> On 23/01/2026 07:30, Jianbin Chen wrote:
> >>>>>>>> > :
> >>>>>>>> >
> >>>>>>>> > So my question is:
> >>>>>>>> >
> >>>>>>>> > **In scenarios where third-party libraries heavily rely on
> >>>>>>>> > ThreadLocal for caching / buffering (and we cannot change those
> >>>>>>>> > libraries to use object pools instead), is explicitly pooling
> >>>>>>>> > virtual threads (using a ThreadPoolExecutor with virtual thread
> >>>>>>>> > factory) considered a recommended / acceptable workaround?**
> >>>>>>>> >
> >>>>>>>> > Or are there better / more idiomatic ways to handle this kind of
> >>>>>>>> > compatibility issue with legacy ThreadLocal-based libraries when
> >>>>>>>> > migrating to virtual threads?
> >>>>>>>> >
> >>>>>>>> > I have already opened a related discussion in the Dubbo project
> >>>>>>>> > (since Dubbo is one of the libraries affected in our stack):
> >>>>>>>> >
> >>>>>>>> > https://github.com/apache/dubbo/issues/16042
> >>>>>>>> >
> >>>>>>>> > Would love to hear your thoughts, especially from people who have
> >>>>>>>> > experience running large-scale virtual-thread-based services with
> >>>>>>>> > mixed third-party dependencies.
> >>>>>>>> >
> >>>>>>>>
> >>>>>>>> The guidelines we put in JEP 444 [1] are to not pool virtual
> >>>>>>>> threads and to avoid caching costly resources in thread locals.
> >>>>>>>> Virtual threads support thread locals of course, but that is not
> >>>>>>>> useful when some library is looking to share a costly resource
> >>>>>>>> between tasks that run on the same thread in a thread pool.
> >>>>>>>>
> >>>>>>>> I don't know anything about Aerospike but working with the
> >>>>>>>> maintainers of that library to re-work its buffer management seems
> >>>>>>>> like the right course of action here. Your mail says "byte buffers".
> >>>>>>>> If this is ByteBuffer it might be that they are caching direct
> >>>>>>>> buffers as they are expensive to create (and managed by the GC).
> >>>>>>>> Maybe they could look at using MemorySegment (it's easy to get a
> >>>>>>>> ByteBuffer view of a memory segment) and allocate from an arena that
> >>>>>>>> better matches the lifecycle.
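> >>>>>>>>
> >>>>>>>> (A minimal sketch of that idea, assuming the buffer is only needed
> >>>>>>>> for the duration of a single task; the class/method names and the
> >>>>>>>> 8 KB size are illustrative, not Aerospike API:)
> >>>>>>>>
> >>>>>>>> ```java
> >>>>>>>> import java.lang.foreign.Arena;
> >>>>>>>> import java.lang.foreign.MemorySegment;
> >>>>>>>> import java.nio.ByteBuffer;
> >>>>>>>>
> >>>>>>>> class PerTaskBuffer {
> >>>>>>>>     // Hypothetical per-task buffer: allocate from a confined arena
> >>>>>>>>     // whose lifetime matches the task, instead of caching a byte[]
> >>>>>>>>     // in a ThreadLocal.
> >>>>>>>>     static void runTask() {
> >>>>>>>>         try (Arena arena = Arena.ofConfined()) {
> >>>>>>>>             MemorySegment segment = arena.allocate(8 * 1024); // 8 KB scratch space
> >>>>>>>>             ByteBuffer buffer = segment.asByteBuffer();       // ByteBuffer view of the segment
> >>>>>>>>             // ... use the buffer for this task's I/O ...
> >>>>>>>>         } // memory is released deterministically when the arena closes
> >>>>>>>>     }
> >>>>>>>> }
> >>>>>>>> ```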
> >>>>>>>>
> >>>>>>>> Hopefully others will share their experiences with migration as it
> >>>>>>>> is indeed challenging to migrate code developed for thread pools to
> >>>>>>>> work efficiently on virtual threads, where there is a 1-1
> >>>>>>>> relationship between the task to execute and the thread.
> >>>>>>>> -Alan
> >>>>>>>>
> >>>>>>>> [1] https://openjdk.org/jeps/444#Thread-local-variables
> >>>>>>>>
> >>>>>>>
>

