Performance Issues with Virtual Threads + ThreadLocal Caching in Third-Party Libraries (JDK 25)
Jianbin Chen
jianbin at apache.org
Fri Jan 23 10:12:31 UTC 2026
Hi Alan,
Thanks for your reply and for mentioning JEP 444.
I’ve gone through the guidance in JEP 444 and understand its recommendations,
which is exactly why I’m puzzled in practice and would really like to hear
your thoughts.
Background — ThreadLocal example (Aerospike)
```java
private static final ThreadLocal<byte[]> BufferThreadLocal =
        new ThreadLocal<byte[]>() {
            @Override
            protected byte[] initialValue() {
                return new byte[DefaultBufferSize];
            }
        };
```
This Aerospike code lazily allocates a default 8KB byte[] the first time each
thread touches the buffer, and caches it in a ThreadLocal for per-thread reuse.
My concern
- With a traditional platform-thread pool, those ThreadLocal byte[]
instances are effectively reused because threads are long-lived and pooled.
- If we switch to creating a brand-new virtual thread per task (no
pooling), each virtual thread gets its own fresh ThreadLocal byte[], which
leads to many short-lived 8KB allocations.
- That raises the allocation rate and GC pressure (even with collectors like
ZGC), because ThreadLocal caches are never reused once threads are ephemeral.
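To make the concern concrete, here is a small sketch of my own (the class and
method names are mine, not Aerospike's) that counts how many buffers the
ThreadLocal's initializer actually allocates under each execution model,
assuming JDK 21+:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ThreadLocalAllocDemo {
    // Counts how many 8KB buffers the ThreadLocal allocates (one per thread, lazily).
    static final AtomicInteger allocations = new AtomicInteger();
    static final ThreadLocal<byte[]> BUFFER = ThreadLocal.withInitial(() -> {
        allocations.incrementAndGet();
        return new byte[8192];
    });

    // One fresh virtual thread per task: every task triggers a new allocation.
    static int unpooledAllocations(int tasks) {
        allocations.set(0);
        try (var perTask = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < tasks; i++) perTask.execute(BUFFER::get);
        } // close() waits for all submitted tasks to finish
        return allocations.get();
    }

    // A fixed pool of long-lived virtual threads: at most poolSize allocations.
    static int pooledAllocations(int tasks, int poolSize) throws InterruptedException {
        allocations.set(0);
        var pooled = new ThreadPoolExecutor(poolSize, poolSize, 0L, TimeUnit.SECONDS,
                new LinkedBlockingQueue<>(), Thread.ofVirtual().factory());
        for (int i = 0; i < tasks; i++) pooled.execute(BUFFER::get);
        pooled.shutdown();
        pooled.awaitTermination(10, TimeUnit.SECONDS);
        return allocations.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("unpooled: " + unpooledAllocations(100)); // 100 buffers
        System.out.println("pooled:   " + pooledAllocations(100, 4)); // at most 4
    }
}
```

With 100 tasks, the unpooled executor allocates 100 buffers while the pool of
4 virtual threads allocates at most 4, which is the allocation-rate gap I am
worried about.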
So my question is: for applications originally designed around
platform-thread pools, wouldn’t partially pooling virtual threads be
beneficial? For example, Tomcat’s default max threads is 200 — if I keep a
pool of 200 virtual threads, then when load exceeds that core size, a
SynchronousQueue will naturally cause new virtual threads to be created on
demand. This seems to preserve the behavior that ThreadLocal-based
libraries expect, without losing the ability to expand under spikes. Since
virtual threads are very lightweight, pooling a reasonable number (e.g.,
200) seems to have negligible memory downside while retaining ThreadLocal
cache effectiveness.
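The elasticity I am describing can be sketched with a small core size standing
in for 200 (the class name and numbers are mine, for illustration only): with
a SynchronousQueue there is no task buffering, so once the core threads are
busy every additional submission forces a new virtual thread.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ElasticVirtualPoolDemo {
    // Submits `tasks` blocking tasks to a pool with a smaller core size and
    // returns the observed pool size while all tasks are still blocked.
    static int poolSizeUnderLoad(int core, int tasks) throws InterruptedException {
        var pool = new ThreadPoolExecutor(
                core, Integer.MAX_VALUE, 60L, TimeUnit.SECONDS,
                new SynchronousQueue<>(), Thread.ofVirtual().factory());
        CountDownLatch release = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(tasks);
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                try {
                    release.await(); // hold the thread busy until we measure
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                done.countDown();
            });
        }
        int size = pool.getPoolSize(); // SynchronousQueue forced extra threads
        release.countDown();
        done.await();
        pool.shutdown();
        return size;
    }

    public static void main(String[] args) throws InterruptedException {
        // A core of 2 stands in for a Tomcat-style 200: all 10 tasks block at
        // once, so the pool must have grown to 10 threads to run them.
        System.out.println("pool size under load: " + poolSizeUnderLoad(2, 10));
    }
}
```

So the pool keeps 200 (here: 2) warm threads for ThreadLocal reuse in the
steady state, yet still expands on demand under a spike.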
Empirical test I ran
(I ran a microbenchmark comparing an unpooled per-task virtual-thread
executor and a ThreadPoolExecutor that keeps 200 core virtual threads.)
```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.Executors;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolBenchmark {
    public static void main(String[] args) throws InterruptedException {
        // Unpooled: a brand-new virtual thread per task
        Executor executor = Executors.newThreadPerTaskExecutor(
                Thread.ofVirtual().name("test-", 1).factory());
        // Pooled: 200 core virtual threads; SynchronousQueue spawns extras on demand
        Executor executor2 = new ThreadPoolExecutor(
                200,
                Integer.MAX_VALUE,
                0L,
                TimeUnit.SECONDS,
                new SynchronousQueue<>(),
                Thread.ofVirtual().name("test-threadpool-", 1).factory());

        // Warm-up
        for (int i = 0; i < 10100; i++) {
            executor.execute(() -> {
                // simulate I/O wait
                try { Thread.sleep(100); } catch (InterruptedException e) {
                    throw new RuntimeException(e);
                }
            });
            executor2.execute(() -> {
                // simulate I/O wait
                try { Thread.sleep(100); } catch (InterruptedException e) {
                    throw new RuntimeException(e);
                }
            });
        }
        // Ensure JIT + warm-up complete
        Thread.sleep(5000);

        long start = System.currentTimeMillis();
        CountDownLatch countDownLatch = new CountDownLatch(50000);
        for (int i = 0; i < 50000; i++) {
            executor.execute(() -> {
                try {
                    Thread.sleep(100);
                    countDownLatch.countDown();
                } catch (InterruptedException e) {
                    throw new RuntimeException(e);
                }
            });
        }
        countDownLatch.await();
        System.out.println("thread time: " + (System.currentTimeMillis() - start) + " ms");

        start = System.currentTimeMillis();
        CountDownLatch countDownLatch2 = new CountDownLatch(50000);
        for (int i = 0; i < 50000; i++) {
            executor2.execute(() -> {
                try {
                    Thread.sleep(100);
                    countDownLatch2.countDown();
                } catch (InterruptedException e) {
                    throw new RuntimeException(e);
                }
            });
        }
        countDownLatch2.await(); // wait for the pooled run to finish
        System.out.println("thread pool time: " + (System.currentTimeMillis() - start) + " ms");
    }
}
```
Result summary
- In my runs, the pooled virtual-thread executor (executor2) outperformed the
unpooled per-task virtual-thread executor.
- Even when I increased load by 10x or 100x, the pooled virtual-thread
executor still showed better performance.
- In realistic workloads, it seems pooling some virtual threads reduces
allocation/GC overhead and improves throughput compared to strictly
unpooled virtual threads.
Final thought / request for feedback
- From my perspective, for systems originally tuned for platform-thread
pools, partially pooling virtual threads seems to have no obvious downside
and can restore ThreadLocal cache effectiveness used by many third-party
libraries.
- If I’ve misunderstood JEP 444 recommendations, virtual-thread semantics,
or ThreadLocal behavior, please point out what I’m missing. I’d appreciate
your guidance.
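For completeness, the library-side alternative mentioned above (an explicit
shared object pool instead of ThreadLocal caching, which would require changes
in the third-party library) could look roughly like this sketch; the class
name and API are hypothetical, not Aerospike's:

```java
import java.util.concurrent.ArrayBlockingQueue;

// A minimal sketch of a bounded shared buffer pool. Reuse is keyed to the
// pool rather than the thread, so it works the same way on short-lived
// virtual threads as on pooled platform threads.
public class SharedBufferPool {
    private final ArrayBlockingQueue<byte[]> pool;
    private final int bufferSize;

    public SharedBufferPool(int capacity, int bufferSize) {
        this.pool = new ArrayBlockingQueue<>(capacity);
        this.bufferSize = bufferSize;
    }

    // Reuse a pooled buffer if one is available; otherwise allocate a new one.
    public byte[] acquire() {
        byte[] buf = pool.poll();
        return (buf != null) ? buf : new byte[bufferSize];
    }

    // Return the buffer for reuse; silently drop it if the pool is full.
    public void release(byte[] buf) {
        pool.offer(buf);
    }
}
```

The caller acquires a buffer around each operation and releases it in a
finally block, so the cap on live cached buffers is the pool capacity rather
than the (unbounded) number of virtual threads.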
Best Regards.
Jianbin Chen, github-id: funky-eyes
Alan Bateman <alan.bateman at oracle.com> wrote on Fri, Jan 23, 2026 at 17:27:
> On 23/01/2026 07:30, Jianbin Chen wrote:
> > :
> >
> > So my question is:
> >
> > **In scenarios where third-party libraries heavily rely on ThreadLocal
> > for caching / buffering (and we cannot change those libraries to use
> > object pools instead), is explicitly pooling virtual threads (using a
> > ThreadPoolExecutor with virtual thread factory) considered a
> > recommended / acceptable workaround?**
> >
> > Or are there better / more idiomatic ways to handle this kind of
> > compatibility issue with legacy ThreadLocal-based libraries when
> > migrating to virtual threads?
> >
> > I have already opened a related discussion in the Dubbo project (since
> > Dubbo is one of the libraries affected in our stack):
> >
> > https://github.com/apache/dubbo/issues/16042
> >
> > Would love to hear your thoughts — especially from people who have
> > experience running large-scale virtual-thread-based services with
> > mixed third-party dependencies.
> >
>
> The guidelines that we put in JEP 444 [1] are to not pool virtual threads
> and to avoid caching costly resources in thread locals. Virtual threads
> support thread locals of course, but that is not useful when some library
> is looking to share a costly resource between tasks that run on the same
> thread in a thread pool.
>
> I don't know anything about Aerospike but working with the maintainers
> of that library to re-work its buffer management seems like the right
> course of action here. Your mail says "byte buffers". If this is
> ByteBuffer it might be that they are caching direct buffers as they are
> expensive to create (and managed by the GC). Maybe they could look at
> using MemorySegment (it's easy to get a ByteBuffer view of a memory
> segment) and allocate from an arena that better matches the lifecycle.
>
> Hopefully others will share their experiences with migration as it is
> indeed challenging to migrate code developed for thread pools to work
> efficiently on virtual threads where there is 1-1 relationship between
> the task to execute and the thread.
>
> -Alan
>
> [1] https://openjdk.org/jeps/444#Thread-local-variables
>