<html class="apple-mail-supports-explicit-dark-mode"><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body dir="auto"><div dir="ltr"></div><div dir="ltr">I understand. I was trying to explain how you can avoid using thread locals and still maintain performance. It’s unlikely that allocating an 8 KB buffer is a performance bottleneck in a real program if the task is not CPU bound (depending on the granularity of your tasks) - but 2M tasks running simultaneously would require 16 GB of memory, not including the stacks. </div><div dir="ltr"><br></div><div dir="ltr">You cannot simply use the thread-per-task model without an understanding of the CPU, I/O, and memory footprints of your tasks and then configure appropriately. </div><div dir="ltr"><br><blockquote type="cite">On Jan 23, 2026, at 8:10 AM, Jianbin Chen <jianbin@apache.org> wrote:<br><br></blockquote></div><blockquote type="cite"><div dir="ltr"><div dir="auto">I'm sorry, Robert—perhaps I didn't explain my example clearly enough. Here's the code in question:<div dir="auto"><br></div><div dir="auto">```java</div><div dir="auto">Executor executor2 = new ThreadPoolExecutor(</div><div dir="auto"> 200,</div><div dir="auto"> Integer.MAX_VALUE,</div><div dir="auto"> 0L,</div><div dir="auto"> java.util.concurrent.TimeUnit.SECONDS,</div><div dir="auto"> new SynchronousQueue<>(),</div><div dir="auto"> Thread.ofVirtual().name("test-threadpool-", 1).factory()</div><div dir="auto">);</div><div dir="auto">```</div><div dir="auto"><br></div><div dir="auto">In this example, the pooled virtual threads don't implement any backpressure mechanism; they simply maintain a core pool of 200 virtual threads. 
Given that the queue is a `SynchronousQueue` and the maximum pool size is set to `Integer.MAX_VALUE`, once the concurrent tasks exceed 200, its behavior becomes identical to that of non-pooled virtual threads.</div><div dir="auto"><br></div><div dir="auto">From my perspective, this example demonstrates that the benefits of pooling virtual threads outweigh those of creating a new virtual thread for every single task. In IO-bound scenarios, the virtual threads are directly reused rather than being recreated each time, and the memory footprint of virtual threads is far smaller than that of platform threads (whose stack size is controlled by the `-Xss` flag). Additionally, with pooled virtual threads, the 8KB `byte[]` cache I mentioned earlier (stored in `ThreadLocal`) can also be reused, which further reduces overall memory usage—wouldn't you agree?<br><br><div data-smartmail="gmail_signature" dir="auto">Best Regards.<br>Jianbin Chen, github-id: funky-eyes </div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">Robert Engels <<a href="mailto:rengels@ix.netcom.com">rengels@ix.netcom.com</a>> wrote on Fri, Jan 23, 2026 at 21:52:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="auto"><div dir="ltr"></div><div dir="ltr">Because VT are so efficient to create, without any back pressure they will all be created and running at essentially the same time (dramatically raising the amount of memory in use) - versus with a pool of size N you will have at most N running at once. In a REAL WORLD application there are often external limiters (like the number of TCP connections) that provide a limit. </div><div dir="ltr"><br></div><div dir="ltr">If your tasks are purely CPU bound you should probably be using a capped thread pool of platform threads, as it makes no sense to have more threads than available cores. 
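As a rough sketch of that capped-pool setup (the task body and counts here are illustrative, and the try-with-resources on the executor needs Java 19+):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class CpuBoundPool {
    // Runs `tasks` CPU-bound jobs on a platform-thread pool capped at the
    // core count, and returns how many of them completed.
    static int runAll(int tasks) {
        int cores = Runtime.getRuntime().availableProcessors();
        AtomicInteger done = new AtomicInteger();
        try (ExecutorService pool = Executors.newFixedThreadPool(cores)) {
            for (int i = 0; i < tasks; i++) {
                final int seed = i;
                pool.execute(() -> {
                    long sum = 0;
                    for (long j = 0; j < 100_000; j++) sum += j ^ seed; // busy work
                    if (sum >= 0) done.incrementAndGet();              // always true here
                });
            }
        } // close() waits for all submitted tasks (ExecutorService is AutoCloseable since Java 19)
        return done.get();
    }

    public static void main(String[] args) {
        System.out.println("completed: " + runAll(16));
    }
}
```

With more threads than cores, CPU-bound tasks would only add context-switch overhead without finishing any sooner.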
</div><div dir="ltr"><br></div><div dir="ltr"><br></div><div dir="ltr"><br><blockquote type="cite">On Jan 23, 2026, at 7:42 AM, Jianbin Chen <<a href="mailto:jianbin@apache.org" target="_blank" rel="noreferrer">jianbin@apache.org</a>> wrote:<br><br></blockquote></div><blockquote type="cite"><div dir="ltr"><div dir="auto">The question is why I need to use a semaphore to control the number of concurrently running tasks. In my particular example, the goal is simply to keep the concurrency level the same across different thread pool implementations so I can fairly compare which one completes all the tasks faster. This isn't solely about memory consumption—purely from a **performance** perspective (e.g., total throughput or wall-clock time to finish the workload), the same number of concurrent tasks completes noticeably faster when using pooled virtual threads.<div dir="auto"><br></div><div dir="auto">My email probably didn't explain this clearly enough. In reality, I have two main questions:</div><div dir="auto"><br></div><div dir="auto">1. When a third-party library uses `ThreadLocal` as a cache/pool (e.g., to hold expensive reusable objects like connections, formatters, or parsers), is switching to a **pooled virtual thread executor** the only viable solution—assuming we cannot modify the third-party library code?</div><div dir="auto"><br></div><div dir="auto">2. When running the exact same number of concurrent tasks, pooled virtual threads deliver better performance.</div><div dir="auto"><br></div><div dir="auto">Both questions point toward the same conclusion: for an application originally built around a traditional platform thread pool, after upgrading to JDK 21/25, moving to a **pooled virtual threads** approach is generally superior to simply using non-pooled (unbounded) virtual threads. 
</div><div dir="auto"><br></div><div dir="auto">If any part of this reasoning or conclusion is mistaken, I would really appreciate being corrected — thank you very much in advance for any feedback or different experiences you can share!<br><br><div data-smartmail="gmail_signature" dir="auto">Best Regards.<br>Jianbin Chen, github-id: funky-eyes </div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">robert engels <<a href="mailto:robaho@me.com" target="_blank" rel="noreferrer">robaho@me.com</a>> wrote on Fri, Jan 23, 2026 at 20:58:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="auto"><div dir="ltr"></div><div dir="ltr">Exactly, this is your problem. The total number of tasks will all be running at once in the thread-per-task model. </div><div dir="ltr"><br><blockquote type="cite">On Jan 23, 2026, at 6:49 AM, Jianbin Chen <<a href="mailto:jianbin@apache.org" rel="noreferrer noreferrer" target="_blank">jianbin@apache.org</a>> wrote:<br><br></blockquote></div><blockquote type="cite"><div dir="ltr"><div dir="auto">Hi Robert,<div dir="auto"><br></div><div dir="auto">Thank you, but I'm a bit confused. In the example above, I only set the core pool size to 200 virtual threads, but for the specific test case we’re talking about, the concurrency isn’t actually being limited by the pool size at all. 
Since the maximum thread count is Integer.MAX_VALUE and it’s using a SynchronousQueue, tasks are handed off immediately and a new thread gets created to run them right away anyway.<br><br><div data-smartmail="gmail_signature" dir="auto">Best Regards.<br>Jianbin Chen, github-id: funky-eyes </div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">robert engels <<a href="mailto:robaho@me.com" rel="noreferrer noreferrer" target="_blank">robaho@me.com</a>> wrote on Fri, Jan 23, 2026 at 20:28:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="auto"><div dir="ltr"></div><div dir="ltr">Try using a semaphore to limit the maximum number of tasks in progress at any one time - that is what is causing your memory spike. Think of it this way: since VT threads are so cheap to create, you are essentially creating them all at once - making the working set size equal to the maximum. So you have N * WSS, whereas in the other you have POOLSIZE * WSS. 
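A minimal sketch of that semaphore approach (the limit, task count, and sleep duration are illustrative; it also tracks the observed concurrency so you can see the cap hold):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

public class SemaphoreLimitedVirtualThreads {
    // Runs `tasks` jobs on unpooled virtual threads, but admits at most
    // `limit` at a time; returns the highest concurrency actually observed.
    static int run(int tasks, int limit) throws InterruptedException {
        Semaphore permits = new Semaphore(limit);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger maxInFlight = new AtomicInteger();
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < tasks; i++) {
                permits.acquire();                      // back pressure before submission
                executor.execute(() -> {
                    int now = inFlight.incrementAndGet();
                    maxInFlight.accumulateAndGet(now, Math::max);
                    try {
                        Thread.sleep(5);                // simulate I/O wait
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        inFlight.decrementAndGet();
                        permits.release();              // admit the next waiting task
                    }
                });
            }
        } // close() waits for all submitted tasks to finish
        return maxInFlight.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("max concurrent: " + run(1_000, 50));
    }
}
```

This caps the working set at roughly limit * WSS while still creating a fresh virtual thread per task.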
</div><div dir="ltr"><br><div dir="ltr"></div><blockquote type="cite">On Jan 23, 2026, at 4:14 AM, Jianbin Chen <<a href="mailto:jianbin@apache.org" rel="noreferrer noreferrer noreferrer" target="_blank">jianbin@apache.org</a>> wrote:<br><br></blockquote></div><blockquote type="cite"><div dir="ltr"><div dir="auto"><div>Hi Alan,<div dir="auto"><br></div><div dir="auto">Thanks for your reply and for mentioning JEP 444.</div><div dir="auto">I’ve gone through the guidance in JEP 444 and have some understanding of it — which is exactly why I’m feeling a bit puzzled in practice and would really like to hear your thoughts.</div><div dir="auto"><br></div><div dir="auto">Background — ThreadLocal example (Aerospike)</div><div dir="auto">```java</div><div dir="auto">private static final ThreadLocal<byte[]> BufferThreadLocal = new ThreadLocal<byte[]>() {</div><div dir="auto"> @Override</div><div dir="auto"> protected byte[] initialValue() {</div><div dir="auto"> return new byte[DefaultBufferSize];</div><div dir="auto"> }</div><div dir="auto">};</div><div dir="auto">```</div><div dir="auto">This Aerospike code allocates a default 8KB byte[] whenever a new thread is created and stores it in a ThreadLocal for per-thread caching.</div><div dir="auto"><br></div><div dir="auto">My concern</div><div dir="auto">- With a traditional platform-thread pool, those ThreadLocal byte[] instances are effectively reused because threads are long-lived and pooled.</div><div dir="auto">- If we switch to creating a brand-new virtual thread per task (no pooling), each virtual thread gets its own fresh ThreadLocal byte[], which leads to many short-lived 8KB allocations.</div><div dir="auto">- That raises allocation rate and GC pressure (despite collectors like ZGC), because ThreadLocal caches aren’t reused when threads are ephemeral.</div><div dir="auto"><br></div><div dir="auto">So my question is: for applications originally designed around platform-thread pools, wouldn’t partially pooling virtual 
threads be beneficial? For example, Tomcat’s default max threads is 200 — if I keep a pool of 200 virtual threads, then when load exceeds that core size, a SynchronousQueue will naturally cause new virtual threads to be created on demand. This seems to preserve the behavior that ThreadLocal-based libraries expect, without losing the ability to expand under spikes. Since virtual threads are very lightweight, pooling a reasonable number (e.g., 200) seems to have negligible memory downside while retaining ThreadLocal cache effectiveness.</div><div dir="auto"><br></div><div dir="auto">Empirical test I ran</div><div dir="auto">(I ran a microbenchmark comparing an unpooled per-task virtual-thread executor and a ThreadPoolExecutor that keeps 200 core virtual threads.)</div><div dir="auto"><br></div><div dir="auto">```java</div><div dir="auto">public static void main(String[] args) throws InterruptedException {</div><div dir="auto"> Executor executor = Executors.newThreadPerTaskExecutor(Thread.ofVirtual().name("test-", 1).factory());</div><div dir="auto"> Executor executor2 = new ThreadPoolExecutor(</div><div dir="auto"> 200,</div><div dir="auto"> Integer.MAX_VALUE,</div><div dir="auto"> 0L,</div><div dir="auto"> java.util.concurrent.TimeUnit.SECONDS,</div><div dir="auto"> new SynchronousQueue<>(),</div><div dir="auto"> Thread.ofVirtual().name("test-threadpool-", 1).factory()</div><div dir="auto"> );</div><div dir="auto"><br></div><div dir="auto"> // Warm-up</div><div dir="auto"> for (int i = 0; i < 10100; i++) {</div><div dir="auto"> executor.execute(() -> {</div><div dir="auto"> // simulate I/O wait</div><div dir="auto"> try { Thread.sleep(100); } catch (InterruptedException e) { throw new RuntimeException(e); }</div><div dir="auto"> });</div><div dir="auto"> executor2.execute(() -> {</div><div dir="auto"> // simulate I/O wait</div><div dir="auto"> try { Thread.sleep(100); } catch (InterruptedException e) { throw new RuntimeException(e); }</div><div dir="auto"> 
});</div><div dir="auto"> }</div><div dir="auto"><br></div><div dir="auto"> // Ensure JIT + warm-up complete</div><div dir="auto"> Thread.sleep(5000);</div><div dir="auto"><br></div><div dir="auto"> long start = System.currentTimeMillis();</div><div dir="auto"> CountDownLatch countDownLatch = new CountDownLatch(50000);</div><div dir="auto"> for (int i = 0; i < 50000; i++) {</div><div dir="auto"> executor.execute(() -> {</div><div dir="auto"> try { Thread.sleep(100); countDownLatch.countDown(); } catch (InterruptedException e) { throw new RuntimeException(e); }</div><div dir="auto"> });</div><div dir="auto"> }</div><div dir="auto"> countDownLatch.await();</div><div dir="auto"> System.out.println("thread time: " + (System.currentTimeMillis() - start) + " ms");</div><div dir="auto"><br></div><div dir="auto"> start = System.currentTimeMillis();</div><div dir="auto"> CountDownLatch countDownLatch2 = new CountDownLatch(50000);</div><div dir="auto"> for (int i = 0; i < 50000; i++) {</div><div dir="auto"> executor2.execute(() -> {</div><div dir="auto"> try { Thread.sleep(100); countDownLatch2.countDown(); } catch (InterruptedException e) { throw new RuntimeException(e); }</div><div dir="auto"> });</div><div dir="auto"> }</div><div dir="auto"> countDownLatch2.await();</div><div dir="auto"> System.out.println("thread pool time: " + (System.currentTimeMillis() - start) + " ms");</div><div dir="auto">}</div><div dir="auto">```</div><div dir="auto"><br></div><div dir="auto">Result summary</div><div dir="auto">- In my runs, the pooled virtual-thread executor (executor2) performed better than the unpooled per-task virtual-thread executor.</div><div dir="auto">- Even when I increased load by 10x or 100x, the pooled virtual-thread executor still showed better performance.</div><div dir="auto">- In realistic workloads, it seems pooling some virtual threads reduces allocation/GC overhead and improves throughput compared to strictly unpooled virtual threads.</div><div 
dir="auto"><br></div><div dir="auto">Final thought / request for feedback</div><div dir="auto">- From my perspective, for systems originally tuned for platform-thread pools, partially pooling virtual threads seems to have no obvious downside and can restore ThreadLocal cache effectiveness used by many third-party libraries.</div><div dir="auto">- If I’ve misunderstood JEP 444 recommendations, virtual-thread semantics, or ThreadLocal behavior, please point out what I’m missing. I’d appreciate your guidance.<br><br><div data-smartmail="gmail_signature" dir="auto">Best Regards.<br>Jianbin Chen, github-id: funky-eyes </div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">Alan Bateman <<a href="mailto:alan.bateman@oracle.com" rel="noreferrer noreferrer noreferrer" target="_blank">alan.bateman@oracle.com</a>> wrote on Fri, Jan 23, 2026 at 17:27:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On 23/01/2026 07:30, Jianbin Chen wrote:<br>
> :<br>
><br>
> So my question is:<br>
><br>
> **In scenarios where third-party libraries heavily rely on ThreadLocal <br>
> for caching / buffering (and we cannot change those libraries to use <br>
> object pools instead), is explicitly pooling virtual threads (using a <br>
> ThreadPoolExecutor with virtual thread factory) considered a <br>
> recommended / acceptable workaround?**<br>
><br>
> Or are there better / more idiomatic ways to handle this kind of <br>
> compatibility issue with legacy ThreadLocal-based libraries when <br>
> migrating to virtual threads?<br>
><br>
> I have already opened a related discussion in the Dubbo project (since <br>
> Dubbo is one of the libraries affected in our stack):<br>
><br>
> <a href="https://github.com/apache/dubbo/issues/16042" rel="noreferrer noreferrer noreferrer noreferrer noreferrer" target="_blank">https://github.com/apache/dubbo/issues/16042</a><br>
><br>
> Would love to hear your thoughts — especially from people who have <br>
> experience running large-scale virtual-thread-based services with <br>
> mixed third-party dependencies.<br>
><br>
<br>
The guidelines that we put in JEP 444 [1] are to not pool virtual threads <br>
and to avoid caching costly resources in thread locals. Virtual threads <br>
support thread locals of course but that is not useful when some library <br>
is looking to share a costly resource between tasks that run on the same <br>
thread in a thread pool.<br>
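(A hypothetical sketch of what sharing a costly resource between tasks rather than threads could look like — this `BufferPool` is my own illustration, not any library's API:)

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical replacement for a ThreadLocal<byte[]> cache: buffers are
// borrowed per task and returned afterwards, so any number of short-lived
// virtual threads can share a fixed set of them.
public class BufferPool {
    private static final int BUFFER_SIZE = 8 * 1024;
    private final BlockingQueue<byte[]> pool;

    public BufferPool(int capacity) {
        pool = new ArrayBlockingQueue<>(capacity);
        for (int i = 0; i < capacity; i++) pool.add(new byte[BUFFER_SIZE]);
    }

    public byte[] borrow() throws InterruptedException {
        return pool.take();          // blocks (cheaply, on a virtual thread)
    }

    public void giveBack(byte[] buf) {
        pool.offer(buf);             // drop on overflow rather than block
    }

    public static void main(String[] args) throws InterruptedException {
        BufferPool buffers = new BufferPool(200);
        byte[] buf = buffers.borrow();
        try {
            buf[0] = 1;              // use the scratch buffer
        } finally {
            buffers.giveBack(buf);   // return it for the next task
        }
        System.out.println("buffer size: " + buf.length);
    }
}
```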
<br>
I don't know anything about Aerospike but working with the maintainers <br>
of that library to re-work its buffer management seems like the right <br>
course of action here. Your mail says "byte buffers". If this is <br>
ByteBuffer it might be that they are caching direct buffers as they are <br>
expensive to create (and managed by the GC). Maybe they could look at <br>
using MemorySegment (it's easy to get a ByteBuffer view of a memory <br>
segment) and allocate from an arena that better matches the lifecycle.<br>
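(A small sketch of that idea, assuming Java 22+ where the FFM API is final; the names are illustrative:)

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.ByteBuffer;

public class ArenaBufferSketch {
    // Allocates an 8 KB task-scoped buffer from a confined arena and
    // round-trips an int through its ByteBuffer view.
    static int roundTrip(int value) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment segment = arena.allocate(8 * 1024);
            ByteBuffer buffer = segment.asByteBuffer(); // a view, no copy
            buffer.putInt(0, value);
            return buffer.getInt(0);
        } // memory freed deterministically here, off the GC heap
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(42));
    }
}
```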
<br>
Hopefully others will share their experiences with migration as it is <br>
indeed challenging to migrate code developed for thread pools to work <br>
efficiently on virtual threads where there is a 1-1 relationship between <br>
the task to execute and the thread.<br>
<br>
-Alan<br>
<br>
[1] <a href="https://openjdk.org/jeps/444#Thread-local-variables" rel="noreferrer noreferrer noreferrer noreferrer noreferrer" target="_blank">https://openjdk.org/jeps/444#Thread-local-variables</a><br>
</blockquote></div></div></div>
</div></blockquote></div></blockquote></div>
</div></blockquote></div></blockquote></div>
</div></blockquote></div></blockquote></div>
</div></blockquote></body></html>