Performance of pooling virtual threads vs. semaphores

Attila Kelemen attila.kelemen85 at gmail.com
Thu May 30 12:24:45 UTC 2024


A small correction: "They only create 600 VT" should of course read 600
platform threads, not VTs, in the fork join pool case.

Also note that the scaling difference between option 3 and option 2 is
not large. If the problem were simply creating more VTs upfront, then we
would expect to see a similar increase in the difference between cases 3
and 2.

Attila Kelemen <attila.kelemen85 at gmail.com> wrote (on Thu, 30 May
2024, 14:20):

> They only create 600 VT, but they do create 1M queue entries for the
> executor, and the relative memory usage should be the same for the 10k-task
> and the 1M-task scenarios (both in bytes and in number of objects). I
> would love to see the result of this experiment with the Epsilon GC (given
> that the total memory usage should be manageable even for 1M tasks) to
> confirm or rule out the possibility of the GC scaling this noticeably poorly.
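>
> A minimal way to try that would be to rerun the benchmark with flags
> along these lines (the heap size and main class name here are just
> placeholders for whatever the benchmark actually uses):
>
>     java -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -Xmx16g \
>         BenchmarkMain
>
> Epsilon never reclaims memory, so the heap has to be sized to hold
> everything the run ever allocates; if the relative timings stay the
> same under it, the GC can be ruled out as the cause.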
>
> Robert Engels <rengels at ix.netcom.com> wrote (on Thu, 30 May 2024,
> 14:10):
>
>> That is what I pointed out - in scenario 2 you are creating 1M VT up
>> front. The other cases only create at most 600 VT or platform threads.
>>
>> The peak memory usage in scenario 2 is much, much higher.
>>
>> On May 30, 2024, at 7:07 AM, Attila Kelemen <attila.kelemen85 at gmail.com>
>> wrote:
>>
>>
>> The additional work the VTs have to do is understandable, but I don't
>> see how it explains these measurements: with 10k tasks VT wins over
>> FJP, yet with 1M tasks VT loses to FJP. What is the source of the
>> scaling difference, when there are still only 128 carriers and 600
>> concurrent threads in both cases? If this were merely more work, then I
>> would expect to see the same relative difference between FJP and VT at
>> 10k tasks and at 1M tasks. Just a wild, naive guess: could the GC scale
>> worse with that many VTs, or is that a stupid idea?
>>
>>
>>>
>>> If the concurrency for the virtual thread run is limited to the same
>>> value as the thread count in the thread pool runs, then you are
>>> unlikely to see a benefit. The increased CPU time probably isn't too
>>> surprising either. In the two runs with threads, the N tasks are
>>> queued once. In the virtual thread run, the tasks for the N virtual
>>> threads may be queued up to 4 times: once for the initial submit, once
>>> waiting for a semaphore permit, and twice for the two sleeps. Also,
>>> when CPU utilization is low (as I assume it is here), the FJP scan
>>> does tend to show up in profiles.
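>>>
>>> For reference, the structure being described looks roughly like this
>>> (an illustrative sketch only, not the actual benchmark code; the task
>>> count, the limit of 600, and the sleep lengths are placeholders):
>>>
>>>     import java.time.Duration;
>>>     import java.util.concurrent.ExecutorService;
>>>     import java.util.concurrent.Executors;
>>>     import java.util.concurrent.Semaphore;
>>>
>>>     public class VtSemaphoreSketch {
>>>         public static void main(String[] args) {
>>>             int tasks = 1_000_000;                // N
>>>             Semaphore limit = new Semaphore(600); // concurrency limit
>>>             try (ExecutorService vt =
>>>                      Executors.newVirtualThreadPerTaskExecutor()) {
>>>                 for (int i = 0; i < tasks; i++) {
>>>                     // (1) queued once for the initial start
>>>                     vt.submit(() -> {
>>>                         // (2) may park here waiting for a permit
>>>                         limit.acquire();
>>>                         try {
>>>                             // (3) and (4): rescheduled after each sleep
>>>                             Thread.sleep(Duration.ofMillis(10));
>>>                             Thread.sleep(Duration.ofMillis(10));
>>>                         } finally {
>>>                             limit.release();
>>>                         }
>>>                         // returning a value makes the lambda a Callable,
>>>                         // so the checked InterruptedException is allowed
>>>                         return null;
>>>                     });
>>>                 }
>>>             } // close() waits for all submitted tasks to finish
>>>         }
>>>     }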
>>>
>>> Has Chi looked into increasing the concurrency so that it's not
>>> limited to 600? Concurrency may need to be limited at a finer grain in
>>> the "real world program", but maybe not by the number of threads.
>>>
>>> -Alan
>>>
>>>