Loom and high performance networking

Robert Engels robaho at icloud.com
Mon Aug 12 19:53:06 UTC 2024


Using VT pollers yields about the same performance and still 100% CPU utilization. Using VT pollers with reduced parallelism is even slower, but it improves the worst-case performance and lowers the variance considerably:
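For context, selecting VT pollers and reducing scheduler parallelism in the loom EA builds is done with system properties along these lines (a sketch only: jdk.pollerMode is an internal, unsupported property read by PollerProvider, and the server jar name here is invented):

```shell
# Sketch of server-side JVM flags (loom EA builds; properties are internal
# and may change between builds).
#   jdk.pollerMode: 1 = platform-thread pollers, 2 = virtual-thread pollers
#   jdk.virtualThreadScheduler.parallelism: number of carrier threads
java -Djdk.pollerMode=2 \
     -Djdk.virtualThreadScheduler.parallelism=3 \
     -jar http-server.jar
```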

VT pollers:

robertengels at macmini go-wrk % ./go-wrk -c 1000 -d 30 -T 10000 http://imac:8080/plaintext
Running 30s test @ http://imac:8080/plaintext
  1000 goroutine(s) running concurrently
3545553 requests in 29.463345004s, 422.66MB read
Requests/sec:		120337.76
Transfer/sec:		14.35MB
Overall Requests/sec:	115846.04
Overall Transfer/sec:	13.81MB
Fastest Request:	144µs
Avg Req Time:		8.309ms
Slowest Request:	1.418687s
Number of Errors:	2200
Error Counts:		connection reset by peer=2174,net/http: timeout awaiting response headers=26
10%:			2.33ms
50%:			2.869ms
75%:			3.004ms
99%:			3.093ms
99.9%:			3.096ms
99.9999%:		3.097ms
99.99999%:		3.097ms
stddev:			34.259ms

VT pollers with parallelism=3:

robertengels at macmini go-wrk % ./go-wrk -c 1000 -d 30 -T 10000 http://imac:8080/plaintext
Running 30s test @ http://imac:8080/plaintext
  1000 goroutine(s) running concurrently
3351695 requests in 29.724461752s, 399.55MB read
Requests/sec:		112758.81
Transfer/sec:		13.44MB
Overall Requests/sec:	110119.93
Overall Transfer/sec:	13.13MB
Fastest Request:	145µs
Avg Req Time:		8.868ms
Slowest Request:	411.775ms
Number of Errors:	2308
Error Counts:		connection reset by peer=2308
10%:			3.429ms
50%:			3.867ms
75%:			3.975ms
99%:			4.076ms
99.9%:			4.08ms
99.9999%:		4.081ms
99.99999%:		4.081ms
stddev:			5.346ms

> On Aug 12, 2024, at 2:34 PM, Robert Engels <robaho at icloud.com> wrote:
> 
> I tested previously using VT pollers and saw no difference in performance - I expect it would be worse, since the poller would take even longer to run, being behind all of the other VT tasks. I think reducing the parallelism with VT pollers would be worse still.
> 
>> On Aug 12, 2024, at 2:32 PM, Robert Engels <robaho at icloud.com> wrote:
>> 
>> Also, the JDK poller behavior may be specific to OSX, but when I look at the default poller mode code, it uses system threads: https://github.com/openjdk/loom/blob/cfcc05f57f0b5f212493da3a2abbac6abed2a48e/src/java.base/share/classes/sun/nio/ch/PollerProvider.java#L48
>> 
>> I think for Linux it is changed to use VT threads: https://github.com/openjdk/loom/blob/cfcc05f57f0b5f212493da3a2abbac6abed2a48e/src/java.base/linux/classes/sun/nio/ch/DefaultPollerProvider.java#L38
>> 
>>> On Aug 12, 2024, at 2:29 PM, Robert Engels <robaho at icloud.com> wrote:
>>> 
>>> Sorry, I should have included that. I am using (build 24-loom+3-33) - which is still using Pollers.
>>> 
>>> As to sizing the pool - it would need to be based on the current connection count and activity (and the configured number of pollers, etc.) - which I think for many (most?) systems would be hard to predict.
>>> 
>>> This is why I contend that it would be better to lower the native priority of the carrier threads - I think it solves the sizing problem nicely (pardon the pun).
>>> 
>>> If the available CPU is being exhausted by platform threads, then most likely the VT threads shouldn’t run at all (especially since they are not timesliced) - the system is already in an overloaded state - and this would accomplish that.
>>> 
>>> In this particular case it is a very high load, so unless I am misunderstanding you, I don’t think the scheduler is prioritizing correctly - since lowering the parallelism improves the situation. 
>>> 
>>>> On Aug 12, 2024, at 2:17 PM, Ron Pressler <ron.pressler at oracle.com> wrote:
>>>> 
>>>> What JDK are you using? I believe that as of JDK 22 there are no longer poller threads (polling is done by virtual threads running under the same scheduler). If you haven’t tried with JDK 22, try it; you may get better results.
>>>> 
>>>> There is an inherent tension between scheduling under high loads and scheduling under lower loads. The problem is that to make a scheduler highly efficient at high loads you need to minimise the coordination among threads, which means that under-utilisation cannot be easily detected (it’s easy for a thread to unilaterally detect that it’s under heavy load when its submission queue grows, but detecting that fewer threads are needed requires coordination).
>>>> 
>>>> The virtual thread scheduler is a work-stealing scheduler that prioritises higher loads over lower loads. In the future we may allow plugging in schedulers that are more adaptive to changing loads in exchange for being less efficient under high loads. Note that sizing the virtual thread scheduler is no harder than sizing a thread pool. The difference is that people are more used to more adaptive but less efficient thread pools.
>>>> 
>>>> — Ron
>>>> 
>>>> 
>>>>> On 12 Aug 2024, at 20:03, robert engels <robaho at icloud.com> wrote:
>>>>> 
>>>>> Hi.  
>>>>> 
>>>>> I believe I have discovered what is essentially a priority inversion problem in the Loom network stack.
>>>>> 
>>>>> I have been comparing a NIO-based HTTP server framework with one using VT.
>>>>> 
>>>>> With the Loom defaults, the VT framework consistently uses significantly more CPU and achieves lower throughput and worse overall performance.
>>>>> 
>>>>> By lowering the parallelism, the VT framework exceeds the NIO framework in throughput and overall performance.
>>>>> 
>>>>> My hypothesis, based on this research, is that if you create as many carrier threads as CPUs, they compete with the network poller threads. By reducing the number of carrier threads, the poller can run. If the poller can’t run, then many/all of the carrier threads will eventually park waiting for a runnable task, but there won’t be one until the poller runs - which is why the poller must have priority over the carrier threads.
>>>>> 
>>>>> I believe the best solution would be to lower the native priority of the carrier threads, since explicitly configuring the number of carrier threads accurately will be nearly impossible for varied workloads. I suspect this would also help GC be more deterministic.
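As a rough illustration of the idea only: the virtual-thread scheduler is not pluggable, so this hypothetical sketch uses a plain ForkJoinPool with a worker factory that lowers thread priority. The class and all names are invented, and Thread.setPriority is merely a hint to the OS, so this shows the shape of the proposal, not an implementation of it.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinWorkerThread;

// Hypothetical sketch of the proposal: run the scheduler's workers at reduced
// priority so poller/system threads win the CPU when the host is saturated.
public class LowPriorityWorkers {
    static ForkJoinPool create(int parallelism) {
        return new ForkJoinPool(
                parallelism,
                pool -> {
                    ForkJoinWorkerThread t =
                            ForkJoinPool.defaultForkJoinWorkerThreadFactory.newThread(pool);
                    t.setPriority(Thread.MIN_PRIORITY); // deprioritize carrier-like workers
                    return t;
                },
                null,   // default uncaught-exception handler
                true);  // asyncMode = FIFO, like the VT scheduler
    }

    public static void main(String[] args) {
        ForkJoinPool pool = create(3);
        int p = pool.submit(
                (Callable<Integer>) () -> Thread.currentThread().getPriority()).join();
        System.out.println("worker priority = " + p);
        pool.shutdown();
    }
}
```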
>>>>> 
>>>>> Using default parallelism:
>>>>> 
>>>>> robertengels at macmini go-wrk % ./go-wrk -c 1000 -d 30 -T 10000 http://imac:8080/plaintext
>>>>> Running 30s test @ http://imac:8080/plaintext
>>>>> 1000 goroutine(s) running concurrently
>>>>> 3557027 requests in 29.525311463s, 424.03MB read
>>>>> Requests/sec: 120473.82
>>>>> Transfer/sec: 14.36MB
>>>>> Overall Requests/sec: 116021.08
>>>>> Overall Transfer/sec: 13.83MB
>>>>> Fastest Request: 112µs
>>>>> Avg Req Time: 8.3ms
>>>>> Slowest Request: 839.967ms
>>>>> Number of Errors: 2465
>>>>> Error Counts: broken pipe=2,connection reset by peer=2451,net/http: timeout awaiting response headers=12
>>>>> 10%: 2.102ms
>>>>> 50%: 2.958ms
>>>>> 75%: 3.108ms
>>>>> 99%: 3.198ms
>>>>> 99.9%: 3.201ms
>>>>> 99.9999%: 3.201ms
>>>>> 99.99999%: 3.201ms
>>>>> stddev: 32.52ms
>>>>> 
>>>>> and using reduced parallelism of 3:
>>>>> 
>>>>> robertengels at macmini go-wrk % ./go-wrk -c 1000 -d 30 -T 10000 http://imac:8080/plaintext
>>>>> Running 30s test @ http://imac:8080/plaintext
>>>>> 1000 goroutine(s) running concurrently
>>>>> 4059418 requests in 29.092649689s, 483.92MB read
>>>>> Requests/sec: 139534.14
>>>>> Transfer/sec: 16.63MB
>>>>> Overall Requests/sec: 132608.44
>>>>> Overall Transfer/sec: 15.81MB
>>>>> Fastest Request: 115µs
>>>>> Avg Req Time: 7.166ms
>>>>> Slowest Request: 811.999ms
>>>>> Number of Errors: 2361
>>>>> Error Counts: net/http: timeout awaiting response headers=51,connection reset by peer=2310
>>>>> 10%: 1.899ms
>>>>> 50%: 2.383ms
>>>>> 75%: 2.478ms
>>>>> 99%: 2.541ms
>>>>> 99.9%: 2.543ms
>>>>> 99.9999%: 2.544ms
>>>>> 99.99999%: 2.544ms
>>>>> stddev: 32.88ms
>>>>> 
>>>>> More importantly, the reduced-parallelism run has a CPU idle percentage of 30% (matching the NIO framework), whereas the default-parallelism run has near-0% idle (due to scheduler thrashing).
>>>>> 
>>>>> The attached JFR screenshot (I have also attached the JFR captures) tells the story. #2 is the VT with default parallelism. #3 is the NIO based framework, and #4 is VT with reduced parallelism. #2 clearly shows the thrashing that is occurring with threads parking and unparking and the scheduler waiting for work.
>>>>> 
>>>>> <PastedGraphic-1.png> 
>>>>> 
>>>>> 
>>>>> <profile2.jfr><profile3.jfr><profile4.jfr>
>>>> 
>>> 
>> 
> 
