Loom and high performance networking

Ron Pressler ron.pressler at oracle.com
Mon Aug 12 19:17:47 UTC 2024


What JDK are you using? I believe that as of JDK 22 there are no longer poller threads (polling is done by virtual threads running under the same scheduler). If you haven’t tried with JDK 22, try it; you may get better results.
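If it helps to double-check, a one-liner like the following (assuming you can add it to the server’s startup) shows exactly which runtime the benchmarked process is running on:

    // prints the exact runtime version of the server process, e.g. 22.0.2
    System.out.println("Running on JDK " + Runtime.version());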

There is an inherent tension between scheduling under high loads and scheduling under lower loads. The problem is that to make a scheduler highly efficient at high loads you need to minimise coordination among threads, which means that underutilisation cannot be easily detected (it’s easy for a thread to unilaterally detect that it is under heavy load when its submission queue grows, but detecting that fewer threads are needed requires coordination).

The virtual thread scheduler is a work-stealing scheduler that prioritises higher loads over lower loads. In the future we may allow plugging in schedulers that are more adaptive to changing loads in exchange for being less efficient under high loads. Note that sizing the virtual thread scheduler is no harder than sizing a thread pool. The difference is that people are more used to more adaptive but less efficient thread pools.
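Concretely, a fixed scheduler size is set the same way you would size a fixed thread pool, via the documented jdk.virtualThreadScheduler.parallelism system property (jdk.virtualThreadScheduler.maxPoolSize additionally bounds the total number of carrier threads). A sketch, with a placeholder jar name standing in for whatever launches the VT framework:

    java -Djdk.virtualThreadScheduler.parallelism=3 \
         -jar vt-server.jar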

— Ron


> On 12 Aug 2024, at 20:03, robert engels <robaho at icloud.com> wrote:
> 
> Hi.  
> 
> I believe I have discovered what is essentially a priority inversion problem in the Loom network stack.
> 
> I have been comparing an NIO-based HTTP server framework with one using virtual threads (VT).
> 
> The VT framework, when using the Loom defaults, consistently uses significantly more CPU and achieves lower throughput and worse overall performance.
> 
> By lowering the parallelism, the VT framework exceeds the NIO framework in throughput and overall performance.
> 
> My research and hypothesis is that if you create as many carrier threads as CPUs, then they compete with the network poller threads. By reducing the number of carrier threads, the poller can run. If the poller can’t run, then many/all of the carrier threads will eventually park waiting for a runnable task, but there won’t be one until the poller runs - which is why the poller must have priority over the carrier threads.
> 
> I believe the best solution would be to lower the native priority of the carrier threads, since explicitly configuring the number of carrier threads accurately will be nearly impossible for varied workloads. I suspect this would also help GC be more deterministic.
> 
> Using default parallelism:
> 
> robertengels at macmini go-wrk % ./go-wrk -c 1000 -d 30 -T 10000 http://imac:8080/plaintext
> Running 30s test @ http://imac:8080/plaintext
>   1000 goroutine(s) running concurrently
> 3557027 requests in 29.525311463s, 424.03MB read
> Requests/sec: 120473.82
> Transfer/sec: 14.36MB
> Overall Requests/sec: 116021.08
> Overall Transfer/sec: 13.83MB
> Fastest Request: 112µs
> Avg Req Time: 8.3ms
> Slowest Request: 839.967ms
> Number of Errors: 2465
> Error Counts: broken pipe=2,connection reset by peer=2451,net/http: timeout awaiting response headers=12
> 10%: 2.102ms
> 50%: 2.958ms
> 75%: 3.108ms
> 99%: 3.198ms
> 99.9%: 3.201ms
> 99.9999%: 3.201ms
> 99.99999%: 3.201ms
> stddev: 32.52ms
> 
> and using reduced parallelism of 3:
> 
> robertengels at macmini go-wrk % ./go-wrk -c 1000 -d 30 -T 10000 http://imac:8080/plaintext
> Running 30s test @ http://imac:8080/plaintext
>   1000 goroutine(s) running concurrently
> 4059418 requests in 29.092649689s, 483.92MB read
> Requests/sec: 139534.14
> Transfer/sec: 16.63MB
> Overall Requests/sec: 132608.44
> Overall Transfer/sec: 15.81MB
> Fastest Request: 115µs
> Avg Req Time: 7.166ms
> Slowest Request: 811.999ms
> Number of Errors: 2361
> Error Counts: net/http: timeout awaiting response headers=51,connection reset by peer=2310
> 10%: 1.899ms
> 50%: 2.383ms
> 75%: 2.478ms
> 99%: 2.541ms
> 99.9%: 2.543ms
> 99.9999%: 2.544ms
> 99.99999%: 2.544ms
> stddev: 32.88ms
> 
> More importantly, the reduced-parallelism run has a CPU idle percentage of 30% (which matches the NIO framework), whereas the default-parallelism run is near 0% idle (due to scheduler thrashing).
> 
> The attached JFR screenshot (I have also attached the JFR captures) tells the story. #2 is the VT with default parallelism. #3 is the NIO based framework, and #4 is VT with reduced parallelism. #2 clearly shows the thrashing that is occurring with threads parking and unparking and the scheduler waiting for work.
> 
> <PastedGraphic-1.png> 
> 
> 
> <profile2.jfr><profile3.jfr><profile4.jfr>


