Loom and high performance networking
Robert Engels
robaho at icloud.com
Wed Aug 14 16:01:17 UTC 2024
Also, for comparison purposes, using platform threads instead of virtual threads with default parallelism shows much better average/tail latency and slightly lower throughput.
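For context, here is a minimal sketch of the two setups being compared. This is not the actual benchmark server used in these runs; the port, response body, and the useVirtualThreads property are illustrative assumptions. The only difference between the two modes is which ExecutorService hands out threads; with virtual threads, the default scheduler's parallelism equals the number of available processors unless overridden at startup.

import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PlaintextServer {
    public static void main(String[] args) throws Exception {
        // Illustrative switch between the two models under test:
        // a platform thread per connection vs. a virtual thread per connection.
        boolean useVirtual = Boolean.getBoolean("useVirtualThreads");
        ExecutorService pool = useVirtual
                ? Executors.newVirtualThreadPerTaskExecutor()
                : Executors.newThreadPerTaskExecutor(Thread.ofPlatform().factory());

        byte[] response = ("HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n"
                + "Content-Length: 13\r\n\r\nHello, World!").getBytes(StandardCharsets.US_ASCII);

        try (ServerSocket server = new ServerSocket(8080)) {
            while (true) {
                Socket socket = server.accept();
                pool.submit(() -> {
                    // A real plaintext benchmark handler parses requests and keeps the
                    // connection alive; this sketch reads once and writes one response.
                    try (socket; OutputStream out = socket.getOutputStream()) {
                        socket.getInputStream().read(new byte[1024]);
                        out.write(response);
                        out.flush();
                    } catch (Exception ignored) {
                        // ignore per-connection errors in this sketch
                    }
                });
            }
        }
    }
}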
Using a platform thread per connection:
robertengels at macmini go-wrk % wrk -H 'Host: imac' -H 'Accept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7' -H 'Connection: keep-alive' --latency -d 20 -c 1000 --timeout 8 -t 2 http://imac:8080/plaintext
Running 20s test @ http://imac:8080/plaintext
  2 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    28.09ms   70.26ms  818.88ms   92.55%
    Req/Sec    53.96k     4.02k    63.58k    71.25%
  Latency Distribution
     50%    8.51ms
     75%    9.44ms
     90%   41.65ms
     99%  378.31ms
  2148390 requests in 20.03s, 290.94MB read
Requests/sec: 107280.14
Transfer/sec:     14.53MB
Using virtual threads with default parallelism:
robertengels at macmini go-wrk % wrk -H 'Host: imac' -H 'Accept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7' -H 'Connection: keep-alive' --latency -d 20 -c 1000 --timeout 8 -t 2 http://imac:8080/plaintext
Running 20s test @ http://imac:8080/plaintext
  2 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    67.63ms  146.27ms    2.42s    87.57%
    Req/Sec    55.05k     4.93k    70.66k    73.50%
  Latency Distribution
     50%    5.63ms
     75%   48.40ms
     90%  262.75ms
     99%  674.56ms
  2192116 requests in 20.03s, 296.86MB read
Requests/sec: 109430.53
Transfer/sec:     14.82MB
> On Aug 14, 2024, at 10:47 AM, Robert Engels <robaho at icloud.com> wrote:
>
> I was able to correct the errors: the initial accepts were failing, and adjusting kern.ipc.somaxconn appropriately resolved it. So, back to the problem. I have a new theory that maybe someone can validate.
>
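For reference, the macOS tuning implied above is roughly the following; the value 1024 is illustrative, not the setting actually used:

sysctl kern.ipc.somaxconn               # inspect the current listen-backlog limit
sudo sysctl -w kern.ipc.somaxconn=1024  # raise it for the benchmark run

The backlog requested when the listening socket is bound (e.g. the backlog argument to ServerSocket or ServerSocketChannel.bind) also caps the accept queue, so it may need to be raised alongside the sysctl.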
> I theorize that without reducing the parallelism, all cores are in use, which means any GC activity has no spare core to run on and competes with the carrier threads, introducing latency. Since this is a ping/pong RTT-style test, that delay is enough to make the processors go into a park/unpark cycle.
>
> Which leads me to the question: does Java need a better (or dynamic) way of determining the optimum parallelism based on load? It seems the only way to set the parallelism is at startup, and it remains constant thereafter.
>
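A sketch of that startup-only configuration, assuming the standard Loom scheduler system properties; server.jar is a placeholder for the actual benchmark server:

java -Djdk.virtualThreadScheduler.parallelism=3 -jar server.jar

jdk.virtualThreadScheduler.parallelism (and its companion jdk.virtualThreadScheduler.maxPoolSize) is read once when the default virtual-thread scheduler is created, which is why the parallelism cannot currently be adjusted in response to load. The poller mode mentioned later in the thread is likewise set on the command line, e.g. -Djdk.pollerMode=2, on builds that expose that internal property.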
> With default parallelism:
>
> robertengels at macmini go-wrk % wrk -H 'Host: imac' -H 'Accept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7' -H 'Connection: keep-alive' --latency -d 20 -c 1000 --timeout 8 -t 2 http://imac:8080/plaintext
> Running 20s test @ http://imac:8080/plaintext
>   2 threads and 1000 connections
>   Thread Stats   Avg      Stdev     Max   +/- Stdev
>     Latency    83.98ms  199.72ms    4.22s    89.05%
>     Req/Sec    51.75k     5.30k    61.84k    75.75%
>   Latency Distribution
>      50%    5.29ms
>      75%   65.42ms
>      90%  301.31ms
>      99%  788.27ms
>   2059873 requests in 20.03s, 278.95MB read
> Requests/sec: 102854.15
> Transfer/sec:     13.93MB
>
> and with reduced parallelism (3):
>
> robertengels at macmini go-wrk % wrk -H 'Host: imac' -H 'Accept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7' -H 'Connection: keep-alive' --latency -d 20 -c 1000 --timeout 8 -t 2 http://imac:8080/plaintext
> Running 20s test @ http://imac:8080/plaintext
>   2 threads and 1000 connections
>   Thread Stats   Avg      Stdev     Max   +/- Stdev
>     Latency    10.08ms   13.10ms  408.99ms   96.65%
>     Req/Sec    56.10k     5.95k    65.77k    74.25%
>   Latency Distribution
>      50%    8.55ms
>      75%    9.86ms
>      90%   13.21ms
>      99%   45.54ms
>   2232236 requests in 20.02s, 302.29MB read
> Requests/sec: 111475.38
> Transfer/sec:     15.10MB
>
> Notice the huge difference in average latency, tail latency, and latency variance. The more-than-order-of-magnitude difference in max latency is very concerning.
>
>> On Aug 14, 2024, at 8:44 AM, robert engels <robaho at icloud.com> wrote:
>>
>> Hi Alan. I agree, and I’ll try to resolve this today. But what I am struggling with is why the parallelism is the only thing that matters; the poller mode and the number of pollers make no difference.
>>
>> If I lower the parallelism I get better performance, matching the NIO results (all systems have roughly the same error rate regardless of throughput).
>>
>>> On Aug 14, 2024, at 1:15 AM, Alan Bateman <Alan.Bateman at oracle.com> wrote:
>>>
>>>
>>>
>>> On 13/08/2024 16:34, robert engels wrote:
>>>> I did. It didn’t make any difference. I checked the thread dump as well and the extras were created.
>>>>
>>>> I’m surprised that lowering the priority didn’t help, so now I need to think about other options. It feels like, when the carriers can use all the cores, the poller is prevented from running: as if some sort of lock is being held by the carrier/virtual thread, so it thrashes around until it eventually gets a chance.
>>>>
>>> With pollerMode=2 the work is done on virtual threads, so I assume you don't see the master poller stealing CPU cycles.
>>>
>>> In any case, what you describe sounds a bit like bursty usage where all FJP workers are scanning for work before parking, and/or something else that may be macOS specific when running out of some resource. I suspect tuning the networking params to remove the errors/timeouts might make it a bit easier to study.
>>>
>>> -Alan
>>>
>>>
>