Loom and high performance networking
Robert Engels
robaho at icloud.com
Wed Aug 14 15:47:09 UTC 2024
I was able to correct the errors: the initial accepts were failing, and raising kern.ipc.somaxconn appropriately resolved it. So, back to the problem. I have a new theory that maybe someone can validate.
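For anyone trying to reproduce this, here is a minimal sketch of the kind of setup involved (not my actual server; the class name and handler body are placeholders). The point is that the backlog passed when binding the listener is silently capped by kern.ipc.somaxconn on macOS, so with 1000 wrk connections both the sysctl (e.g. sudo sysctl kern.ipc.somaxconn=2048) and the backlog argument need to be raised.

import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Executors;

public class PlaintextServer {
    public static void main(String[] args) throws Exception {
        // second argument is the accept backlog; the kernel caps it at kern.ipc.somaxconn
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 2048);
        // one virtual thread per request, scheduled on the default carrier pool
        server.setExecutor(Executors.newVirtualThreadPerTaskExecutor());
        server.createContext("/plaintext", exchange -> {
            byte[] body = "Hello, World!".getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "text/plain");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }
}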
My theory is that without reducing the parallelism all cores are in use by carrier threads, so any GC activity has no free core to run on and competes with the carrier threads, introducing latency. Since this is a ping/pong RTT-type test, that delay is enough to make the processors go into a park/unpark cycle.
Which leads me to the question: does Java need a better (or dynamic) way of determining the optimum parallelism based on load? It seems the only way to set the parallelism is at startup, and it is then constant for the life of the JVM.
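For reference, here is what I mean by "only at startup" (a hedged sketch; the Demo class and the programmatic System.setProperty call are my own illustration): the scheduler reads jdk.virtualThreadScheduler.parallelism (and maxPoolSize) once, when it is created, and there is no supported way to change it afterwards.

// Normally set on the command line, e.g.:
//   java -Djdk.virtualThreadScheduler.parallelism=3 Demo
// Setting it programmatically before the first virtual thread is created is
// assumed to work only because the scheduler is initialized lazily.
public class Demo {
    public static void main(String[] args) throws Exception {
        System.setProperty("jdk.virtualThreadScheduler.parallelism", "3");
        Thread.startVirtualThread(() ->
                System.out.println("carrier parallelism is now fixed for the JVM's lifetime"))
              .join();
    }
}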
With default parallelism:
robertengels at macmini go-wrk % wrk -H 'Host: imac' -H 'Accept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7' -H 'Connection: keep-alive' --latency -d 20 -c 1000 --timeout 8 -t 2 http://imac:8080/plaintext
Running 20s test @ http://imac:8080/plaintext
  2 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    83.98ms  199.72ms    4.22s   89.05%
    Req/Sec    51.75k     5.30k    61.84k   75.75%
  Latency Distribution
     50%    5.29ms
     75%   65.42ms
     90%  301.31ms
     99%  788.27ms
  2059873 requests in 20.03s, 278.95MB read
Requests/sec: 102854.15
Transfer/sec:     13.93MB
and with reduced parallelism (3):
robertengels at macmini go-wrk % wrk -H 'Host: imac' -H 'Accept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7' -H 'Connection: keep-alive' --latency -d 20 -c 1000 --timeout 8 -t 2 http://imac:8080/plaintext
Running 20s test @ http://imac:8080/plaintext
  2 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    10.08ms   13.10ms  408.99ms   96.65%
    Req/Sec    56.10k     5.95k    65.77k    74.25%
  Latency Distribution
     50%    8.55ms
     75%    9.86ms
     90%   13.21ms
     99%   45.54ms
  2232236 requests in 20.02s, 302.29MB read
Requests/sec: 111475.38
Transfer/sec:     15.10MB
Notice the huge difference in the average, tail, and variance of the latency. The more than order-of-magnitude difference in max latency is very concerning.
> On Aug 14, 2024, at 8:44 AM, robert engels <robaho at icloud.com> wrote:
>
> Hi Alan. I agree, and I’ll try to resolve this today. But what I am struggling with is why the parallelism is the only thing that matters: the poller mode or the number of pollers makes no difference.
>
> If I lower the parallelism I get better performance (it matches the NIO version; all systems have roughly the same error rate regardless of throughput).
>
>> On Aug 14, 2024, at 1:15 AM, Alan Bateman <Alan.Bateman at oracle.com> wrote:
>>
>>
>>
>> On 13/08/2024 16:34, robert engels wrote:
>>> I did. It didn’t make any difference. I checked the thread dump as well and the extras were created.
>>>
>>> Surprised that lowering the priority didn’t help - so now I need to think about other options. It feels like, when the carriers can use all the cores, something prevents the poller from running - like some sort of lock being held by the carrier/VT - so it thrashes around until it eventually gets a chance.
>>>
>> With pollerMode=2 the work is done on virtual threads, so I assume you don't see the master poller stealing CPU cycles.
>>
>> In any case, what you describe sounds a bit like bursty usage where all FJP workers are scanning for work before parking, and/or something else that may be macOS specific when running out of some resource. I suspect tuning the networking params to remove the errors/timeouts might make it a bit easier to study.
>>
>> -Alan
>>
>>