Loom and high performance networking
Robert Engels
robaho at icloud.com
Wed Aug 14 15:47:09 UTC 2024
I was able to correct the errors: the initial accepts were failing, and raising kern.ipc.somaxconn appropriately resolved it. So, back to the problem. I have a new theory that maybe someone can validate.
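For anyone trying to reproduce this, here is a minimal sketch of the kind of setup involved (not my actual server; the class name and handler body are placeholders). The point is that the backlog passed when binding the listener is silently capped by kern.ipc.somaxconn on macOS, so with 1000 wrk connections both the sysctl (e.g. sudo sysctl kern.ipc.somaxconn=2048) and the backlog argument need to be raised.

import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Executors;

public class PlaintextServer {
    public static void main(String[] args) throws Exception {
        // second argument is the accept backlog; the kernel caps it at kern.ipc.somaxconn
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 2048);
        // one virtual thread per request, scheduled on the default carrier pool
        server.setExecutor(Executors.newVirtualThreadPerTaskExecutor());
        server.createContext("/plaintext", exchange -> {
            byte[] body = "Hello, World!".getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "text/plain");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }
}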
My theory is that without reducing the parallelism all cores are in use by carrier threads, so any GC activity has no free core to run on and competes with the carrier threads, introducing latency. Since this is a ping/pong RTT-type test, that delay is enough to make the processors go into a park/unpark cycle.
Which leads me to the question: does Java need a better (or dynamic) way of determining the optimum parallelism based on load? It seems the only way to set the parallelism is at startup, and it is then constant for the life of the JVM.
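For reference, here is what I mean by "only at startup" (a hedged sketch; the Demo class and the programmatic System.setProperty call are my own illustration): the scheduler reads jdk.virtualThreadScheduler.parallelism (and maxPoolSize) once, when it is created, and there is no supported way to change it afterwards.

// Normally set on the command line, e.g.:
//   java -Djdk.virtualThreadScheduler.parallelism=3 Demo
// Setting it programmatically before the first virtual thread is created is
// assumed to work only because the scheduler is initialized lazily.
public class Demo {
    public static void main(String[] args) throws Exception {
        System.setProperty("jdk.virtualThreadScheduler.parallelism", "3");
        Thread.startVirtualThread(() ->
                System.out.println("carrier parallelism is now fixed for the JVM's lifetime"))
              .join();
    }
}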
With default parallelism:
robertengels at macmini go-wrk % wrk -H 'Host: imac' -H 'Accept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7' -H 'Connection: keep-alive' --latency -d 20 -c 1000 --timeout 8 -t 2 http://imac:8080/plaintext
Running 20s test @ http://imac:8080/plaintext
  2 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    83.98ms  199.72ms    4.22s   89.05%
    Req/Sec    51.75k     5.30k    61.84k   75.75%
  Latency Distribution
     50%    5.29ms
     75%   65.42ms
     90%  301.31ms
     99%  788.27ms
  2059873 requests in 20.03s, 278.95MB read
Requests/sec: 102854.15
Transfer/sec:     13.93MB
and with reduced parallelism (3):
robertengels at macmini go-wrk % wrk -H 'Host: imac' -H 'Accept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7' -H 'Connection: keep-alive' --latency -d 20 -c 1000 --timeout 8 -t 2 http://imac:8080/plaintext
Running 20s test @ http://imac:8080/plaintext
  2 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    10.08ms   13.10ms  408.99ms   96.65%
    Req/Sec    56.10k     5.95k    65.77k    74.25%
  Latency Distribution
     50%    8.55ms
     75%    9.86ms
     90%   13.21ms
     99%   45.54ms
  2232236 requests in 20.02s, 302.29MB read
Requests/sec: 111475.38
Transfer/sec:     15.10MB
Notice the huge difference in the average, tail, and variance of the latency. The more than order-of-magnitude difference in max latency is very concerning.
> On Aug 14, 2024, at 8:44 AM, robert engels <robaho at icloud.com> wrote:
>
> Hi Alan. I agree, and I’ll try to resolve this today. But what I am struggling with is why the parallelism is the only thing that matters: the poller mode or the number of pollers makes no difference.
>
> If I lower the parallelism I get better performance (it matches the NIO version; all systems have roughly the same error rate regardless of throughput).
>
>> On Aug 14, 2024, at 1:15 AM, Alan Bateman <Alan.Bateman at oracle.com> wrote:
>>
>>
>>
>> On 13/08/2024 16:34, robert engels wrote:
>>> I did. It didn’t make any difference. I checked the thread dump as well and the extras were created.
>>>
>>> Surprised that lowering the priority didn’t help - so now I need to think about other options. It feels like, when the carriers can use all the cores, something prevents the poller from running - like some sort of lock being held by the carrier/VT - so it thrashes around until it eventually gets a chance.
>>>
>> With pollerMode=2 the work is done on virtual threads, so I assume you don't see the master poller stealing CPU cycles.
>>
>> In any case, what you describe sounds a bit like bursty usage where all FJP workers are scanning for work before parking, and/or something else that may be macOS specific when running out of some resource. I suspect tuning the networking params to remove the errors/timeouts might make it a bit easier to study.
>>
>> -Alan
>>
>>