Loom and high performance networking
Robert Engels
robaho at icloud.com
Wed Aug 14 16:01:17 UTC 2024
Also, for comparison purposes, using platform threads instead of virtual threads with default parallelism shows much better average/tail latency and slightly lower throughput.
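For context, here is a minimal sketch of the two setups being compared. This is not the actual benchmark server used in these runs; the port, response body, and the useVirtualThreads property are illustrative assumptions. The only difference between the two modes is which ExecutorService hands out threads; with virtual threads, the default scheduler's parallelism equals the number of available processors unless overridden at startup.

import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PlaintextServer {
    public static void main(String[] args) throws Exception {
        // Illustrative switch between the two models under test:
        // a platform thread per connection vs. a virtual thread per connection.
        boolean useVirtual = Boolean.getBoolean("useVirtualThreads");
        ExecutorService pool = useVirtual
                ? Executors.newVirtualThreadPerTaskExecutor()
                : Executors.newThreadPerTaskExecutor(Thread.ofPlatform().factory());

        byte[] response = ("HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n"
                + "Content-Length: 13\r\n\r\nHello, World!").getBytes(StandardCharsets.US_ASCII);

        try (ServerSocket server = new ServerSocket(8080)) {
            while (true) {
                Socket socket = server.accept();
                pool.submit(() -> {
                    // A real plaintext benchmark handler parses requests and keeps the
                    // connection alive; this sketch reads once and writes one response.
                    try (socket; OutputStream out = socket.getOutputStream()) {
                        socket.getInputStream().read(new byte[1024]);
                        out.write(response);
                        out.flush();
                    } catch (Exception ignored) {
                        // ignore per-connection errors in this sketch
                    }
                });
            }
        }
    }
}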
Using a platform thread per connection:
robertengels at macmini go-wrk % wrk -H 'Host: imac' -H 'Accept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7' -H 'Connection: keep-alive' --latency -d 20 -c 1000 --timeout 8 -t 2 http://imac:8080/plaintext
Running 20s test @ http://imac:8080/plaintext
  2 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    28.09ms   70.26ms  818.88ms   92.55%
    Req/Sec    53.96k     4.02k    63.58k    71.25%
  Latency Distribution
     50%    8.51ms
     75%    9.44ms
     90%   41.65ms
     99%  378.31ms
  2148390 requests in 20.03s, 290.94MB read
Requests/sec: 107280.14
Transfer/sec:     14.53MB
Using virtual threads with default parallelism:
robertengels at macmini go-wrk % wrk -H 'Host: imac' -H 'Accept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7' -H 'Connection: keep-alive' --latency -d 20 -c 1000 --timeout 8 -t 2 http://imac:8080/plaintext
Running 20s test @ http://imac:8080/plaintext
  2 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    67.63ms  146.27ms    2.42s    87.57%
    Req/Sec    55.05k     4.93k    70.66k    73.50%
  Latency Distribution
     50%    5.63ms
     75%   48.40ms
     90%  262.75ms
     99%  674.56ms
  2192116 requests in 20.03s, 296.86MB read
Requests/sec: 109430.53
Transfer/sec:     14.82MB
> On Aug 14, 2024, at 10:47 AM, Robert Engels <robaho at icloud.com> wrote:
>
> I was able to correct the errors: the initial accepts were failing, and adjusting kern.ipc.somaxconn appropriately resolved it. So, back to the problem. I have a new theory that maybe someone can validate.
>
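For reference, the macOS tuning implied above is roughly the following; the value 1024 is illustrative, not the setting actually used:

sysctl kern.ipc.somaxconn               # inspect the current listen-backlog limit
sudo sysctl -w kern.ipc.somaxconn=1024  # raise it for the benchmark run

The backlog requested when the listening socket is bound (e.g. the backlog argument to ServerSocket or ServerSocketChannel.bind) also caps the accept queue, so it may need to be raised alongside the sysctl.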
> I theorize that without reducing the parallelism, all cores are in use, which means any GC activity has no spare core to run on and competes with the carrier threads, introducing latency. Since this is a ping/pong RTT-style test, that delay is enough to make the processors go into a park/unpark cycle.
>
> Which leads me to the question: does Java need a better (or dynamic) way of determining the optimum parallelism based on load? It seems the only way to set the parallelism is at startup, and it remains constant thereafter.
>
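A sketch of that startup-only configuration, assuming the standard Loom scheduler system properties; server.jar is a placeholder for the actual benchmark server:

java -Djdk.virtualThreadScheduler.parallelism=3 -jar server.jar

jdk.virtualThreadScheduler.parallelism (and its companion jdk.virtualThreadScheduler.maxPoolSize) is read once when the default virtual-thread scheduler is created, which is why the parallelism cannot currently be adjusted in response to load. The poller mode mentioned later in the thread is likewise set on the command line, e.g. -Djdk.pollerMode=2, on builds that expose that internal property.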
> With default parallelism:
>
> robertengels at macmini go-wrk % wrk -H 'Host: imac' -H 'Accept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7' -H 'Connection: keep-alive' --latency -d 20 -c 1000 --timeout 8 -t 2 http://imac:8080/plaintext
> Running 20s test @ http://imac:8080/plaintext
>   2 threads and 1000 connections
>   Thread Stats   Avg      Stdev     Max   +/- Stdev
>     Latency    83.98ms  199.72ms    4.22s    89.05%
>     Req/Sec    51.75k     5.30k    61.84k    75.75%
>   Latency Distribution
>      50%    5.29ms
>      75%   65.42ms
>      90%  301.31ms
>      99%  788.27ms
>   2059873 requests in 20.03s, 278.95MB read
> Requests/sec: 102854.15
> Transfer/sec:     13.93MB
>
> and with reduced parallelism (3):
>
> robertengels at macmini go-wrk % wrk -H 'Host: imac' -H 'Accept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7' -H 'Connection: keep-alive' --latency -d 20 -c 1000 --timeout 8 -t 2 http://imac:8080/plaintext
> Running 20s test @ http://imac:8080/plaintext
>   2 threads and 1000 connections
>   Thread Stats   Avg      Stdev     Max   +/- Stdev
>     Latency    10.08ms   13.10ms  408.99ms   96.65%
>     Req/Sec    56.10k     5.95k    65.77k    74.25%
>   Latency Distribution
>      50%    8.55ms
>      75%    9.86ms
>      90%   13.21ms
>      99%   45.54ms
>   2232236 requests in 20.02s, 302.29MB read
> Requests/sec: 111475.38
> Transfer/sec:     15.10MB
>
> Notice the huge difference in average latency, tail latency, and latency variance. The more-than-order-of-magnitude difference in max latency is very concerning.
>
>> On Aug 14, 2024, at 8:44 AM, robert engels <robaho at icloud.com> wrote:
>>
>> Hi Alan. I agree, and I’ll try to resolve this today. But what I am struggling with is why the parallelism is the only thing that matters; the poller mode and the number of pollers make no difference.
>>
>> If I lower the parallelism I get better performance, matching the NIO results (all systems have roughly the same error rate regardless of throughput).
>>
>>> On Aug 14, 2024, at 1:15 AM, Alan Bateman <Alan.Bateman at oracle.com> wrote:
>>>
>>>
>>>
>>> On 13/08/2024 16:34, robert engels wrote:
>>>> I did. It didn’t make any difference. I checked the thread dump as well and the extras were created.
>>>>
>>>> I’m surprised that lowering the priority didn’t help, so now I need to think about other options. It feels like, when the carriers can use all the cores, the poller is prevented from running: as if some sort of lock is being held by the carrier/virtual thread, so it thrashes around until it eventually gets a chance.
>>>>
>>> With pollerMode=2 the work is done on virtual threads, so I assume you don't see the master poller stealing CPU cycles.
>>>
>>> In any case, what you describe sounds a bit like bursty usage where all FJP workers are scanning for work before parking, and/or something else that may be macOS specific when running out of some resource. I suspect tuning the networking params to remove the errors/timeouts might make it a bit easier to study.
>>>
>>> -Alan
>>>
>>>
>