Possible network issue
robert engels
rengels at ix.netcom.com
Wed Nov 8 00:59:29 UTC 2023
Further analysis shows no socket leaks. Running netstat after the test harness completes consistently shows only a small amount of network memory in use, e.g.
iMac:helidon_test robertengels$ netstat -nm
595/54465 mbufs in use:
347 mbufs allocated to data
8 mbufs allocated to packet headers
240 mbufs allocated to packet tags
53870 mbufs allocated to caches
280/11770 mbuf 2KB clusters in use
0/43368 mbuf 4KB clusters in use
0/10922 mbuf 16KB clusters in use
398566 KB allocated to network (0.9% in use)
0 KB returned to the system
0 requests for memory denied
12205 requests for memory delayed
166 calls to drain routines
Most importantly, if I lower the amount of data transferred per connection, the test completes without failures AND the count of ‘requests for memory delayed’ does not increase.
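For reference, here is a rough sketch of how I could capture that correlation automatically: snapshot the ‘requests for memory delayed’ counter from `netstat -nm` before and after a test run and compare. This is only a sketch - the class name is made up, the load test itself is not shown, and the parsing assumes the macOS output format pasted above.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// Sketch: snapshot the "requests for memory delayed" counter from `netstat -nm`
// before and after a test run, so a jump in the counter can be lined up with
// the SocketException failures. Parsing assumes the macOS output shown above.
public class MbufDelayedSnapshot {

    static long requestsDelayed() throws IOException, InterruptedException {
        Process p = new ProcessBuilder("netstat", "-nm").start();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                line = line.trim();
                if (line.endsWith("requests for memory delayed")) {
                    // line looks like: "12205 requests for memory delayed"
                    return Long.parseLong(line.split("\\s+")[0]);
                }
            }
        } finally {
            p.waitFor();
        }
        return -1; // counter line not found
    }

    public static void main(String[] args) throws Exception {
        long before = requestsDelayed();
        // ... run the load test here, e.g. invoke h2load externally ...
        long after = requestsDelayed();
        System.out.println("requests for memory delayed: " + before + " -> " + after);
    }
}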
> On Nov 7, 2023, at 6:50 PM, robert engels <rengels at ix.netcom.com> wrote:
>
> I ran the test multiple times - each run had failures. As soon as a failure (SocketException) was reported, I ran netstat.
>
> The ‘requests for memory delayed’ counts seem to align with the number of reported failures - which would mean that the OS is returning ENOBUFS as a temporary condition.
>
> Here are the results:
>
> iMac:helidon_test robertengels$ netstat -nm
> 9138/15954 mbufs in use:
> 8231 mbufs allocated to data
> 1 mbufs allocated to ancillary data
> 665 mbufs allocated to packet headers
> 1 mbufs allocated to socket names and addresses
> 240 mbufs allocated to packet tags
> 6816 mbufs allocated to caches
> 941/2730 mbuf 2KB clusters in use
> 3/50678 mbuf 4KB clusters in use
> 4100/10922 mbuf 16KB clusters in use
> 398566 KB allocated to network (18.2% in use)
> 0 KB returned to the system
> 0 requests for memory denied
> 10163 requests for memory delayed
> 131 calls to drain routines
> iMac:helidon_test robertengels$ netstat -nm
> 29711/29910 mbufs in use:
> 27064 mbufs allocated to data
> 2407 mbufs allocated to packet headers
> 240 mbufs allocated to packet tags
> 199 mbufs allocated to caches
> 3136/3238 mbuf 2KB clusters in use
> 3121/51132 mbuf 4KB clusters in use
> 10728/10922 mbuf 16KB clusters in use
> 398566 KB allocated to network (50.3% in use)
> 0 KB returned to the system
> 0 requests for memory denied
> 10342 requests for memory delayed
> 136 calls to drain routines
> iMac:helidon_test robertengels$ netstat -nm
> 61785/76680 mbufs in use:
> 54018 mbufs allocated to data
> 7527 mbufs allocated to packet headers
> 240 mbufs allocated to packet tags
> 14895 mbufs allocated to caches
> 7991/10098 mbuf 2KB clusters in use
> 20468/44823 mbuf 4KB clusters in use
> 10909/10922 mbuf 16KB clusters in use
> 398566 KB allocated to network (72.9% in use)
> 0 KB returned to the system
> 0 requests for memory denied
> 10407 requests for memory delayed
> 136 calls to drain routines
> iMac:helidon_test robertengels$ netstat -nm
> 18438/23213 mbufs in use:
> 14924 mbufs allocated to data
> 3274 mbufs allocated to packet headers
> 240 mbufs allocated to packet tags
> 4775 mbufs allocated to caches
> 4168/4346 mbuf 2KB clusters in use
> 1245/49996 mbuf 4KB clusters in use
> 6159/10922 mbuf 16KB clusters in use
> 398566 KB allocated to network (29.9% in use)
> 0 KB returned to the system
> 0 requests for memory denied
> 10668 requests for memory delayed
> 141 calls to drain routines
> iMac:helidon_test robertengels$ netstat -nm
> 19765/22908 mbufs in use:
> 15860 mbufs allocated to data
> 3665 mbufs allocated to packet headers
> 240 mbufs allocated to packet tags
> 3143 mbufs allocated to caches
> 3975/4130 mbuf 2KB clusters in use
> 10/50751 mbuf 4KB clusters in use
> 7783/10922 mbuf 16KB clusters in use
> 398566 KB allocated to network (35.2% in use)
> 0 KB returned to the system
> 0 requests for memory denied
> 11091 requests for memory delayed
> 146 calls to drain routines
> iMac:helidon_test robertengels$
>
> I then killed the server process and ran netstat again, and it reported:
>
> iMac:~ robertengels$ netstat -nm
> 596/11888 mbufs in use:
> 348 mbufs allocated to data
> 8 mbufs allocated to packet headers
> 240 mbufs allocated to packet tags
> 11292 mbufs allocated to caches
> 280/2730 mbuf 2KB clusters in use
> 0/51669 mbuf 4KB clusters in use
> 0/10922 mbuf 16KB clusters in use
> 398566 KB allocated to network (0.9% in use)
> 0 KB returned to the system
> 0 requests for memory denied
> 11091 requests for memory delayed
> 146 calls to drain routines
> iMac:~ robertengels$
>
> It could be that the server process is leaking sockets - but I don’t think so. I am going to perform some heap analysis to make sure.
>
>> On Nov 7, 2023, at 10:59 AM, Robert Engels <rengels at ix.netcom.com> wrote:
>>
>> Will do. It does seem temporary. For instance, with h2load testing, out of 100k requests it will fail 10k of them - though this may be overstated by 10x since I have it set to pipeline 10 requests per connection.
>>
>> If I let the system sit for a bit - probably allowing the TIME_WAIT connections to be cleaned up - it will often run without failure. But rapid invocations of the test lead to the cited exception.
>>
>>> On Nov 7, 2023, at 10:33 AM, Alan Bateman <Alan.Bateman at oracle.com> wrote:
>>>
>>> On 07/11/2023 13:28, Robert Engels wrote:
>>>> I suspect that what is happening is that the JVM gets the epoll notification that the socket is writable, but by the time the writer runs, the network buffers have been exhausted.
>>>>
>>>> This should be an expected condition and seems like a JDK bug introduced with VT support. I checked earlier JDK releases (JDK 15) and the sync -> non-blocking code was not present in SocketOutputStream.java.
>>> I don't think this is specific to virtual threads as we're seeing it elsewhere too. If there is no space in the socket write buffer then EAGAIN/EWOULDBLOCK is handled. The intermittent ENOBUFS, which we've only seen on macos-aarch64 under load, is hard to explain. We've also had some intermittent ENOMEM when joining multicast groups, also not specific to virtual threads.
>>>
>>> Your mail said you can duplicate this readily. It would be useful if you could capture the output of `netstat -nm` around the time this happens, to see whether this is a real resource-exhaustion issue or not.
>>>
>>> -Alan
>>>
>>>
>
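As an aside, for anyone following the EAGAIN/ENOBUFS distinction above: below is a rough, generic NIO sketch of the non-blocking write pattern being discussed - it is not the JDK’s internal SocketOutputStream code. It shows why EAGAIN/EWOULDBLOCK backpressure is absorbed by the poll-and-retry loop (write() simply returns 0 and the thread waits for writability), whereas a transient ENOBUFS is raised from write() as an IOException and reaches the caller unless it is retried explicitly.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

public class NonBlockingWriteSketch {

    // Writes the whole buffer on a non-blocking channel.
    // write() returning 0 is the EAGAIN/EWOULDBLOCK case: the kernel send
    // buffer is full, so we park on the selector until the socket is
    // writable again and retry. A transient ENOBUFS, by contrast, surfaces
    // as an IOException thrown by write().
    static void writeFully(SocketChannel ch, ByteBuffer buf) throws IOException {
        ch.configureBlocking(false);
        try (Selector sel = Selector.open()) {
            SelectionKey key = ch.register(sel, SelectionKey.OP_WRITE);
            while (buf.hasRemaining()) {
                int n = ch.write(buf);      // may throw IOException on ENOBUFS
                if (n == 0) {               // no space in the socket write buffer
                    sel.select();           // wait until the socket is writable
                    sel.selectedKeys().clear();
                }
            }
            key.cancel();
        }
    }
}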