DatagramChannel performance issue

Martin Thompson mjpt777 at gmail.com
Sun Sep 2 17:28:14 UTC 2018


I've put together some benchmarks and can quantify the impact of the
current implementation.

https://github.com/mjpt777/java-benchmarks

If we consider the latency impact via loopback, which is similar to a
fast kernel-bypass network, there is a 25% increase in round-trip
latency with two sources compared to one. This applies dual sources in
only one direction, so it is half the actual RTT impact. With Java 11
the delta between the single- and dual-source cases is slightly
increased.
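The measurement shape is a simple echo round trip between two connected
DatagramChannels over loopback. The following is only a rough sketch of that
shape, not the linked JMH benchmark; the class name, buffer size, and
single-threaded structure are illustrative choices.

```java
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;

public final class LoopbackRoundTrip
{
    // Sends one datagram to an echoing channel over loopback and times the
    // full round trip in nanoseconds. Single-threaded works here because the
    // kernel buffers each datagram until the blocking read collects it.
    public static long onePingRttNs() throws Exception
    {
        try (DatagramChannel echo = DatagramChannel.open();
             DatagramChannel pinger = DatagramChannel.open())
        {
            echo.bind(new InetSocketAddress("127.0.0.1", 0));
            pinger.bind(new InetSocketAddress("127.0.0.1", 0));
            echo.connect(pinger.getLocalAddress());
            pinger.connect(echo.getLocalAddress());

            final ByteBuffer buf = ByteBuffer.allocateDirect(64);

            final long start = System.nanoTime();
            buf.clear();
            pinger.write(buf);   // send the ping
            buf.clear();
            echo.read(buf);      // blocking receive of the ping
            buf.flip();
            echo.write(buf);     // echo it back
            buf.clear();
            pinger.read(buf);    // receive the echo

            return System.nanoTime() - start;
        }
    }
}
```

A real benchmark would of course run warm-up iterations and report a
distribution rather than a single sample.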

https://github.com/mjpt777/java-benchmarks/blob/master/results/latency.txt

This suggests a 50% total cost increase for each one-way transmission
when the source does not match the cached address and a new address has
to be created and set in the cache field. Profiling should reflect this
hypothesis, and the flame graphs confirm that it does.

https://github.com/mjpt777/java-benchmarks/blob/master/src/jmh/java/uk/co/real_logic/benchmarks/nio/MultiThreadedConnectedDatagramChannelBenchmark.java
https://github.com/mjpt777/java-benchmarks/blob/master/results/cpu-single-source.svg

https://github.com/mjpt777/java-benchmarks/blob/master/src/jmh/java/uk/co/real_logic/benchmarks/nio/MultiThreadedSeparateConnectedDatagramChannelBenchmark.java
https://github.com/mjpt777/java-benchmarks/blob/master/results/cpu-dual-source.svg

With a single source per channel, the majority of the time is spent in
the underlying receive from the network. With two sources, ~50% of the
time is spent in the JNI code; note the up call to allocate the address
object, which seems expensive for such a relatively small object.
NET_SockaddrToInetAddress is the other major component of the cost
within the JNI call.

Having compared native C implementations, I see a further saving of >
0.5μs over the single-source case, which suggests the JNI and
DatagramChannel code has significant opportunities to be more
efficient. Maybe JNI up calls are not one of the better design
choices.

Most services have more than one source, and whatever approach is
taken, it is best if no allocation occurs unless there is a change of
configuration. GC pauses can result in data loss, retransmits, and
network partitions. In managed languages, nodes appearing unavailable
is far more commonly due to GC than to actual network partitions.
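One way to make the steady state allocation free with multiple sources is a
multi-entry cache keyed on the raw source address and port, so alternating
senders do not evict each other the way the current one-entry cache does. The
sketch below is purely illustrative; the class and method names are my own,
not JDK code, and a production version would probe without allocating the
lookup key.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.util.Arrays;
import java.util.HashMap;

public final class SourceAddressCache
{
    // Key over the raw address bytes plus port, compared by content.
    static final class Key
    {
        final byte[] address;
        final int port;

        Key(final byte[] address, final int port)
        {
            this.address = address;
            this.port = port;
        }

        @Override
        public boolean equals(final Object o)
        {
            if (!(o instanceof Key))
            {
                return false;
            }
            final Key k = (Key)o;
            return port == k.port && Arrays.equals(address, k.address);
        }

        @Override
        public int hashCode()
        {
            return 31 * Arrays.hashCode(address) + port;
        }
    }

    private final HashMap<Key, InetSocketAddress> cache = new HashMap<>();

    // Returns a cached InetSocketAddress, allocating a new one only the
    // first time a given source is seen; stable peers then hit the cache.
    public InetSocketAddress lookup(final byte[] rawAddress, final int port)
        throws UnknownHostException
    {
        final Key key = new Key(rawAddress.clone(), port);
        InetSocketAddress result = cache.get(key);
        if (null == result)
        {
            result = new InetSocketAddress(InetAddress.getByAddress(key.address), port);
            cache.put(key, result);
        }
        return result;
    }
}
```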

Regards,
Martin...

On Tue, 28 Aug 2018 at 08:31, Alan Bateman <Alan.Bateman at oracle.com> wrote:
> As regards implementation changes then I think start by eliminating the
> JNI calls from the receive implementation and move it all to Java. Each
> DCI can have an area of memory (probably 18 bytes) for the native code
> to write the source address and port. The wrapper can create the
> InetAddress/InetSocketAddress. Once you get that far then you can
> evaluate if there is a need for any caching of InetAddress objects,
> maybe the existing one-entry cache can be removed, maybe it would be
> beneficial to have more caching. The benchmarks, along with other
> measurements, should help evaluate.
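To illustrate the shape of that approach: native code writes the source
address and port into a small per-channel buffer after each receive, and a
Java-side wrapper decodes it. The byte layout and names below are assumptions
for illustration only, not the actual JDK native interface.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.nio.ByteBuffer;

public final class SourceAddressDecoder
{
    // Assumed layout written by native code (hypothetical):
    //   byte 0    : family (4 = IPv4, 6 = IPv6)
    //   bytes 1-2 : port, unsigned big endian
    //   bytes 3.. : 4 or 16 raw address bytes
    public static InetSocketAddress decode(final ByteBuffer buffer)
    {
        final int family = buffer.get(0);
        final int port = ((buffer.get(1) & 0xFF) << 8) | (buffer.get(2) & 0xFF);
        final byte[] address = new byte[family == 4 ? 4 : 16];
        for (int i = 0; i < address.length; i++)
        {
            address[i] = buffer.get(3 + i);
        }
        try
        {
            return new InetSocketAddress(InetAddress.getByAddress(address), port);
        }
        catch (final UnknownHostException ex)
        {
            // getByAddress only throws for an illegal address length
            throw new IllegalStateException(ex);
        }
    }
}
```

The decode itself still allocates, which is why the follow-on question about
caching InetAddress objects matters for the multi-source case.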


More information about the nio-dev mailing list