RFR: 8305744: (ch) InputStream returned by Channels.newInputStream should have fast path for ReadableByteChannel/WritableByteChannel [v3]

Markus KARG duke at openjdk.org
Wed Jul 5 06:38:02 UTC 2023


On Sun, 9 Apr 2023 13:43:36 GMT, Markus KARG <duke at openjdk.org> wrote:

>> This sub-issue defines the work to be done to implement JDK-8265891 solely for the particular case of ReadableByteChannel-to-WritableByteChannel transfers, where neither Channel is a FileChannel, but including special treatment of SelectableChannel.
>> 
>> *Without* this optimization, the JRE applies a loop in which a temporary 16k byte array on the Java heap is used to transport the data between both channels. This implies that data is first transported from the source channel into the Java heap and then from the Java heap to the target channel. For most Channels, as they are NIO citizens, this implies two transfers between the Java heap and OS resources, which is rather time consuming. As the data is in no way modified by transferTo(), the temporary allocation of valuable heap memory is just as unnecessary as both transfers to and from the heap. It makes sense to avoid the use of Java heap memory here.
>> 
>> This optimization omits the byte array on the Java heap, effectively omitting the transfers to and from that array in turn. Instead, the transfer happens directly from Channel to Channel reusing a ByteBuffer taken from the JRE's internal buffer management utility which also has a size of 16k. Both measures make the transfer more efficient, as just the very essential resources are used, but not more.
>> 
>> Using JMH on an in-memory data transfer with custom Channels (to prevent interception by other optimizations already existing in transferTo()) it was possible to measure a 2.5% performance gain. While this does not sound like much, just as the spared 16k of Java heap doesn't, we need to consider that it also means reduced GC pressure, power consumption and CO2 footprint, and that the gain could be higher with different (possibly third-party) kinds of custom Channels (as "real I/O" between heap and OS resources is much slower than the RAM-to-RAM benchmark applied by me). As the change does not introduce new, extraordinary risks compared to the existing source code, I think it is worth gaining those 2.5+ percent. If some scalable Java-implemented cloud service is bound to exactly this loop and transports very heavy traffic, those particular 2.5+ percent could amount to a considerable number of kWh or of CO2 in absolute terms. Even if no such service exists now, it might in the future.
>
> Markus KARG has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Putting the special casing for an output stream backed by a ChannelOutputStream in one place.

# TL;DR

This MR outperforms the master branch's code by one order of magnitude in the best case and is never slower in the worst case.


# Proof

To prove this claim, Brian's benchmark is used with two different edge cases.

## Lower Bound: ByteArray-Streams

Using Brian's code (see above) I could prove that this MR is always at least as fast as the original code.


ByteArray-Streams using code from this MR:

Result "eu.sandbox.Sandbox.ChannelInputStream_transferTo":
  2.727 ±(99.9%) 0.096 ops/s [Average]
  (min, avg, max) = (2.526, 2.727, 2.900), stdev = 0.128
  CI (99.9%): [2.631, 2.823] (assumes normal distribution)


ByteArray-Streams using original code from master branch:

Result "eu.sandbox.Sandbox.ChannelInputStream_transferTo":
  2.725 ±(99.9%) 0.103 ops/s [Average]
  (min, avg, max) = (2.551, 2.725, 2.935), stdev = 0.138
  CI (99.9%): [2.622, 2.828] (assumes normal distribution)


The reason that no statistical evidence for *improved* performance could be found *in this edge case* is that what we mostly measure here is the time needed by the ByteArray*Streams to transfer bytes between the byte arrays. This takes a *very* long time - *much* more time than the actual code change we wanted to benchmark. Copying gigabytes between byte arrays is *not* the target of this MR's optimization; it happens in exactly the same way in the original code as in the MR's code, and it is (more or less) *solely* what Brian's benchmark is actually measuring. Hence the statistical result merely reflects the fact that copying gigabytes costs a lot of time, and says next to nothing about the *actual* optimization this MR is about.
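Brian's benchmark itself is not reproduced in this excerpt. As a rough, hypothetical stand-in for the lower-bound case, the following sketch shows how ByteArray*Streams end up on the ChannelInputStream.transferTo path once they are wrapped in channels (class and method names are mine, not Brian's, and this is not the JMH harness):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Channels;

public class ByteArrayTransferDemo {

    // Wraps byte-array streams into channels and back into streams, so that
    // InputStream.transferTo runs on the ChannelInputStream code path this MR
    // modifies. Returns the number of bytes transferred.
    static long transfer(byte[] data) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            return Channels.newInputStream(Channels.newChannel(new ByteArrayInputStream(data)))
                    .transferTo(Channels.newOutputStream(Channels.newChannel(sink)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transfer(new byte[16 * 1024 * 1024])); // prints 16777216
    }
}
```

As the prose above explains, the dominant cost here is the byte-array copying inside the streams themselves, which is identical for master and for this MR.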


## Upper Bound: No-Op-Channels

In fact, what this MR optimizes is not Channel-wrapped ByteArray*Streams (even though Brian's benchmark has shown that they will not work any slower than before) but - as I explained earlier - custom channels. As `Channel` has been a public interface for a very long time (since Java 1.4), it is possible and likely that custom implementations *do* exist in field use. There could be, for example, channels that use native memory *exclusively*; there could even be channels that do not transfer the data at all but simply forward buffers from readable channel to writable channel in a hardware-supported way. We do not know how those optimized custom channels actually work, but as this optimization is about proving that channel-to-channel transfer is (much) faster than channel-to-bytearray-to-channel transfer (which is how the original code works), we do not care about the actual implementation of the custom channels. Moreover, both benchmark runs (the master code just as the MR's code) read and write the channels in full length through the same public interface methods, so it is irrelevant what these methods actually do (as long as they do it correctly).

Based on this reasoning, I have modified Brian's code to no longer use byte arrays in those channels but No-Op-Channels instead. These are channels that behave *correctly* (i.e. their interface methods work consistently with the spec, and the input channel provides the full length of bytes) but whose implementations do not *copy* any bytes. With this variation of the benchmark we can finally measure what we wanted to measure in the first place: how much faster is the MR's code compared to master's code? We no longer measure how much time the particular *implementation* of an arbitrarily chosen Input/OutputStream needs - again, that time is the same for all tests because that code is *unchanged*.
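For reference, the essence of the optimization under test can be sketched as a direct channel-to-channel pump over one reused buffer. This is an illustrative sketch only, not the JDK's actual implementation (which, among other things, takes its ByteBuffer from an internal buffer cache rather than allocating one):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;

public class DirectTransferSketch {

    // Pumps everything from src to dst through one reused 16k ByteBuffer,
    // never materializing a byte[] on the Java heap in between.
    static long transferDirect(ReadableByteChannel src, WritableByteChannel dst) {
        ByteBuffer buf = ByteBuffer.allocateDirect(16384);
        long total = 0;
        try {
            while (src.read(buf) >= 0 || buf.position() > 0) {
                buf.flip();                 // switch buffer to draining mode
                total += dst.write(buf);    // hand data straight to the target
                buf.compact();              // keep any unwritten remainder
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        byte[] data = "hello, channels".getBytes();
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long n = transferDirect(Channels.newChannel(new ByteArrayInputStream(data)),
                                Channels.newChannel(sink));
        System.out.println(n); // prints 15
    }
}
```

Contrast this with master's loop, which shuttles every chunk through an intermediate byte[] on the Java heap.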

Using No-Op-Channels I could prove that the actual modification is highly beneficial: the MR's code clearly outperforms master's code by one full order of magnitude.

No-Op-Channel-Streams using code from this MR:

Result "eu.sandbox.Sandbox.ChannelInputStream_transferTo":
  3403.370 ±(99.9%) 287.694 ops/s [Average]
  (min, avg, max) = (1977.793, 3403.370, 3759.044), stdev = 384.063
  CI (99.9%): [3115.676, 3691.063] (assumes normal distribution)


No-Op-Channel-Streams using code from master:

Result "eu.sandbox.Sandbox.ChannelInputStream_transferTo":
  349.644 ±(99.9%) 27.853 ops/s [Average]
  (min, avg, max) = (240.305, 349.644, 366.043), stdev = 37.182
  CI (99.9%): [321.792, 377.497] (assumes normal distribution)

I assume the statistical evidence is obvious, just as the reasoning is: not having to move one gigabyte of data from a ByteBuffer into a byte array and back once per test iteration is certainly (much) faster than doing it. While CPUs and memory paths are very fast these days, the transfer costs still are *not zero*, and one gigabyte transferred twice per test makes this rather evident. The freed CPU, MMU and transfer-path resources (and the time saved by avoiding GCs) can be put to better use for parallel tasks (or spared completely to save CO2).

Below you can find the implementation of the No-Op-Channels:

	// (imports needed: java.io.IOException, java.nio.ByteBuffer,
	//  java.nio.channels.ClosedChannelException,
	//  java.nio.channels.ReadableByteChannel, java.nio.channels.WritableByteChannel)

	private static NullReadableByteChannel newNullReadableByteChannel(final int size) {
		return new NullReadableByteChannel(size);
	}

	// A correct but copy-free ReadableByteChannel: it claims to deliver
	// `size` bytes by advancing the destination buffer's position, but
	// never writes any actual data.
	private static class NullReadableByteChannel implements ReadableByteChannel {
		private final int size;	// total number of "virtual" bytes to deliver
		private int s;		// bytes still to deliver

		private boolean isClosed;

		private NullReadableByteChannel(final int size) {
			this.size = size;
			this.s = size;
		}

		@Override
		public boolean isOpen() {
			return !isClosed;
		}

		@Override
		public void close() throws IOException {
			this.isClosed = true;
		}

		@Override
		public int read(ByteBuffer dst) throws IOException {
			if (isClosed)
				throw new ClosedChannelException();
			if (s <= 0)
				return -1; // end of stream
			final int c = Math.min(dst.remaining(), s);
			dst.position(dst.position() + c); // pretend the bytes were filled in
			s -= c;
			return c;
		}

		// Rewinds the channel so the next benchmark iteration can reuse it.
		public void reset() {
			s = size;
		}
	}

	// A correct but copy-free WritableByteChannel: it "consumes" everything
	// by advancing the source buffer's position.
	private static WritableByteChannel newNullWritableByteChannel() {
		return new WritableByteChannel() {
			private boolean isClosed;

			@Override
			public boolean isOpen() {
				return !isClosed;
			}

			@Override
			public void close() throws IOException {
				this.isClosed = true;
			}

			@Override
			public int write(ByteBuffer src) throws IOException {
				if (isClosed)
					throw new ClosedChannelException();
				final int c = src.remaining();
				src.position(src.position() + c); // pretend the bytes were drained
				return c;
			}
		};
	}
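For completeness, here is a compact, self-contained variant (class and method names are mine) showing how such copy-free channels plug into `Channels.newInputStream(...).transferTo(...)`, the exact call chain both benchmark runs exercise:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;

public class NoOpTransferDemo {

    // Runs a transferTo of `size` "virtual" bytes through inline No-Op
    // channels equivalent to the ones above, and returns the byte count.
    static long run(final int size) {
        ReadableByteChannel source = new ReadableByteChannel() {
            private int remaining = size;
            public boolean isOpen() { return true; }
            public void close() {}
            public int read(ByteBuffer dst) {
                if (remaining <= 0)
                    return -1; // end of stream
                int c = Math.min(dst.remaining(), remaining);
                dst.position(dst.position() + c); // pretend to fill the buffer
                remaining -= c;
                return c;
            }
        };
        WritableByteChannel target = new WritableByteChannel() {
            public boolean isOpen() { return true; }
            public void close() {}
            public int write(ByteBuffer src) {
                int c = src.remaining();
                src.position(src.position() + c); // pretend to drain the buffer
                return c;
            }
        };
        try {
            return Channels.newInputStream(source).transferTo(Channels.newOutputStream(target));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run(1 << 20)); // prints 1048576
    }
}
```

The reported byte count is identical on master and on this MR; only the time spent per transfer differs.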

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13387#issuecomment-1621114554


More information about the nio-dev mailing list