String encoding to ByteBuffer

Carl M java at rkive.org
Sun Feb 26 23:39:57 UTC 2023


I'm looking into adding a fast path for encoding Strings into ByteBuffers, and wanted to get feedback on a possible approach.  My use case is taking mostly-ASCII Strings and writing them as UTF-8 to disk or the network.  There are two ways to do this today, and both have drawbacks:

1.  Use String.getBytes(StandardCharsets.UTF_8), and call ByteBuffer.put().  The downside of this approach is that it allocates a temporary byte[] copy of the String's contents.  The upside is that ByteBuffer uses the intrinsic copy methods, which are fast.

2.  Wrap the String in a CharBuffer, and call CharsetEncoder.encode(CharBuffer, ByteBuffer).  This avoids copying the String value.  However, when using the UTF_8 encoder, there is no fast path for writing to direct ByteBuffers.  sun.nio.cs.UTF_8.encodeLoop() only has fast paths for when the destination is array-based.  This approach allocates less memory, but is slower overall in my JMH benchmark.
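
For reference, the two approaches above can be sketched roughly as follows.  This is only an illustration using the public API; the class and method names (EncodeApproaches, viaGetBytes, viaEncoder) are made up for this example, and it makes no attempt to reproduce the benchmark.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical holder class, for illustration only.
final class EncodeApproaches {

    // Approach 1: materialize a byte[] copy, then bulk-copy it into the buffer.
    static void viaGetBytes(String s, ByteBuffer dst) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8); // allocates a temporary array
        dst.put(utf8); // throws BufferOverflowException if dst is too small
    }

    // Approach 2: wrap the String (no copy) and run the encoder over it.
    static CoderResult viaEncoder(String s, ByteBuffer dst) {
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
        CharBuffer src = CharBuffer.wrap(s); // read-only view over the String
        CoderResult r = enc.encode(src, dst, true);
        if (r.isUnderflow()) {
            r = enc.flush(dst); // completes the encoding operation
        }
        return r; // OVERFLOW here means dst was too small
    }

    public static void main(String[] args) {
        ByteBuffer a = ByteBuffer.allocateDirect(64);
        ByteBuffer b = ByteBuffer.allocateDirect(64);
        viaGetBytes("h\u00e9llo", a);
        CoderResult r = viaEncoder("h\u00e9llo", b);
        a.flip();
        b.flip();
        System.out.println(r + ", same bytes: " + a.equals(b));
    }
}
```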

To fix this, I looked at adding an overload to CharsetEncoder that accepts a String (or a CharSequence) and a ByteBuffer as the destination.  However, this is not easily doable, since such a method is hard to call in a loop.  If the String overflows the ByteBuffer, the caller needs to be able to provide a new ByteBuffer and resume from where they left off.  The CharBuffer approach works here because the CharBuffer keeps the position last read, and encoding can resume from there.
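
To make the resume behaviour concrete, here is a rough sketch of the loop as it works today with a CharBuffer.  The names (ResumableEncode, encodeChunked) and the fixed-size chunking policy are assumptions made up for this example; a real caller would more likely write each full buffer out and reuse it.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class ResumableEncode {

    // Encodes s as UTF-8 into a list of direct buffers of chunkSize bytes each.
    static List<ByteBuffer> encodeChunked(String s, int chunkSize) {
        if (chunkSize < 4) {
            // One code point may need 4 bytes; a smaller buffer could never make progress.
            throw new IllegalArgumentException("chunkSize must be at least 4");
        }
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
        CharBuffer src = CharBuffer.wrap(s); // src.position() is the resume point
        List<ByteBuffer> out = new ArrayList<>();
        ByteBuffer dst = ByteBuffer.allocateDirect(chunkSize);

        CoderResult r;
        while (!(r = enc.encode(src, dst, true)).isUnderflow()) {
            if (!r.isOverflow()) {
                // Malformed or unmappable input, e.g. an unpaired surrogate.
                throw new IllegalArgumentException("cannot encode input: " + r);
            }
            // dst is full: hand it off and continue from src.position().
            dst.flip();
            out.add(dst);
            dst = ByteBuffer.allocateDirect(chunkSize);
        }
        while (enc.flush(dst).isOverflow()) {
            dst.flip();
            out.add(dst);
            dst = ByteBuffer.allocateDirect(chunkSize);
        }
        dst.flip();
        out.add(dst);
        return out;
    }

    public static void main(String[] args) {
        List<ByteBuffer> chunks = encodeChunked("mostly ASCII text with one \u00e9", 8);
        int total = 0;
        for (ByteBuffer b : chunks) {
            total += b.remaining();
        }
        System.out.println(chunks.size() + " buffers, " + total + " bytes");
    }
}
```

The caller never tracks an index itself; the only state carried between iterations is the CharBuffer's position, which is what a plain String argument lacks.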

To encode a String this way, the caller needs to know the index of the last character written in order to resume with a new buffer.  However, the return type of CharsetEncoder's encode method is a CoderResult, and its length() method can only be called for error results; it throws UnsupportedOperationException for underflow and overflow.  This means that there isn't a usable return type here (neither int nor CoderResult can be used).
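
The CoderResult limitation is easy to check with a small sketch.  The class and method names here (CoderResultLength, lengthIsUsable) are made up for the example; only the CoderResult calls are real API.

```java
import java.nio.charset.CoderResult;

// Hypothetical demo class, for illustration only.
final class CoderResultLength {

    // Returns true if r.length() can be called without throwing.
    static boolean lengthIsUsable(CoderResult r) {
        try {
            r.length();
            return true;
        } catch (UnsupportedOperationException e) {
            // Thrown when the result does not describe an error condition.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("UNDERFLOW: " + lengthIsUsable(CoderResult.UNDERFLOW));
        System.out.println("OVERFLOW:  " + lengthIsUsable(CoderResult.OVERFLOW));
        System.out.println("malformed: " + lengthIsUsable(CoderResult.malformedForLength(1)));
    }
}
```

Even for error results, length() reports the length of the bad input rather than how far encoding got, so it could not carry the resume index in any case.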

Another, almost-possible solution I was considering was adding a special case to UTF_8 for direct buffer destinations, and a corresponding JLA.encodeASCII overload that accepts a ByteBuffer.  The challenge here is that a wrapped CharBuffer doesn't have an array, and so doesn't get the fast-path copying.

The reason I am reaching out here is that I am looking for feedback on my analysis of the existing API.  I am wondering what API compromises could be made to fast path writing Strings to direct buffers, which I feel is probably a common operation.  The only reasonable way I can see to implement this is a new return type, which also seems undesirable.
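
To illustrate what I mean by a new return type, here is a purely hypothetical sketch.  EncodeProgress and this static encode helper do not exist in the JDK and are not a concrete proposal; the helper is emulated on top of the existing CharBuffer API (so it has none of the hoped-for speedup) just to show the shape a caller would see.  It needs Java 16 or later for the record.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not an existing or proposed JDK API.
final class EncodeProgressSketch {

    // Hypothetical return type: the usual CoderResult plus the index to resume from.
    record EncodeProgress(CoderResult result, int nextCharIndex) {}

    // Hypothetical String-based encode, emulated with CharBuffer.wrap.
    static EncodeProgress encode(CharsetEncoder enc, String s, int from, ByteBuffer dst) {
        // The wrapped buffer's position is an index into s, starting at 'from'.
        CharBuffer src = CharBuffer.wrap(s, from, s.length());
        CoderResult r = enc.encode(src, dst, true);
        return new EncodeProgress(r, src.position());
    }

    public static void main(String[] args) {
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
        ByteBuffer dst = ByteBuffer.allocateDirect(4);
        String s = "abcdefg";
        int index = 0;
        EncodeProgress p;
        do {
            dst.clear(); // stands in for writing dst out and reusing it
            p = encode(enc, s, index, dst);
            index = p.nextCharIndex();
            System.out.println(p.result() + " at char index " + index);
        } while (p.result().isOverflow());
    }
}
```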

Carl


More information about the core-libs-dev mailing list