RFR: JDK-8021560,(str) String constructors that take ByteBuffer

Sat Feb 17 09:05:54 UTC 2018

On 16/02/2018 22:55, Richard Warburton wrote:
> :
> I think some of the context here around application level memory management
> of the byte buffers is missing. The getBytes() methods were supposed to be
> useful in scenarios where you have a String and you want to write the byte
> encoded version of it down some kind of connection. So you're going to take
> the byte[] or ByteBuffer and hand it off somewhere - either to a streaming
> protocol (eg: TCP), a framed message based protocol (eg: UDP, Aeron, etc.)
> or perhaps to a file. A common pattern for dealing with this kind of buffer
> is to avoid trying to allocate a new ByteBuffer for every message and to
> encode onto the existing buffers before writing them into something else,
> for example a NIO channel.
>
> The review is completely correct that the API's user needs to know when the
> ByteBuffer isn't large enough to deal with the encoding, but I think there
> are two strategies for expected API usage here if you run out of space.
>
> 1. Find out how much space you need, grow your buffer size and encode onto
> the bigger buffer. So this means that in the failure case the user ideally
> gets to know how big a buffer you need. I think this still works in terms
> of mitigating per message buffer allocation as in practice it means that
> you only allocate a larger buffer when a String is encoded that is longer
> than any previous String that you've seen before. It isn't strictly
> necessary to know how big a buffer is needed btw - as long as failure is
> indicated an API user could employ a strategy like double the buffer size
> and retry. I think that's suboptimal to say the least, however, and knowing
> how big a buffer needs to be is desirable.
>
> 2. Just write the bytes that you've encoded down the stream and retry with
> an offset incremented by the number of characters written. This requires
> that the getBytes() method encodes only whole characters, rather than
> running out of space partway through a character that takes up multiple
> bytes when encoded, and that it also takes a "source offset" parameter,
> say the number of characters into the String that you are. This would work
> perfectly well in a streaming protocol. If your buffer size is N, you
> encode max N characters and write them down your Channel in a retry loop.
> Anyone dealing with async NIO is probably familiar with the concept of
> having a retry loop. It may also work perfectly well in a framed message
> based protocol. In practice any network protocol that has fixed-size framed
> messages and deals with arbitrary size encodings has to have a way to
> fragment longer-length blobs of data into its fixed size messages.
>
> I think either strategy for dealing with failure is valid. The problem is
> that if the API uses the return value to indicate failure, which I think is
> a good idea in a low-level performance-oriented API, then it's difficult to
> offer both choices to the user. (1) needs the failure return code to be the
> number of bytes required for encoding. (2) needs the failure return code to
> indicate how far into the String you are in order to retry. I suspect, given
> this tradeoff, that Sherman's suggestion of using a -length (required number
> of bytes) return value is a good idea, on the assumption that API users only
> attempt (1) as a solution to the too-small-buffer failure.
>
Just to add that the existing low-level / advanced API for this is
CharsetEncoder. The CoderResult from an encode and the buffer positions
mean you know when there is overflow, the number of characters encoded,
and how many bytes were added to the buffer. It also gives fine control
over how encoding errors are handled, and you can cache a CharsetEncoder
to avoid some of the performance anomalies that come up in the Charset
vs. charset name discussions. This is not an API that most developers
will ever use directly, but if the use-case is advanced (libraries
or frameworks doing their own memory management, as you mention above)
then it might be an alternative to look at to avoid adding advanced
use-case APIs to String. I don't think an encode(String, ByteBuffer)
would look out of place, although it would need a way to return the
count of characters encoded as part of the result.
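
To make the CharsetEncoder route concrete, here is a minimal sketch of the
"write what fits, then retry" loop (strategy 2 above) using only the existing
API. The class name ChunkedEncodeSketch, the helper encodeInChunks and the
ByteArrayOutputStream standing in for a channel are assumptions made for
illustration; none of them is a proposed or existing JDK API.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

final class ChunkedEncodeSketch {

    // Hypothetical helper: encodes s through a small, reusable buffer and
    // returns every byte produced. The sink stands in for a channel write.
    static byte[] encodeInChunks(String s, CharsetEncoder encoder, ByteBuffer buf)
            throws CharacterCodingException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        CharBuffer in = CharBuffer.wrap(s);
        encoder.reset();
        buf.clear();
        CoderResult cr;
        do {
            int charsBefore = in.position();
            int bytesBefore = buf.position();

            // The whole String is available, so endOfInput is true on every pass.
            cr = encoder.encode(in, buf, true);
            if (cr.isUnderflow()) {
                cr = encoder.flush(buf); // all input consumed; emit any trailing bytes
            }
            if (cr.isError()) {
                cr.throwException(); // malformed or unmappable input
            }

            // The buffer positions report progress for this pass.
            int charsEncoded = in.position() - charsBefore;
            int bytesAdded = buf.position() - bytesBefore;
            System.out.printf("encoded %d chars into %d bytes%n", charsEncoded, bytesAdded);
            if (cr.isOverflow() && bytesAdded == 0) {
                throw new IllegalArgumentException("buffer too small for a single character");
            }

            // Drain the buffer to the sink, then reuse it for the next pass.
            buf.flip();
            byte[] chunk = new byte[buf.remaining()];
            buf.get(chunk);
            sink.write(chunk, 0, chunk.length);
            buf.clear();
        } while (cr.isOverflow());
        return sink.toByteArray();
    }

    public static void main(String[] args) throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder(); // cacheable per thread
        byte[] out = encodeInChunks("h\u00e9llo \u20ac w\u00f6rld", encoder, ByteBuffer.allocate(4));
        System.out.println(out.length + " bytes in total");
    }
}
```

The encoder only consumes a character once its complete encoding has been
written to the buffer, so each pass ends on a character boundary and the
CharBuffer position plays the role of the "source offset". The sketch drains
the whole buffer into the sink on every pass, where a real channel write may
accept only part of it, and the per-pass byte[] copy is a shortcut that
pooled-buffer code would avoid.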

-Alan.