RFR: JDK-8021560, (str) String constructors that take ByteBuffer

Richard Warburton richard.warburton at gmail.com
Fri Feb 16 22:55:06 UTC 2018


Hi gents,

Thanks Sherman for taking up the bat on this one - most appreciated.

>>>> These four methods encode as many characters as possible into the
>>>> destination byte[] or buffer but don't give any indication that the
>>>> destination didn't have enough space to encode the entire string. I thus
>>>> worry they could be a hazard and result in buggy code. If there is
>>>> insufficient space then the user of the API doesn't know how many
>>>> characters were encoded so it's not easy to substring and call getBytes
>>>> again to encode the remaining characters. There is also the issue of how to
>>>> size the destination. What would you think about having them fail when
>>>> there is insufficient space? If they do fail then there is a side effect
>>>> that they will have written to the destination so that would need to be
>>>> documented too.
>>>>
>>>
>> I share Alan's concern here.
>>
>> If the intent is to reuse a byte[] or a ByteBuffer, then there needs to
>> be an effective way to handle the case where the provided array/buffer
>> doesn't have enough room to receive the decoded string. A variety of ways
>> of dealing with this have been mentioned, such as throwing an exception
>> or returning a negative value to indicate failure,
>>
>
> To add the reference for the discussion, here is one of the alternatives
> being discussed: return "-length", the negative of the space needed to
> encode the whole String, to hint the caller to pass in an appropriately
> sized buffer (only the String APIs are updated). This appears to be
> better than the proposed "partial fill"/encode-as-many-as-possible
> approach. But it still forces the caller to take care of the return
> value and manage the buffer sizing himself, which raises the question
> of whether the CharsetDecoder is a better place for this kind of use
> scenario.
>
> http://cr.openjdk.java.net/~sherman/8021560/webrev.v2
>

I think some of the context here around application-level memory management
of the byte buffers is missing. The getBytes() methods are intended to be
useful in scenarios where you have a String and you want to write its
byte-encoded form down some kind of connection. So you're going to take
the byte[] or ByteBuffer and hand it off somewhere: either to a streaming
protocol (e.g. TCP), a framed message-based protocol (e.g. UDP, Aeron, etc.)
or perhaps to a file. A common pattern for dealing with this kind of buffer
is to avoid allocating a new ByteBuffer for every message and instead to
encode onto the existing buffers before writing them into something else,
for example a NIO channel.
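To make that pattern concrete, here is roughly what it looks like today
with a reused CharsetEncoder and ByteBuffer (the MessageWriter class, the
buffer size and the UTF-8 choice are mine, purely for illustration); the
proposed getBytes() methods would streamline the CharBuffer.wrap + encode
step in the middle:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.channels.WritableByteChannel;
    import java.nio.charset.CharsetEncoder;
    import java.nio.charset.CoderResult;
    import java.nio.charset.StandardCharsets;

    final class MessageWriter {
        // One encoder and one buffer, reused across all messages.
        private final CharsetEncoder encoder =
            StandardCharsets.UTF_8.newEncoder();
        private final ByteBuffer buffer = ByteBuffer.allocateDirect(4096);

        void write(String message, WritableByteChannel channel)
                throws IOException {
            CharBuffer src = CharBuffer.wrap(message);
            encoder.reset();
            CoderResult result;
            do {
                buffer.clear();
                result = encoder.encode(src, buffer, true);
                if (result.isUnderflow()) {
                    encoder.flush(buffer); // no-op for UTF-8, needed in general
                }
                buffer.flip();
                while (buffer.hasRemaining()) {
                    channel.write(buffer); // partial writes happen on
                }                          // non-blocking channels
            } while (result.isOverflow()); // buffer filled: go round again
        }
    }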

The review is completely correct that the user of the API needs to know when
the ByteBuffer isn't large enough to hold the encoding, but I think there
are two strategies for expected API usage here if you run out of space.

1. Find out how much space you need, grow your buffer and encode onto the
bigger buffer. This means that in the failure case the user ideally gets to
know how big a buffer is needed. I think this still mitigates per-message
buffer allocation, because in practice you only allocate a larger buffer
when a String is encoded that is longer than any String you've seen before.
Strictly speaking it isn't necessary to know how big a buffer is needed: as
long as failure is indicated, an API user could employ a strategy like
doubling the buffer size and retrying. That's suboptimal to say the least,
however, and knowing how big the buffer needs to be is desirable.
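As a minimal sketch of (1), assuming the -length convention from Sherman's
webrev and a hypothetical int getBytes(ByteBuffer, Charset) overload (the
exact signature under review may well differ):

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    final class GrowingEncoder {
        private ByteBuffer buffer = ByteBuffer.allocate(1024);

        ByteBuffer encode(String s) {
            buffer.clear();
            // Hypothetical overload: returns bytes written on success,
            // or -(bytes required) when the buffer is too small.
            int n = s.getBytes(buffer, StandardCharsets.UTF_8);
            if (n < 0) {
                // Grow to exactly the required size and retry; we only
                // allocate when s is longer than any String seen so far.
                buffer = ByteBuffer.allocate(-n);
                s.getBytes(buffer, StandardCharsets.UTF_8);
            }
            buffer.flip();
            return buffer;
        }
    }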

2. Just write the bytes that you've already encoded down the stream and
retry with a source offset advanced by the number of characters written.
This requires that the getBytes() method encodes in terms of whole
characters, rather than running out of space midway through a character
that encodes to multiple bytes, and also that it takes a "source offset"
parameter: say, the number of characters into the String that you have
already encoded. This would work perfectly well in a streaming protocol:
if your buffer size is N bytes, you encode as many whole characters as fit
into N bytes and write them down your Channel in a retry loop. Anyone
dealing with async NIO is probably familiar with the concept of a retry
loop. It may also work perfectly well in a framed message-based protocol:
in practice any network protocol that has fixed-size framed messages and
deals with arbitrary-size encodings has to have a way to fragment larger
blobs of data into its fixed-size messages.
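A sketch of (2) under the same caveat: it assumes a hypothetical
int getBytes(int srcOff, ByteBuffer, Charset) overload that encodes only
whole characters and returns the index of the next character to encode:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.WritableByteChannel;
    import java.nio.charset.StandardCharsets;

    final class StreamingEncoder {
        private final ByteBuffer buffer = ByteBuffer.allocate(1024);

        void write(String s, WritableByteChannel channel)
                throws IOException {
            int offset = 0;                  // chars of s already encoded
            while (offset < s.length()) {
                buffer.clear();
                // Hypothetical overload: encodes as many whole characters
                // as fit, starting at offset; returns the next offset.
                offset = s.getBytes(offset, buffer, StandardCharsets.UTF_8);
                buffer.flip();
                while (buffer.hasRemaining()) {
                    channel.write(buffer);   // the familiar NIO retry loop
                }
            }
        }
    }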

I think either strategy for dealing with failure is valid; the problem is
that if the API uses the return value to indicate failure, which I think is
a good idea in a low-level, performance-oriented API, then it's difficult to
offer both choices to the user. (1) needs the failure return code to be the
number of bytes required for encoding; (2) needs the failure return code to
indicate how far into the String you are in order to retry. Given this
tradeoff, I suspect Sherman's suggestion of a -length (required number of
bytes) return value is a good idea, with the assumption that API users
attempt only (1) as the solution to the too-small-buffer failure.

regards,

  Richard Warburton

  http://insightfullogic.com
  @RichardWarburto <http://twitter.com/richardwarburto>

