RFR: JDK-8021560, (str) String constructors that take ByteBuffer
Hi,

Please help review the proposal to add the following constructors and methods to the String class, taking a ByteBuffer as the input or output data buffer:

public String(ByteBuffer bytes, Charset cs);
public String(ByteBuffer bytes, String csname);
public int getBytes(byte[] dst, int offset, Charset cs);
public int getBytes(byte[] dst, int offset, String csname);
public int getBytes(ByteBuffer bytes, Charset cs);
public int getBytes(ByteBuffer bytes, String csname);

This is based on the change proposed by Richard in the jdk9 time frame [1]. But since we were fully occupied by the compact string support work back then, this one did not make it into the top jdk9 priority list.

issue: https://bugs.openjdk.java.net/browse/JDK-8021560
webrev: http://cr.openjdk.java.net/~sherman/8021560/webrev

The implementation in the StringCoding class is a little "duplicated" across the various encoding/decoding variants. There is definitely room to consolidate further. But I would assume it is better to keep the new code path separate from the existing "fully tuned" methods, for now, to avoid a possible performance regression in the existing String methods.

Thanks,
Sherman

[1] http://mail.openjdk.java.net/pipermail/core-libs-dev/2014-November/029672.ht...
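For context, the proposed String(ByteBuffer, Charset) constructor has a present-day equivalent via Charset.decode, which is what the new constructor would shortcut (and avoid an intermediate byte[] copy for). A minimal sketch, with the class and helper name being illustrative only:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // Today's equivalent of the proposed String(ByteBuffer, Charset):
    // decode the bytes between the buffer's position and its limit.
    static String decode(ByteBuffer src) {
        // Charset.decode consumes position..limit and replaces malformed
        // input, matching String's decoding behavior.
        return StandardCharsets.UTF_8.decode(src).toString();
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.wrap("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(decode(buf)); // prints "hello"
    }
}
```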
Is it necessary to add the "String csname" variants at this point in Java's lifetime? Don't most people use instances of Charset now?

Stephen

On 13 February 2018 at 06:24, Xueming Shen <xueming.shen@oracle.com> wrote:
Hi,
Please help review the proposal to add following constructors and methods in String class to take ByteBuffer as the input and output data buffer.
public String(ByteBuffer bytes, Charset cs); public String(ByteBuffer bytes, String csname); public int getBytes(byte[] dst, int offset, Charset cs); public int getBytes(byte[] dst, int offset, String csname); public int getBytes(ByteBuffer bytes, Charset cs); public int getBytes(ByteBuffer bytes, Charset csn);
This is based on the change proposed by Richard in jdk9 time frame [1]. But since we were fully occupied by compact string support work back then, this one did not make into the top jdk9 priority list.
issue: https://bugs.openjdk.java.net/browse/JDK-8021560 webrev: http://cr.openjdk.java.net/~sherman/8021560/webrev
The implementation in StringCoding class is a little "duplicated" for various types of encoding/decoding variants. There is definitely room to further consolidate. But I would assume it might be better to keep the new code path separated from the existing "fully tuned" methods, for now, to avoid possible performance regression for existing String methods.
Thanks, Sherman
[1] http://mail.openjdk.java.net/pipermail/core-libs-dev/2014-November/029672.ht...
On 13/02/2018 06:24, Xueming Shen wrote:
Hi,
Please help review the proposal to add following constructors and methods in String class to take ByteBuffer as the input and output data buffer.
public String(ByteBuffer bytes, Charset cs); public String(ByteBuffer bytes, String csname);

These constructors look good (as for the parameter names, I assume you meant "src" rather than "bytes" here).
public int getBytes(byte[] dst, int offset, Charset cs); public int getBytes(byte[] dst, int offset, String csname); public int getBytes(ByteBuffer bytes, Charset cs); public int getBytes(ByteBuffer bytes, Charset csn);

These four methods encode as many characters as possible into the destination byte[] or buffer, but don't give any indication that the destination didn't have enough space to encode the entire string. I thus worry they could be a hazard and result in buggy code. If there is insufficient space, the user of the API doesn't know how many characters were encoded, so it's not easy to substring and call getBytes again to encode the remaining characters. There is also the issue of how to size the destination. What would you think about having them fail when there is insufficient space? If they do fail, there is a side effect that they will have written to the destination, so that would need to be documented too.
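The feedback Alan is asking for already exists at the CharsetEncoder level: when the destination fills up, the CoderResult reports overflow and the buffer positions show exactly how much was consumed and produced. A small runnable sketch of that behavior:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class OverflowDemo {
    public static void main(String[] args) {
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
        CharBuffer in = CharBuffer.wrap("hello world");
        ByteBuffer out = ByteBuffer.allocate(5);   // deliberately too small

        CoderResult r = enc.encode(in, out, true);
        // OVERFLOW tells the caller the destination filled up; the buffer
        // positions say how much was consumed and produced -- exactly the
        // feedback the proposed int-returning getBytes() methods lack.
        System.out.println(r.isOverflow());  // true
        System.out.println(in.position());   // 5 (chars consumed)
        System.out.println(out.position());  // 5 (bytes written)
    }
}
```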
-Alan
On Tue, Feb 13, 2018 at 2:41 AM, Alan Bateman <Alan.Bateman@oracle.com> wrote:
On 13/02/2018 06:24, Xueming Shen wrote:
Hi,
Please help review the proposal to add following constructors and methods in String class to take ByteBuffer as the input and output data buffer.
public String(ByteBuffer bytes, Charset cs); public String(ByteBuffer bytes, String csname);
These constructors looks good (for the parameter names then I assume you meant "src" rather than "bytes" here).
public int getBytes(byte[] dst, int offset, Charset cs); public int getBytes(byte[] dst, int offset, String csname); public int getBytes(ByteBuffer bytes, Charset cs); public int getBytes(ByteBuffer bytes, Charset csn);
These four methods encode as many characters as possible into the destination byte[] or buffer but don't give any indication that the destination didn't have enough space to encode the entire string. I thus worry they could be a hazard and result in buggy code. If there is insufficient space then the user of the API doesn't know how many characters were encoded so it's not easy to substring and call getBytes again to encode the remaining characters. There is also the issue of how to size the destination. What would you think about having them fail when there is insufficient space? If they do fail then there is a side effect that they will have written to the destination so that would need to be documented too.
The ones that output to a ByteBuffer have more flexibility in that the buffer position can be moved according to the number of bytes written, but the method _could_ return the number of _chars_ actually written. This is not particularly useful without variants that accept an offset into the string, though, unless it can be shown that s.substring(coffs).getBytes(xxx) is reasonably efficient.

It might be better to shuffle this around a little and instead have a Charset[Encoder].getBytes(int codePoint, byte[] b, int offs, int len) / .getBytes(int codePoint, ByteBuffer buf) kind of thing which returns the number of bytes, or e.g. -1 or -count if there isn't enough space in the target. Then it would be less onerous for users to write simple for-each-codepoint loops which encode as far as is reasonable but no farther, without too many error-handling gymnastics.

--
- DML
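A sketch of the per-code-point shape David describes, hand-rolled here for UTF-8 only (the method names and the -1 failure convention are hypothetical, not an existing JDK API):

```java
public class CodePointEncode {
    // Hypothetical per-code-point primitive: returns bytes written,
    // or -1 if the destination lacks room for the whole code point.
    static int getBytes(int cp, byte[] b, int off, int len) {
        if (cp < 0x80) {
            if (len < 1) return -1;
            b[off] = (byte) cp;
            return 1;
        } else if (cp < 0x800) {
            if (len < 2) return -1;
            b[off]     = (byte) (0xC0 | (cp >> 6));
            b[off + 1] = (byte) (0x80 | (cp & 0x3F));
            return 2;
        } else if (cp < 0x10000) {
            if (len < 3) return -1;
            b[off]     = (byte) (0xE0 | (cp >> 12));
            b[off + 1] = (byte) (0x80 | ((cp >> 6) & 0x3F));
            b[off + 2] = (byte) (0x80 | (cp & 0x3F));
            return 3;
        } else {
            if (len < 4) return -1;
            b[off]     = (byte) (0xF0 | (cp >> 18));
            b[off + 1] = (byte) (0x80 | ((cp >> 12) & 0x3F));
            b[off + 2] = (byte) (0x80 | ((cp >> 6) & 0x3F));
            b[off + 3] = (byte) (0x80 | (cp & 0x3F));
            return 4;
        }
    }

    // The for-each-codepoint loop David has in mind: encode as far as
    // there is room, then stop cleanly at a code point boundary.
    static int encode(String s, byte[] dst) {
        int pos = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            int n = getBytes(cp, dst, pos, dst.length - pos);
            if (n < 0) break;          // destination full
            pos += n;
            i += Character.charCount(cp);
        }
        return pos;
    }
}
```

Because failure happens only at code point boundaries, the caller always knows the output ends on a whole character and can substring to continue.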
On 2/13/18, 12:41 AM, Alan Bateman wrote:
public int getBytes(byte[] dst, int offset, Charset cs); public int getBytes(byte[] dst, int offset, String csname); public int getBytes(ByteBuffer bytes, Charset cs); public int getBytes(ByteBuffer bytes, Charset csn); These four methods encode as many characters as possible into the destination byte[] or buffer but don't give any indication that the destination didn't have enough space to encode the entire string. I thus worry they could be a hazard and result in buggy code. If there is insufficient space then the user of the API doesn't know how many characters were encoded so it's not easy to substring and call getBytes again to encode the remaining characters. There is also the issue of how to size the destination. What would you think about having them fail when there is insufficient space? If they do fail then there is a side effect that they will have written to the destination so that would need to be documented too.
Well, it's the old "how to control/manage the en/decoding state" issue we have to deal with for this kind of API. I think we faced a similar issue when designing the base64 en/decoder. A possible solution is to return a negative value to indicate that the buffer/byte[] provided is not big enough, with its absolute value being the length needed to de/encode the whole String. So you can pass in a zero-remaining ByteBuffer or a byte[0] to ask for the size needed. Or we could go even further and take a null, like the old C str* methods, to return the size needed, though I don't think we want to go that far :-)

That said, one of the motivations for these APIs is to ease de/encoding between ByteBuffer and String, and the extra step of handling the "needed buffer size" appears to work against that original goal. But the API can always say "please provide a big-enough buffer; otherwise the method returns the size needed as a negative value and the passed-in buffer might be partially filled ...".

-Sherman
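The negative-return-value convention Sherman describes can be sketched in a few lines (hypothetical class and method, using String.getBytes(Charset) as a simple reference implementation rather than the tuned StringCoding path):

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

public class NegativeLengthDemo {
    // Returns the number of bytes written, or the required size as a
    // negative value if the destination buffer is too small.
    static int getBytes(String s, Charset cs, ByteBuffer dst) {
        byte[] encoded = s.getBytes(cs);   // reference implementation only
        if (encoded.length > dst.remaining())
            return -encoded.length;        // caller learns the needed size
        dst.put(encoded);
        return encoded.length;
    }
}
```

As noted above, passing a zero-remaining ByteBuffer becomes a pure "how much space do I need?" query.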
public String(ByteBuffer bytes, Charset cs); public String(ByteBuffer bytes, String csname);
I think these constructors make good sense. They avoid an extra copy to an intermediate byte[].

One issue (also mentioned by Stephen Colebourne) is whether we need the csname overload. Arguably it's not needed if we have the Charset overload. And the csname overload throws UnsupportedEncodingException, which is checked. But the csname overload is apparently faster, since the decoder can be cached, and it's unclear when this can be remedied for the Charset case....

I could go either way on this one.

**

I'd also suggest adding a CharBuffer constructor:

public String(CharBuffer cbuf)

This would be semantically equivalent to

public String(char[] value, int offset, int count)

except using the chars from the CharBuffer between the buffer's position and its limit.

**

Regarding the getBytes() overloads:
public int getBytes(byte[] dst, int offset, Charset cs); public int getBytes(byte[] dst, int offset, String csname); public int getBytes(ByteBuffer bytes, Charset cs); public int getBytes(ByteBuffer bytes, Charset csn);
On 2/13/18, 12:41 AM, Alan Bateman wrote:
These four methods encode as many characters as possible into the destination byte[] or buffer but don't give any indication that the destination didn't have enough space to encode the entire string. I thus worry they could be a hazard and result in buggy code. If there is insufficient space then the user of the API doesn't know how many characters were encoded so it's not easy to substring and call getBytes again to encode the remaining characters. There is also the issue of how to size the destination. What would you think about having them fail when there is insufficient space? If they do fail then there is a side effect that they will have written to the destination so that would need to be documented too.
I share Alan's concern here. If the intent is to reuse a byte[] or a ByteBuffer, then there needs to be an effective way to handle the case where the provided array/buffer doesn't have enough room to receive the decoded string. A variety of ways of dealing with this have been mentioned, such as throwing an exception; returning a negative value to indicate failure, possibly also encoding the number of bytes written; or even allocating a fresh array or buffer of the proper size and returning that instead. The caller would have to check the return value and take care to handle all the cases properly. This is likely to be fairly error-prone.

This also raises the question in my mind of what these getBytes() methods are intended for. On the one hand, they might be useful for letting the caller manage its own memory allocation and reuse of arrays/buffers. If so, then it's necessary for intermediate results from partial processing to be handled properly. If the destination fills up, there needs to be a way to report how much of the input was consumed, so that a subsequent operation can pick up where the previous one left off. (This was one of David Lloyd's points.) If there's sufficient room in the destination, there needs to be a way to report this and how much space remains in the destination. One could contemplate adding all this information to the API. This eventually leads to CharsetEncoder.encode(CharBuffer in, ByteBuffer out, boolean endOfInput), which has all the necessary partial-progress state in the buffers.

On the other hand, maybe the intent of these APIs is convenience. I'd observe that String already has this method:

public byte[] getBytes(Charset)

which returns the encoded bytes in a newly allocated array of the proper size. This is pretty convenient. It doesn't let the caller reuse a destination array or buffer... but that's what brings in all the partial-result edge cases.

Bottom line is that I'm not entirely sure of the use of these new getBytes() overloads.
Maybe I've missed a use case where these work; if so, maybe somebody can describe it.

s'marks
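The CharsetEncoder.encode(in, out, endOfInput) loop Stuart points to handles all the partial-progress cases: the buffer positions carry the state between rounds. A runnable sketch that splits a String into fixed-size encoded chunks (class and method names are illustrative):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ChunkedEncode {
    // Encode a String into chunkSize-byte pieces; the buffer positions
    // track how far the encoding has progressed, so no extra bookkeeping
    // is needed in the API itself. chunkSize must be >= 4 (the UTF-8
    // maximum bytes per code point) to guarantee forward progress.
    static List<byte[]> encodeInChunks(String s, int chunkSize) {
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
        CharBuffer in = CharBuffer.wrap(s);
        ByteBuffer out = ByteBuffer.allocate(chunkSize);
        List<byte[]> chunks = new ArrayList<>();
        CoderResult r;
        do {
            r = enc.encode(in, out, true);   // true: no more input follows
            out.flip();
            if (out.hasRemaining()) {
                byte[] chunk = new byte[out.remaining()];
                out.get(chunk);
                chunks.add(chunk);
            }
            out.clear();
        } while (r.isOverflow());            // keep going until input drained
        enc.flush(out);   // UTF-8 buffers no state, but the contract asks for it
        return chunks;
    }
}
```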
On 2/15/18, 1:55 PM, Stuart Marks wrote:
On 2/13/18, 12:41 AM, Alan Bateman wrote:
These four methods encode as many characters as possible into the destination byte[] or buffer but don't give any indication that the destination didn't have enough space to encode the entire string. I thus worry they could be a hazard and result in buggy code. If there is insufficient space then the user of the API doesn't know how many characters were encoded so it's not easy to substring and call getBytes again to encode the remaining characters. There is also the issue of how to size the destination. What would you think about having them fail when there is insufficient space? If they do fail then there is a side effect that they will have written to the destination so that would need to be documented too.
I share Alan's concern here.
If the intent is to reuse a byte[] or a ByteBuffer, then there needs to be an effective way to handle the case where the provided array/buffer doesn't have enough room to receive the decoded string. A variety of ways of dealing with this have been mentioned, such as throwing an exception; returning negative value to indicate failure,
To add a ref for the discussion: here is one of the alternatives being discussed, returning "-length", the negative value of the space needed to encode the whole String, to hint the caller to pass in an appropriately sized buffer (only the String APIs are updated). This appears to be better than the proposed "partial fill"/encode-as-many-as-possible approach. But it will still force callers to take care of the return value and manage the possible buffer-size issue themselves, which raises the question of whether the CharsetDecoder is a better place for this kind of use scenario.

http://cr.openjdk.java.net/~sherman/8021560/webrev.v2

-Sherman
Hi,

what is the difference between:
- class StringCoding
- class StringCoder
- class StringCoding.StringDecoder
- class StringCoding.StringEncoder

Why do we need so many variants of redundant code? I think it would be useful to have some explanation in the javadoc.

Missing indentation in StringCoder.java:

322 if (enc == null) {
323 enc = cs.newEncoder()
324 .onMalformedInput(CodingErrorAction.REPLACE)
325 .onUnmappableCharacter(CodingErrorAction.REPLACE);
326 }

-Ulf

On 16.02.2018 at 00:25, Xueming Shen wrote:
http://cr.openjdk.java.net/~sherman/8021560/webrev.v2
-Sherman
On 2/16/18, 5:58 AM, Ulf Zibis wrote:
Hi,
what is the difference of: - class StringCoding - class StringCoder - class StringCoding.StringDecoder - class StringCoding.StringEncoder
Why we need so much variants of redundant code? I think it would be useful to have some explanation in the javadoc.
Ulf,

Ignore StringCoder for now. It's an experimental version to replace StringCoding, to reduce "some" of the redundant code.

-Sherman
Hi gents,

Thanks Sherman for taking up the bat on this one - most appreciated.

These four methods encode as many characters as possible into the
destination byte[] or buffer but don't give any indication that the destination didn't have enough space to encode the entire string. I thus worry they could be a hazard and result in buggy code. If there is insufficient space then the user of the API doesn't know how many characters were encoded so it's not easy to substring and call getBytes again to encode the remaining characters. There is also the issue of how to size the destination. What would you think about having them fail when there is insufficient space? If they do fail then there is a side effect that they will have written to the destination so that would need to be documented too.
I share Alan's concern here.
If the intent is to reuse a byte[] or a ByteBuffer, then there needs to be an effective way to handle the case where the provided array/buffer doesn't have enough room to receive the decoded string. A variety of ways of dealing with this have been mentioned, such as throwing an exception; returning negative value to indicate failure,
To add the ref for the discussion. Here is one of the alternatives being discussed, to return the "-length", negative value of the space needed to encode the whole String, to hint the caller to pass in an appropriately sized buffer (only the String APIs are updated). This appears to be a better than the proposed "partial fill"/encode as many as possible approach. But it will still force the caller to take care of the return value and manage the possible buffer size issue himself, which triggers the question that whether or not the CharsetDecoder is a better place for such kind of use scenario.
I think some of the context here around application-level memory management of the byte buffers is missing. The getBytes() methods were supposed to be useful in scenarios where you have a String and you want to write the byte-encoded version of it down some kind of connection. So you're going to take the byte[] or ByteBuffer and hand it off somewhere - either to a streaming protocol (eg: TCP), a framed message-based protocol (eg: UDP, Aeron, etc.) or perhaps to a file. A common pattern for dealing with this kind of buffer is to avoid allocating a new ByteBuffer for every message, and to encode onto the existing buffers before writing them into something else, for example a NIO channel.

The review is completely correct that the API's user needs to know when the ByteBuffer isn't large enough to deal with the encoding, but I think there are two strategies for expected API usage here if you run out of space.

1. Find out how much space you need, grow your buffer size and encode onto the bigger buffer. This means that in the failure case the user ideally gets to know how big a buffer is needed. I think this still works in terms of mitigating per-message buffer allocation, as in practice it means that you only allocate a larger buffer when a String is encoded that is longer than any previous String you've seen before. It isn't strictly necessary to know how big a buffer is needed, btw - as long as failure is indicated, an API user could employ a strategy like doubling the buffer size and retrying. I think that's suboptimal to say the least, however, and knowing how big a buffer needs to be is desirable.

2. Just write the bytes that you've encoded down the stream and retry with an offset incremented by the number of characters written. This requires that the getBytes() method encodes in terms of whole characters, rather than running out of space when encoding, say, a character that takes up multiple bytes encoded, and also takes a "source offset" parameter - say, the number of characters into the String that you are. This would work perfectly well in a streaming protocol. If your buffer size is N, you encode max N characters and write them down your Channel in a retry loop. Anyone dealing with async NIO is probably familiar with the concept of having a retry loop. It may also work perfectly well in a framed message-based protocol. In practice any network protocol that has fixed-size framed messages and deals with arbitrary-size encodings has to have a way to fragment longer blobs of data into its fixed-size messages.

I think either strategy for dealing with failure is valid. The problem is that if the API uses the return value to indicate failure - which I think is a good idea in a low-level, performance-oriented API - then it's difficult to offer both choices to the user. (1) needs the failure return code to be the number of bytes required for encoding. (2) needs the failure return code to indicate how far into the String you are in order to retry. I suspect, given this tradeoff, that Sherman's suggestion of using a -length (required number of bytes) return value is a good idea, just assuming API users only attempt (1) as a solution to the too-small-buffer failure.

regards,

Richard Warburton
http://insightfullogic.com
@RichardWarburto <http://twitter.com/richardwarburto>
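Strategy (1) above can be sketched concretely against the negative-length convention from Sherman's webrev.v2 (the getBytes helper below stands in for the proposed String method, implemented here via String.getBytes(Charset) for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class GrowAndRetry {
    // Stand-in for the proposed getBytes(ByteBuffer, Charset): returns
    // bytes written, or the required size as a negative value on failure.
    static int getBytes(String s, ByteBuffer dst) {
        byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
        if (encoded.length > dst.remaining()) return -encoded.length;
        dst.put(encoded);
        return encoded.length;
    }

    // Reused across messages; grows only when a longer String than any
    // previously seen arrives -- Richard's strategy (1).
    static ByteBuffer buf = ByteBuffer.allocate(16);

    static ByteBuffer encode(String s) {
        buf.clear();
        int n = getBytes(s, buf);
        if (n < 0) {                          // too small: grow to the
            buf = ByteBuffer.allocate(-n);    // exact reported size, retry
            getBytes(s, buf);
        }
        buf.flip();                           // ready for channel.write(buf)
        return buf;
    }
}
```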
On 16/02/2018 22:55, Richard Warburton wrote:
: I think some of the context here around application level memory management of the byte buffers is missing. The getBytes() methods were supposed to be useful in scenarios where you have a String and you want to write the byte encoded version of it down some kind of connection. So you're going to take the byte[] or ByteBuffer and hand it off somewhere - either to a streaming protocol (eg: TCP), a framed message based protocol (eg: UDP, Aeron, etc.) or perhaps to a file. A common pattern for dealing with this kind of buffer is to avoid trying to allocate a new ByteBuffer for every message and to encode onto the existing buffers before writing them into something else, for example a NIO channel.
The review is completely correct that the API's user needs to know when the ByteBuffer isn't large enough to deal with the encoding, but I think there are two strategies for expected API usage here if you run out of space.
1. Find out how much space you need, grow your buffer size and encode onto the bigger buffer. So this means that in the failure case the user ideally gets to know how big a buffer you need. I think this still works in terms of mitigating per message buffer allocation as in practice it means that you only allocate a larger buffer when a String is encoded that is longer than any previous String that you've seen before. It isn't strictly necessary to know how big a buffer is needed btw - as long as failure is indicated an API user could employ a strategy like double the buffer size and retry. I think that's suboptimal to say the least, however, and knowing how big a buffer needs to be is desirable.
2. Just write the bytes that you've encoded down the stream and retry with an offset incremented by the number of characters written. This requires that the getBytes() method encodes in terms of whole characters, rather than running out of space when encoding say a character that takes up multiple bytes encoded and also takes a "source offset" parameter - say the number of characters into the String that you are? This would work perfectly well in a streaming protocol. If your buffer size is N, you encode max N characters and write them down your Channel in a retry loop. Anyone dealing with async NIO is probably familiar with the concept of having a retry loop. It may also work perfectly well in a framed message based protocol. In practice any network protocol that has fixed-size framed messages and deals with arbitrary size encodings has to have a way to fragment longer-length blobs of data into its fixed size messages.
I think either strategy for dealing with failure is valid, the problem is that if the API uses the return value to indicate failure, which I think is a good idea in a low-level performance oriented API then its difficult to offer both choices to the user. (1) needs the failure return code to be the number of bytes required for encoding. (2) needs the failure return code to indicate how far into the String you are in order to retry. I suspect given this tradeoff that Sherman's suggestion of using a -length (required number of bytes) return value is a good idea and just assuming API users only attempt (1) as a solution to the too-small-buffer failure.
Just to add that the existing low-level / advanced API for this is CharsetEncoder. The CoderResult from an encode and the buffer positions mean you know when there is overflow, the number of characters encoded, and how many bytes were added to the buffer. It also gives fine control over how encoding errors should be handled, and you can cache a CharsetEncoder to avoid some of the performance anomalies that come up in the Charset vs. charset-name discussions. This is not an API that most developers will ever use directly, but if the use-case is advanced (libraries or frameworks doing their own memory management, as you mention above) then it might be an alternative to look at, to avoid adding advanced use-case APIs to String. I don't think an encode(String, ByteBuffer) would look out of place, although it would need a way to return the characters-encoded count as part of the result.

-Alan.
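The encode(String, ByteBuffer) shape Alan floats can be sketched as a thin wrapper over the existing CharsetEncoder API, returning the characters-encoded count (the method itself is hypothetical, not an existing CharsetEncoder member):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class EncoderWrap {
    // Hypothetical encode(String, ByteBuffer): encode as much as fits
    // and report how many chars were consumed, so the caller can
    // substring and continue from there.
    static int encode(CharsetEncoder enc, String s, ByteBuffer out) {
        CharBuffer in = CharBuffer.wrap(s);
        enc.reset();
        enc.encode(in, out, true);
        return in.position();   // characters actually encoded
    }

    public static void main(String[] args) {
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
        ByteBuffer out = ByteBuffer.allocate(5);
        System.out.println(encode(enc, "hello world", out)); // 5
    }
}
```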
On Sat, Feb 17, 2018 at 4:05 AM, Alan Bateman <Alan.Bateman@oracle.com> wrote:
Just to add that the existing low-level / advanced API for this is CharsetEncoder. The CoderResult from an encode and the buffer positions means you know when there is overflow, the number of characters encoded, and how many bytes were added to the buffer. It also gives fine control on how encoding errors should be handled and you cache a CharsetEncoder to avoid some of the performance anomalies that come up in the Charset vs. charset name discussions.
This is not an API that most developers will ever use directly but if the use-case is advanced cases (libraries or frameworks
Really? How else are you supposed to convert bytes <-> characters reliably in Java without using CharsetEncoder/Decoder? It's the only way to get an exception when something is wrong, instead of silently masking errors with replacement characters (the JDK seems to think it's doing people favors there; it's not).
On 15/02/2018 21:55, Stuart Marks wrote:
:
I'd also suggest adding a CharBuffer constructor:
public String(CharBuffer cbuf)
This would be semantically equivalent to
public String(char[] value, int offset, int count)
except using the chars from the CharBuffer between the buffer's position and its limit.

CharBuffer::toString already does this, so adding this constructor may not be as useful as it initially looks.
-Alan
On 2/16/18 6:14 AM, Alan Bateman wrote:
On 15/02/2018 21:55, Stuart Marks wrote:
I'd also suggest adding a CharBuffer constructor:
public String(CharBuffer cbuf)
This would be semantically equivalent to
public String(char[] value, int offset, int count)
except using the chars from the CharBuffer between the buffer's position and its limit.
CharBuffer::toString already does this so adding this constructor may not be as useful as it initially looks.
Seems like an argument that this function is in the wrong location. (Not joking.)

I think CharBuffer.toString() is actually quite obscure. I note that most of the Buffer classes have toString() methods that report the *status* of the buffer, e.g.

java.nio.HeapByteBuffer[pos=0 lim=1000 cap=1000]

Compared to the other Buffers' toString() methods, CharBuffer is the outlier: its toString() produces only the contents of the CharBuffer without any meta-information. This is rather surprising.

s'marks
On 16/02/2018 21:56, Stuart Marks wrote:
:
Seems like an argument that this function is in the wrong location.
(Not joking.)
I think CharBuffer.toString() is actually quite obscure. I note that most of the Buffer classes have toString() methods that report the *status* of the buffer, e.g.
java.nio.HeapByteBuffer[pos=0 lim=1000 cap=1000]
Compared to other Buffers' toString() methods, CharBuffer is the outlier: its toString() produces only the contents of the CharBuffer without any meta-information. This is rather surprising.

toString() is specified by CharSequence, so the CharBuffer implementation has to return a String containing the characters in the buffer.
-Alan
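The toString() asymmetry discussed above is easy to see side by side (class name here is illustrative):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;

public class ToStringDemo {
    public static void main(String[] args) {
        ByteBuffer bb = ByteBuffer.allocate(1000);
        CharBuffer cb = CharBuffer.wrap("hello");

        // ByteBuffer (like most Buffers) reports its status...
        System.out.println(bb); // java.nio.HeapByteBuffer[pos=0 lim=1000 cap=1000]
        // ...while CharBuffer, being a CharSequence, must return the
        // characters between its position and its limit.
        System.out.println(cb); // hello
    }
}
```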
Hi gents,

public String(ByteBuffer bytes, Charset cs);
public String(ByteBuffer bytes, String csname);
I think these constructors make good sense. They avoid an extra copy to an intermediate byte[].
One issue (also mentioned by Stephen Colebourne) is whether we need the csname overload. Arguably it's not needed if we have the Charset overload. And the csname overload throws UnsupportedEncodingException, which is checked. But the csname overload is apparently faster, since the decoder can be cached, and it's unclear when this can be remedied for the Charset case....
I could go either way on this one.
Yes - the original motivation for the csname overload was the ability to cache the decoder, and also consistency with other String encoding methods, which normally have both the String-name and Charset variants. If memory serves me correctly, it was concluded that it was possible to cache the decoder iff the decoder could be trusted not to hold a reference to the formerly char[], now byte[], that sits within the String. Otherwise a rogue decoder could create a mutable string by holding a reference to its internal state for some malicious purpose. JDK charsets aren't rogue, so it would be possible to check whether a charset is a JDK one and perform the caching, should that be a desirable and worthwhile optimization.

regards,

Richard Warburton
http://insightfullogic.com
@RichardWarburto <http://twitter.com/richardwarburto>
On 02/16/2018 03:04 PM, Richard Warburton wrote:
Hi gents,
public String(ByteBuffer bytes, Charset cs); public String(ByteBuffer bytes, String csname);
I think these constructors make good sense. They avoid an extra copy to an intermediate byte[].
One issue (also mentioned by Stephen Colebourne) is whether we need the csname overload. Arguably it's not needed if we have the Charset overload. And the csname overload throws UnsupportedEncodingException, which is checked. But the csname overload is apparently faster, since the decoder can be cached, and it's unclear when this can be remedied for the Charset case....
I could go either way on this one.
Yes - the original motivation for the csname overload was the ability to cache the decoder and also for consistency with other String encoding methods that normally have both the String name and also the Charset. If memory serves me correctly it was concluded that it was possible to cache the decoder iff the decoder could be trusted to not hold a reference to the formerly char[], now byte[], that sits within the String. Otherwise a rogue decoder could create a mutable string by holding a reference to its internal state for some malicious purpose. JDK charsets aren't rogue so it would be possible to check whether a charset was a JDK one and perform the caching, should that be a desirable + worthwhile optimization.
Yes, we don't cache external cs now (and yes, we still have to make some defensive copies here and there :-)). The StringCoder.java included in the v2 webrev tries to cache the de/encoder from the jdk's own charsets. So if things go as well as expected (rounds of jmh measurement), there should be no performance difference between csn and cs. In that case, should we drop the public String(ByteBuffer, String) constructor, as Stephen suggested? Though the callers then need to get the charset somewhere themselves, via Charset.forName() for example.

http://cr.openjdk.java.net/~sherman/8021560/StringCoder.java

-sherman
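The per-thread caching pattern being discussed can be sketched roughly as follows. This is a simplified illustration only; the real StringCoding also verifies the charset is a trusted JDK one before caching, per Richard's point about rogue decoders:

```java
import java.lang.ref.SoftReference;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class DecoderCache {
    // Cache the most recently used decoder per thread, softly, so
    // repeated decoding with the same charset skips newDecoder().
    private static final ThreadLocal<SoftReference<CharsetDecoder>> cache =
            new ThreadLocal<>();

    static CharsetDecoder decoderFor(Charset cs) {
        SoftReference<CharsetDecoder> ref = cache.get();
        CharsetDecoder dec = (ref != null) ? ref.get() : null;
        if (dec == null || !dec.charset().equals(cs)) {
            dec = cs.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPLACE)
                    .onUnmappableCharacter(CodingErrorAction.REPLACE);
            cache.set(new SoftReference<>(dec));
        }
        dec.reset();   // decoders are stateful; reset before reuse
        return dec;
    }
}
```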
participants (8)

- Alan Bateman
- David Lloyd
- Richard Warburton
- Robert Muir
- Stephen Colebourne
- Stuart Marks
- Ulf Zibis
- Xueming Shen