Lower overhead String encoding/decoding

Thu Sep 25 16:44:39 UTC 2014

On 9/25/14 9:24 AM, Xueming Shen wrote:
> Hi Richard, couple comments after a quick scan.
>
> (1)  #474, the IOBE probably is no longer needed in case of 
> ByteBuffer. parameter.
>
> (2) for methods that have the ByteBuffer as input, it would be 
> desirable to specify clearly that
>      the bytes are read from "position" to "limit", and whether or not 
> the "position" will be advanced
>      after decoding/encoding (important).
>
> (3) getBytes(byte[], offset, charset)
>      while I understand it might be useful to have such a method in 
> certain circumstance, it is
>      usually complicated to make it right/easy for user to actually 
> use it. Consider the fact that
>      the returned number of bytes encoded has no straightforward 
> connection to how many
>      underlying chars have been encoded (so the user of this method 
> really have no idea how
>      many underlying "chars" have been encoded into the dest buffer, 
> is that enough? how big
>      the buffer need to be to encode the whole string? ....) 
> especially the possibility that the last
>      couple byte space might be short of encoding a "char". Not like 
> the getChars(), which has
>      a easy, clear and direct link between the out going chars and 
> underlying chars. I would
>      suggest it might be better to leave it out.

I would assume this is true for the ByteBuffer version as well...just 
doubt the usefulness of
a method that it might only give you the bytes of part of the target 
string object. Given the
getBytes(ByteBuffer) version has the additional position/limit info form 
the buffer itself, a
"workaround" is to return the "number of underlying chars copied"...then 
you need one more
parameter to let user to copy from "index' for next round.

-sherman

>
> (4) StringCoding.decode() #239  "remaining()" should be used to return 
> limit - position?
>
> (5) in case of "untrusted", it might be more straightforward to get 
> all "bytes" out of the buffer
>      first (you are allocating a byte buffer here anyway, I don;t see 
> obvious benefit to get a
>      direct buffer, btw) and then pass it to the original/existing 
> byte[]->char[] decoding
>      implementation. We probably will take a deep look at the 
> implementation later when
>      the public api settled.
>
> -Sherman
>
>
> Richard Warburton wrote:
>> Hi Alan,
>>
>> Thanks for the feedback.
>>
>> The direction seems reasonable but I wonder about the offset/length (and
>>> destOffet) parameters as this isn't consistent with how ByteBuffers 
>>> were
>>> originally intended to be used. That is, when you read the bytes 
>>> from the
>>> wire into a ByteBuffer and flip it then the position and limit will 
>>> delimit
>>> the bytes to be decoded.
>>>
>>> If the constructor is changed to String(ByteBuffer in, Charset cs) and
>>> decodes the remaining bytes in the buffer to a String using the 
>>> specified
>>> Charset then I think would be more consistent. Also I think this 
>>> would give
>>> you a solution to the underflow case.
>>>
>> I've updated the webrev to reflect this, removing the offset and length
>> parameters and using position() and limit() instead.
>>
>> http://cr.openjdk.java.net/~rwarburton/string-patch-webrev-6/
>>
>> Similarly if getBytes is replaced with with a getBytes or
>>> encode(ByteBuffer, Charset cs) then then it would encode as many 
>>> characters
>>> as possible into the output buffer and I think would be more 
>>> consistent and
>>> also help with overflow case.
>>>
>> I've also applied the this to the getBytes() method. I chose the 
>> getBytes()
>> method name for consistency with the existing getBytes() method that
>> returns a byte[]. To my mind encode() is a more natural name for the
>> method, which you mention in your email, do people have a preference 
>> here?
>>
>> regards,
>>
>>    Richard Warburton
>>
>>    http://insightfullogic.com
>>    @RichardWarburto <http://twitter.com/richardwarburto>
>