RFR [8058875]: CharsetEncoder.maxBytesPerChar() should return 4 for UTF-8

Mark Thomas markt at apache.org
Mon Sep 22 21:34:53 UTC 2014


On 22/09/2014 22:23, Martin Buchholz wrote:
> I think you are mistaken. It's maxBytesPerChar, not maxBytesPerCodepoint!

You are going to have to explain that some more. The Javadoc for
CharsetEncoder.maxBytesPerChar() is explicit:
<quote>
Returns the maximum number of bytes that will be produced for each
character of input.
</quote>

For UTF-8 that number is 4, not 3. A quick look at the source for the
default UTF-8 encoder confirms that there are cases where it will output
4 bytes for a single input character.
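
For anyone who wants to check the numbers being argued over, here is a
minimal sketch (not taken from the webrev; the class and method names are
made up for illustration). It encodes one BMP character and one
supplementary character and prints the byte count next to both the char
count and the code point count, so the "per char" versus "per code point"
reading can be compared directly. It assumes Java 7 or later for
StandardCharsets.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the JDK or of the proposed fix.
public class Utf8BytesPerChar {

    // Number of bytes produced when the whole string is encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Prints UTF-16 code units, code points, and UTF-8 bytes for one string.
    static void report(String label, String s) {
        System.out.println(label
                + ": chars=" + s.length()
                + ", codePoints=" + s.codePointCount(0, s.length())
                + ", utf8Bytes=" + utf8Length(s));
    }

    public static void main(String[] args) {
        // U+20AC (euro sign) lies in the BMP: one Java char.
        report("U+20AC", "\u20AC");

        // U+1F600 lies outside the BMP: a surrogate pair, two Java chars.
        report("U+1F600", new String(Character.toChars(0x1F600)));

        // The value under discussion; what it prints depends on the JDK build.
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        System.out.println("maxBytesPerChar=" + encoder.maxBytesPerChar());
    }
}
```

The supplementary character comes out as 4 bytes for 1 code point but 2
Java chars, so whether 4 counts as "per character" depends on which of
those two units the Javadoc means.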

Mark


> 
> 
> changeset:   3116:b44704ce8a08
> user:        sherman
> date:        2010-11-19 12:58 -0800
> 6957230: CharsetEncoder.maxBytesPerChar() reports 4 for UTF-8; should be 3
> Summary: changed utf-8's CharsetEncoder.maxBytesPerChar to 3
> Reviewed-by: alanb
> 
> 
> On Mon, Sep 22, 2014 at 1:14 PM, Ivan Gerasimov <ivan.gerasimov at oracle.com>
> wrote:
> 
>> Hello!
>>
>> The UTF-8 encoding allows characters that are 4 bytes long.
>> However, CharsetEncoder.maxBytesPerChar() currently returns 3.0, which is
>> not always enough.
>>
>> Would you please review the simple fix for this issue?
>>
>> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8058875
>> WEBREV: http://cr.openjdk.java.net/~igerasim/8058875/0/webrev/
>>
>> Sincerely yours,
>> Ivan
>>



