RFR [8058875]: CharsetEncoder.maxBytesPerChar() should return 4 for UTF-8

Mon Sep 22 21:42:43 UTC 2014

On 22/09/2014 22:46, Xueming Shen wrote:
> On 09/22/2014 01:14 PM, Ivan Gerasimov wrote:
>> Hello!
>>
>> The UTF-8 encoding allows characters that are 4 bytes long.
>> However, CharsetEncoder.maxBytesPerChar() currently returns 3.0, which
>> is not always enough.
>>
>> Would you please review the simple fix for this issue?
>>
>> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8058875
>> WEBREV: http://cr.openjdk.java.net/~igerasim/8058875/0/webrev/
>>
>> Sincerely yours,
>> Ivan
> 
> The "character" in the nio Charset and CharsetDecoder/CharsetEncoder is
> specified as a "sixteen-bit Unicode code unit", so it is reasonable to
> interpret the "character" in "maximum number of bytes that will be
> produced for each character of input" to be the Java "char" as well. In
> the case of UTF-8, each supplementary character that takes the 4-byte
> form is always represented as 2 surrogate chars, so it works out to
> "2 bytes per char". Do we have a real escalation that complains about
> this?

Ah. Got it. I see now. There are single chars that will result in 3
bytes of output, but getting 4 bytes of output requires 2 chars of input.

In which case the current value of 3.0 makes sense.
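
For anyone who wants to check the arithmetic, here is a minimal sketch. The
class name Utf8BytesPerChar and the helper utf8Length are made up for
illustration and are not part of the webrev. It compares the number of Java
chars with the number of UTF-8 bytes for a BMP character and for a
supplementary character.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not taken from the JDK sources or the webrev.
class Utf8BytesPerChar {

    // Number of bytes produced when the whole string is encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
        // The value under discussion; this thread reports it as 3.0.
        System.out.println("maxBytesPerChar = " + enc.maxBytesPerChar());

        String bmp = "\u20AC";         // EURO SIGN: one char, 3 bytes in UTF-8
        String supp = "\uD83D\uDE00";  // U+1F600: surrogate pair, 4 bytes in UTF-8

        for (String s : new String[] { bmp, supp }) {
            int chars = s.length();    // counts 16-bit code units, not code points
            int bytes = utf8Length(s);
            System.out.println(chars + " char(s) -> " + bytes + " byte(s), "
                    + ((double) bytes / chars) + " bytes per char");
        }
    }
}
```

The supplementary character comes out at 4 bytes for 2 chars, i.e. 2 bytes
per char, so the worst case per char is still the 3-byte BMP case.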

Cheers,

Mark