RFR [8058875]: CharsetEncoder.maxBytesPerChar() should return 4 for UTF-8

Tue Sep 23 14:58:54 UTC 2014

This response confuses me.  Are you saying that the UTF-8 encoder is not really producing UTF-8?  RFC 2279 and RFC 3629 both clearly state that surrogates must be combined to form a 32-bit value, which is then encoded as a 4-byte sequence.  In fact, the RFCs refer to the definition of the alternate encoding CESU-8, which encodes each half of the surrogate pair as a 3-byte UTF-8 sequence.

I guess returning 3.0 for maxBytesPerChar works for the purpose of allocating a big enough byte buffer, but the explanation in this thread is confusing.
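
To make the buffer-sizing point concrete, here is a minimal sketch, assuming a stock JDK UTF-8 encoder. The class and method names (Utf8MaxBytesDemo, encodedLength, worstCaseBytes) are made up for illustration and are not part of the JDK. It encodes one BMP character and one supplementary character, and compares the real byte count with the bound derived from maxBytesPerChar().

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class Utf8MaxBytesDemo {

    // Hypothetical helper: number of UTF-8 bytes actually produced for s.
    static int encodedLength(String s) {
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
        try {
            // encode() returns a buffer whose remaining bytes are the output.
            ByteBuffer out = enc.encode(CharBuffer.wrap(s));
            return out.remaining();
        } catch (CharacterCodingException e) {
            // Only reached for malformed input such as an unpaired surrogate.
            throw new IllegalArgumentException(e);
        }
    }

    // Hypothetical helper: worst-case size computed per Java char (UTF-16 code unit).
    static int worstCaseBytes(String s) {
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
        return (int) Math.ceil(s.length() * (double) enc.maxBytesPerChar());
    }

    public static void main(String[] args) {
        String euro = "\u20AC";                                  // one char, 3 bytes in UTF-8
        String emoji = new String(Character.toChars(0x1F600));   // one code point, two chars

        for (String s : new String[] { euro, emoji }) {
            System.out.println("chars=" + s.length()
                    + " codePoints=" + s.codePointCount(0, s.length())
                    + " bytes=" + encodedLength(s)
                    + " bound=" + worstCaseBytes(s));
        }
    }
}
```

The supplementary character is still written as a single 4-byte UTF-8 sequence, not as CESU-8. It just occupies two chars on the input side, so the per-char bound covers it.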

Tom Salter

------------------------------
Date: Tue, 23 Sep 2014 11:37:07 +0400
From: Ivan Gerasimov <ivan.gerasimov at oracle.com>
To: Xueming Shen <xueming.shen at oracle.com>,	Martin Buchholz
	<martinrb at google.com>
Cc: nio-dev at openjdk.java.net, core-libs-dev
	<core-libs-dev at openjdk.java.net>
Subject: Re: RFR [8058875]: CharsetEncoder.maxBytesPerChar() should
	return	4 for UTF-8
Message-ID: <54212323.5080907 at oracle.com>
Content-Type: text/plain; charset=UTF-8; format=flowed

Martin, Sherman, thanks for the clarification!

Closing the bug as not a bug.

> The "character" in the nio Charset and CharDe/Encoder is specified as
> "sixteen-bit Unicode code unit", so it is reasonable to interpret the
> "character" in the "maximum number of bytes that will be produced for
> each character of input" to be the Java "char" as well. In case of
> UTF8, each 4-byte form supplementary character is always coded into 2
> surrogate chars, it's "2 byte per char".

> Do we have a real escalation that complains about this?
>
Yes, the link is on the bug page:
https://bugs.openjdk.java.net/browse/JDK-8058875
I'm going to try to explain what I've just realized about this function :-)

Sincerely yours,
Ivan