JDK 11 RFR of JDK-8196995: java.lang.Character should not state UTF-16 encoding is used for strings

Thu Feb 8 19:13:33 UTC 2018

On 2/8/18, 10:59 AM, joe darcy wrote:
> Hello,
>
> On 2/8/2018 3:53 AM, Alan Bateman wrote:
>> On 07/02/2018 22:12, joe darcy wrote:
>>> Hello,
>>>
>>> Text in java.lang.Character states a UTF-16 character encoding is 
>>> used for java.lang.String. While this was true for many years, it is 
>>> not necessarily true and not true in practice as of JDK 9 due to the 
>>> improvements from JEP 254: Compact Strings.
>>>
>>> The statement about the encoding should be corrected.
>>>
>>> Please review the patch below which does this. (I've formatted the 
>>> patch so that the change in text is made clear; I'll re-flow the 
>>> paragraph before pushing.)
>> I'm not sure that this is worth changing. You could replace "classes" 
>> with "API" and add a note to say that an implementation may use a 
>> more optimized representation but I don't think it's really needed.
>>
>
> In response to this feedback and others, how about:
>
>      [...] The Java
>   * platform uses the UTF-16 representation in {@code char} arrays and
> - * in the {@code String} and {@code StringBuffer} classes. In
> + * presents a UTF-16 model in the string-related API.
>
> IMO anyway, I think saying "uses a UTF-16 representation for String" 
> is at best misleading with the current implementation since 8 != 16 
> for the compressed representation that is used for all Latin-1 strings.
>

Well, encoding/charset is the concept of a mapping between a character
and a corresponding code point value. We are still using the UTF-16
encoding scheme to represent a character in the JVM. How those UTF-16
values are represented/stored in the String class is an implementation
detail. A "char" is 16 bits; the String class stores 1 byte per char for
"latin1" strings (still in the Unicode charset) and 2 bytes per char for
the rest.
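
To make the API-versus-storage distinction concrete, here is a small
sketch. The class and helper names (Utf16ModelDemo, codeUnits,
codePoints) are made up for illustration; only the java.lang.String
and java.lang.Character calls are real API. It shows that length() and
charAt() report UTF-16 code units no matter how the characters happen
to be stored internally.

```java
final class Utf16ModelDemo {
    // Number of UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the whole string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String latin1 = "caf\u00e9";             // all chars are <= U+00FF
        String supplementary = "\uD83D\uDE00";   // U+1F600 as a surrogate pair

        System.out.println(codeUnits(latin1));          // 4
        System.out.println(codePoints(latin1));         // 4
        System.out.println(codeUnits(supplementary));   // 2
        System.out.println(codePoints(supplementary));  // 1

        // charAt() hands back individual UTF-16 code units.
        System.out.println(Character.isHighSurrogate(supplementary.charAt(0))); // true
    }
}
```

Nothing in that output reveals whether the first string is kept in one
byte or two bytes per char.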

As I said in my previous email, the mention of 8859-1 in the JEP might
be causing the confusion. At an early stage of the project we were
really experimenting with different "encodings", including UTF-8. But
the project ended up staying with UTF-16, with a "customized/compressed"
storage mechanism to store the UTF-16 values.
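
A rough model of that storage choice, as I understand it from the
description above: one byte per char when every char is in the Latin-1
range, two bytes per char otherwise. This is only an illustration, not
the JDK's actual code; CompactStorageSketch, fitsLatin1 and
modelStorageBytes are hypothetical names.

```java
final class CompactStorageSketch {
    // True if every char fits in one byte (U+0000..U+00FF), the Latin-1 range.
    static boolean fitsLatin1(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    // Bytes of character data under this simplified model:
    // 1 byte per char for Latin-1 strings, 2 bytes per char for the rest.
    static int modelStorageBytes(String s) {
        return fitsLatin1(s) ? s.length() : 2 * s.length();
    }

    public static void main(String[] args) {
        String latin1 = "caf\u00e9";        // 4 chars, all <= U+00FF
        String mixed = "caf\u00e9\u20ac";   // 5 chars, U+20AC is outside Latin-1

        System.out.println(modelStorageBytes(latin1)); // 4
        System.out.println(modelStorageBytes(mixed));  // 10
    }
}
```

Note that a single char above U+00FF puts the whole string on the
2-byte path in this model; the values stored are UTF-16 values either
way.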

-Sherman