JDK 11 RFR of JDK-8196995: java.lang.Character should not state UTF-16 encoding is used for strings

Thu Feb 8 22:28:01 UTC 2018

Since other people who work more closely in the area than me don't seem 
to find the wording confusing or misleading, I'll just close out the bug.

Thanks,

-Joe

On 2/8/2018 11:13 AM, Xueming Shen wrote:
> On 2/8/18, 10:59 AM, joe darcy wrote:
>> Hello,
>>
>> On 2/8/2018 3:53 AM, Alan Bateman wrote:
>>> On 07/02/2018 22:12, joe darcy wrote:
>>>> Hello,
>>>>
>>>> Text in java.lang.Character states a UTF-16 character encoding is 
>>>> used for java.lang.String. While this was true for many years, it is 
>>>> not necessarily true and not true in practice as of JDK 9 due to the 
>>>> improvements from JEP 254: Compact Strings.
>>>>
>>>> The statement about the encoding should be corrected.
>>>>
>>>> Please review the patch below which does this. (I've formatted the 
>>>> patch so that the change in text is made clear; I'll re-flow the 
>>>> paragraph before pushing.)
>>> I'm not sure that this is worth changing. You could replace 
>>> "classes" with "API" and add a note to say that an implementation 
>>> may use a more optimized representation but I don't think it's 
>>> really needed.
>>>
>>
>> In response to this feedback and others, how about:
>>
>>      [...] The Java
>>   * platform uses the UTF-16 representation in {@code char} arrays and
>> - * in the {@code String} and {@code StringBuffer} classes. In
>> + * presents a UTF-16 model in the string-related API.
>>
>> IMO anyway, I think saying "uses a UTF-16 representation for String" 
>> is at best misleading with the current implementation, since 8 != 16 
>> for the compressed representation that is used for all Latin-1 strings.
>>
>
> Well, encoding/charset is the concept of a mapping between a character
> and a corresponding code point value. We are still using the UTF-16
> encoding scheme to represent a character in the JVM. How that UTF-16
> code point value is represented/stored in the String class is an
> implementation detail: 16 bits for "char", and in the String class
> 1 byte for "latin1" (still in the Unicode charset) + 2 bytes for the
> rest.
>
> As I said in my previous email, the mention of 8859-1 in the JEP might
> be the cause of the confusion. At an early stage of the project we were
> really experimenting with using different "encodings", including UTF-8.
> But the project ended up staying with UTF-16, with a
> "customized/compressed" storage mechanism to store the UTF-16 code
> point value.
>
> -Sherman
>
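
The distinction drawn in the thread (a UTF-16 model in the API versus a possibly compressed internal storage) can be illustrated with a small sketch. The class and method names below are hypothetical and are not part of the patch under review. The sketch relies only on public java.lang.String methods; whether a given runtime actually stores a string in the one-byte form is an implementation detail that this code does not and cannot observe. The isLatin1Only helper merely mirrors the condition described above (every char value fits in one byte).

```java
// Hypothetical demo class; not taken from the JDK sources or the patch.
class Utf16ModelDemo {

    // Length as reported by the API: a count of UTF-16 code units.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points, counting a surrogate pair as one.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // True when every char fits in 8 bits. This is the condition under
    // which a compact one-byte-per-char storage would be possible; it
    // says nothing about what the running JVM actually does.
    static boolean isLatin1Only(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String latin = "caf\u00e9";                // four Latin-1 characters
        String supplementary = "a\uD83D\uDE00";    // 'a' followed by U+1F600

        for (String s : new String[] { latin, supplementary }) {
            System.out.println("code units = " + codeUnits(s)
                    + ", code points = " + codePoints(s)
                    + ", Latin-1 only = " + isLatin1Only(s));
        }
    }
}
```

For the Latin-1 string, the API still answers in terms of 16-bit char values; for the supplementary character, charAt exposes the two surrogates separately. Both behaviours are consistent with the "presents a UTF-16 model in the string-related API" wording, independent of how the bytes are held internally.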