JDK 11 RFR of JDK-8196995: java.lang.Character should not state UTF-16 encoding is used for strings
Joseph D. Darcy
joe.darcy at oracle.com
Thu Feb 8 22:28:01 UTC 2018
Since other people who work more closely in the area than me don't seem
to find the wording confusing or misleading, I'll just close out the bug.
Thanks,
-Joe
On 2/8/2018 11:13 AM, Xueming Shen wrote:
> On 2/8/18, 10:59 AM, joe darcy wrote:
>> Hello,
>>
>> On 2/8/2018 3:53 AM, Alan Bateman wrote:
>>> On 07/02/2018 22:12, joe darcy wrote:
>>>> Hello,
>>>>
>>>> Text in java.lang.Character states a UTF-16 character encoding is
>>>> used for java.lang.String. While this was true for many years, it is not
>>>> necessarily true and not true in practice as of JDK 9 due to the
>>>> improvements from JEP 254: Compact Strings.
>>>>
>>>> The statement about the encoding should be corrected.
>>>>
>>>> Please review the patch below which does this. (I've formatted the
>>>> patch so that the change in text is made clear; I'll re-flow the
>>>> paragraph before pushing.)
>>> I'm not sure that this is worth changing. You could replace
>>> "classes" with "API" and add a note to say that an implementation
>>> may use a more optimized representation but I don't think it's
>>> really needed.
>>>
>>
>> In response to this feedback and others, how about:
>>
>> [...] The Java
>> * platform uses the UTF-16 representation in {@code char} arrays and
>> - * in the {@code String} and {@code StringBuffer} classes. In
>> + * presents a UTF-16 model in the string-related API.
>>
>> IMO anyway, I think saying "uses a UTF-16 representation for String"
>> is at best misleading with the current implementation since 8 != 16
>> for the compressed representation that is used for all Latin-1 strings.
>>
>
> Well, encoding/charset is the concept of a mapping between a character
> and a corresponding code point value. We are still using the UTF-16
> encoding scheme to represent a character in the JVM. How to
> represent/store that UTF-16 code point value in the String class is an
> implementation detail: 16 bits for a "char", and in the String class
> 1 byte per character for "latin1" strings (still in the Unicode
> charset) and 2 bytes per character for the rest.
>
> As I said in my previous email, the mention of 8859-1 in the JEP might
> cause the confusion. At an early stage of the project we were really
> experimenting with using different "encodings", including UTF-8. But the
> project ended up staying with UTF-16, with a "customized/compressed"
> storage mechanism to store the UTF-16 code point value.
>
> -Sherman
>
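The distinction debated in this thread, between the UTF-16 model that the String API presents and the storage an implementation may choose, can be sketched in Java. This is an illustration only and not the JDK's actual implementation: the class name Utf16ModelDemo and the helpers fitsLatin1 and compactStorageBytes are hypothetical, and the "1 byte per char if everything fits in Latin-1, otherwise 2 bytes per char" rule is a simplified assumption taken from the description of compact strings given in the messages above.

```java
public class Utf16ModelDemo {

    // Hypothetical helper: true if every UTF-16 code unit of s is <= 0xFF,
    // i.e. the string could be stored with one byte per char (Latin-1).
    public static boolean fitsLatin1(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    // Hypothetical model of a compact backing store (an assumption, not the
    // JDK source): 1 byte per char for Latin-1 strings, 2 bytes per char otherwise.
    public static int compactStorageBytes(String s) {
        return fitsLatin1(s) ? s.length() : 2 * s.length();
    }

    public static void main(String[] args) {
        String latin = "caf\u00e9";               // 4 chars, all within Latin-1
        String mixed = "caf\u00e9 \uD83D\uDE00";  // ends with U+1F600 as a surrogate pair

        // The API reports UTF-16 code units regardless of the storage chosen.
        System.out.println("latin length (code units): " + latin.length());
        System.out.println("mixed length (code units): " + mixed.length());
        System.out.println("mixed code points: "
                + mixed.codePointCount(0, mixed.length()));

        // charAt exposes the individual surrogate; codePointAt combines the pair.
        System.out.println("high surrogate at index 5: "
                + Character.isHighSurrogate(mixed.charAt(5)));
        System.out.println("code point at index 5: U+"
                + Integer.toHexString(mixed.codePointAt(5)).toUpperCase());

        // The modeled storage size differs, but it is invisible through the API.
        System.out.println("latin modeled bytes: " + compactStorageBytes(latin));
        System.out.println("mixed modeled bytes: " + compactStorageBytes(mixed));
    }
}
```

In this sketch the values returned by length(), charAt() and codePointAt() are the same whichever storage size the model picks, which is the sense in which the API "presents a UTF-16 model" while the representation remains an implementation detail.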