<i18n dev> Codereview request for 7096080: UTF8 update and new CESU-8 charset

Thu Sep 29 15:27:46 PDT 2011

On 09/29/2011 02:16 PM, Ulf Zibis wrote:
> Please use spaces with ternary operators: Lines 155, 216
>
> For short you could use sr instead of srcRemaining, consistent with sa, sp, sl.
>
>  420         // returns -1 if there is malformed byte(s) and the
> better:
>  420         // returns -1 if there is/are malformed byte(s) and the
>
>  466                             sp -=3;
> There should be a space:  sp -= 3;

Webrev has been updated accordingly.

>
>  280                     if (Character.isSurrogate(c))
>  281                         return malformedForLength(src, sp, dst, dp, 3);
> Shouldn't we return cr.length() = 1 to allow the remaining 2 bytes to be
> interpreted again?
>

Actually, I don't know the answer. My reading of D93a/D93b suggests that we
might interpret it as a whole, given that the bytes are actually in the
well-formed byte pattern range listed in Table 3.7 and are "ill-formed" simply
because they encode a surrogate value rather than a scalar value, so I would
interpret the whole 3 bytes as a maximal subpart. Given that D93a/b is "best
practices for using U+FFFD", either way should be fine. We do have Unicode
experts on the list, so maybe they can share their opinion on what the
"desired"/recommended behavior is in this case, from the Standard's point of
view?
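
For anyone who wants to see what a given runtime actually reports for this
case, here is a minimal sketch. The class and method names
(SurrogateDecodeProbe, probe) are made up for illustration and are not part of
the webrev. It feeds the 3-byte encoding of the surrogate U+D800 (ED A0 80) to
a UTF-8 decoder configured with CodingErrorAction.REPORT and prints the
CoderResult. The reported length may differ between JDK versions, so the sketch
prints it rather than assuming a value.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class SurrogateDecodeProbe {

    // Decodes the bytes with a UTF-8 decoder that reports malformed input
    // instead of replacing it, and returns the first CoderResult.
    static CoderResult probe(byte[] bytes) {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer out = CharBuffer.allocate(bytes.length);
        return dec.decode(in, out, true);
    }

    public static void main(String[] args) {
        // ED A0 80 is the 3-byte pattern that would encode the surrogate U+D800.
        byte[] loneSurrogate = { (byte) 0xED, (byte) 0xA0, (byte) 0x80 };
        CoderResult cr = probe(loneSurrogate);
        System.out.println("malformed = " + cr.isMalformed());
        if (cr.isMalformed()) {
            // length() is only valid for error results, hence the guard.
            System.out.println("length    = " + cr.length());
        }
    }
}
```

A reported length of 3 corresponds to treating the whole sequence as one
maximal subpart; a length of 1 corresponds to Ulf's suggestion of re-reading
the remaining 2 bytes.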

>
> On 29.09.2011 05:27, Xueming Shen wrote:
>> Hi,
>>
>> On 9/28/2011 3:44 PM, Ulf Zibis wrote:
>>> 5. IMHO charset CESU-8 should be hosted in extended-charsets, 
>>> otherwise it should be added to java.nio.StandardCharsets
>>>
>>
>> We have lots of charsets provided via the "standard charset provider" 
>> (in rt.jar) but not listed in StandardCharsets.
> Yes, but the reason to add CESU-8 to StandardCharsets was the
> supposed demand to treat all Unicode charsets equivalently.
>
> Otherwise there is no obstacle to host CESU-8 in extended-charsets.
> IMHO, CESU-8 addresses corner case compatibility issues, but not 
> "standard" requirements.

Putting CESU-8 into the "standard charset provider" (which is only an
implementation detail) does not mean it is a "standard" requirement; it just
means it is bundled into rt.jar. The reason I put it there is to make sure it
ships together with UTF-8, on the assumption that you might need it around
when using the updated UTF-8, which no longer handles those 3/6-byte
surrogates.
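
To illustrate the difference, here is a small sketch that encodes one
supplementary character (U+10400) with both charsets. It assumes a runtime
where Charset.forName("CESU-8") resolves, i.e. one that includes the charset
proposed in this change; the class name Cesu8VsUtf8 and the hex helper are
made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
final class Cesu8VsUtf8 {

    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: the runtime provides a charset named "CESU-8".
        Charset cesu8 = Charset.forName("CESU-8");

        // U+10400 is the surrogate pair D801 DC00 in UTF-16.
        String s = "\uD801\uDC00";

        // UTF-8 encodes the code point itself in 4 bytes.
        System.out.println("UTF-8 : " + hex(s.getBytes(StandardCharsets.UTF_8)));
        // prints "UTF-8 : F0 90 90 80"

        // CESU-8 encodes each surrogate separately, 3 bytes each.
        System.out.println("CESU-8: " + hex(s.getBytes(cesu8)));
        // prints "CESU-8: ED A0 81 ED B0 80"

        // The 6-byte form round-trips through the CESU-8 decoder.
        System.out.println(new String(s.getBytes(cesu8), cesu8).equals(s));
    }
}
```

Data that already contains the 6-byte form is the case where you would want
CESU-8 available next to the stricter UTF-8.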

-Sherman
