Codereview request for 7096080: UTF8 update and new CESU-8 charset

Sun Oct 2 21:36:36 UTC 2011

Hi again,

On 30.09.2011 00:27, Xueming Shen wrote:
> On 09/29/2011 02:16 PM, Ulf Zibis wrote:
>>
>>  280                     if (Character.isSurrogate(c))
>>  281                         return malformedForLength(src, sp, dst, dp, 3);
>> Shouldn't we return cr.length() = 1 to allow the remaining 2 bytes to be interpreted again?
>>
Forget it! If c is a surrogate, b2 is in the range A0..BF and b3 is in the range 80..BF. Neither of
them can be a well-formed first byte, so interpreting the remaining 2 bytes again would gain nothing.
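
To double-check the ranges, here is a small sketch that walks all surrogate code units and computes
the three bytes of their 3-byte form. The class and method names are made up for illustration; they
are not part of the webrev.

```java
public class SurrogateBytes {
    // True if b (an unsigned byte value 0..255) has the 10xxxxxx continuation
    // pattern, which means it can never start a well-formed UTF-8 sequence.
    public static boolean isContinuation(int b) {
        return (b & 0xC0) == 0x80;
    }

    // Computes the 3-byte form of every surrogate code unit D800..DFFF and
    // checks that b1 is ED, b2 is in A0..BF and b3 is in 80..BF.
    public static boolean allSurrogatesInExpectedRanges() {
        for (int c = 0xD800; c <= 0xDFFF; c++) {
            int b1 = 0xE0 | (c >> 12);
            int b2 = 0x80 | ((c >> 6) & 0x3F);
            int b3 = 0x80 | (c & 0x3F);
            if (b1 != 0xED) return false;
            if (b2 < 0xA0 || b2 > 0xBF) return false;
            if (b3 < 0x80 || b3 > 0xBF) return false;
            // Both trailing bytes are continuation bytes, never lead bytes.
            if (!isContinuation(b2) || !isContinuation(b3)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allSurrogatesInExpectedRanges()); // prints "true"
    }
}
```

Since b2 and b3 always carry the 10xxxxxx pattern, restarting the decoder at either of them could
only yield another malformed result.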

> Actually I don't know the answer. My reading of D93a/D93b suggests that we might
> interpret it as a whole, given the bytes are actually in the well-formed byte pattern range
> listed in Table 3-7, but "ill-formed" simply because they are a surrogate value, not a scalar
> value, so I would interpret the whole 3 bytes as a maximal subpart. Given D93a/b is
> "best practices for using U+FFFD", either way should be fine. We do have Unicode experts
> on the list, so maybe they can share their opinion on what is the "desired"/recommended
> behavior in this case, from the Standard's point of view?
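
For anyone who wants to see what a given JDK build actually reports here, a probe along these lines
might help. It only uses the public CharsetDecoder API; the class name MalformedLengthProbe is made
up for illustration, and the reported length depends on the decoder implementation in use, so no
particular value is claimed.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class MalformedLengthProbe {
    // Decodes the given bytes with REPORT so that malformed input is
    // returned as a CoderResult instead of being replaced by U+FFFD.
    public static CoderResult probe(byte[] bytes) {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        return dec.decode(ByteBuffer.wrap(bytes), CharBuffer.allocate(8), true);
    }

    public static void main(String[] args) {
        // ED A0 80 is the 3-byte form of the surrogate U+D800.
        CoderResult cr = probe(new byte[] { (byte) 0xED, (byte) 0xA0, (byte) 0x80 });
        System.out.println(cr);
        if (cr.isError()) {
            // length() is only defined for error results.
            System.out.println("malformed length = " + cr.length());
        }
    }
}
```

With REPLACE instead of REPORT, the number of U+FFFD characters in the output follows from this
length: one replacement for each malformed result the decoder returns.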

At line 102 you could insert:
         //  [E0]     [A0..BF]
         //  [E1..EF] [80..BF]
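
Read as code, the two comment lines amount to roughly the following second-byte check. This is only
a sketch; the helper secondByteInRange is hypothetical and is not the method in the webrev. Note
that it does not narrow ED to 80..9F as Table 3-7 does, on the assumption that the
Character.isSurrogate(c) test quoted above covers that case.

```java
public class ThreeByteSecondByte {
    // b1 and b2 are unsigned byte values (0..255).
    // Returns true if b2 is an acceptable second byte after the 3-byte lead b1,
    // following the two ranges from the suggested comment.
    public static boolean secondByteInRange(int b1, int b2) {
        if (b1 == 0xE0) {
            return b2 >= 0xA0 && b2 <= 0xBF;      //  [E0]     [A0..BF]
        }
        if (b1 >= 0xE1 && b1 <= 0xEF) {
            return b2 >= 0x80 && b2 <= 0xBF;      //  [E1..EF] [80..BF]
        }
        return false;                             // not a 3-byte lead byte
    }

    public static void main(String[] args) {
        System.out.println(secondByteInRange(0xE0, 0x9F)); // overlong form, prints "false"
        System.out.println(secondByteInRange(0xE0, 0xA0)); // prints "true"
    }
}
```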

-Ulf