Codereview request for 7096080: UTF8 update and new CESU-8 charset
Ulf Zibis
Ulf.Zibis at gmx.de
Sun Oct 2 21:36:36 UTC 2011
Hi again,
On 30.09.2011 00:27, Xueming Shen wrote:
> On 09/29/2011 02:16 PM, Ulf Zibis wrote:
>>
>> 280 if (Character.isSurrogate(c))
>> 281 return malformedForLength(src, sp, dst, dp, 3);
>> Shouldn't we return cr.length() = 1 to allow the remaining 2 bytes to be interpreted again?
>>
Forget it! If c is a surrogate, b2 is in the range A0..BF and b3 is in the range 80..BF. Neither of
them can be a well-formed first byte, so interpreting them again would gain nothing.
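
To double-check the ranges above, here is a minimal sketch. The class name SurrogateBytes and the helper encode3 are made up for illustration and are not part of the webrev. It applies the plain 3-byte bit pattern to every surrogate value and verifies that the result is always ED [A0..BF] [80..BF], i.e. that b2 and b3 are continuation bytes:

```java
class SurrogateBytes {
    // Hypothetical helper: apply the plain 3-byte UTF-8 bit pattern to a
    // value in 0x800..0xFFFF without rejecting surrogates.
    static int[] encode3(int c) {
        return new int[] {
            0xE0 | (c >> 12),
            0x80 | ((c >> 6) & 0x3F),
            0x80 | (c & 0x3F)
        };
    }

    // Bytes in 80..BF are continuation bytes and never start a sequence.
    static boolean isContinuation(int b) {
        return (b & 0xC0) == 0x80;
    }

    // True if every surrogate maps to ED [A0..BF] [80..BF].
    static boolean allSurrogatesMatch() {
        for (int c = Character.MIN_SURROGATE; c <= Character.MAX_SURROGATE; c++) {
            int[] b = encode3(c);
            boolean ok = b[0] == 0xED
                    && b[1] >= 0xA0 && b[1] <= 0xBF
                    && isContinuation(b[1])
                    && isContinuation(b[2]);
            if (!ok) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allSurrogatesMatch()); // prints "true"
    }
}
```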
> Actually I don't know the answer. My reading of D93a/D93b suggests that we might
> interpret it as a whole, given the bytes are actually in the well-formed byte pattern range
> listed in Table 3.7, but "ill-formed" simply because they are surrogate values, not scalar
> values, so I would interpret the whole 3 bytes as a maximal subpart. Given D93a/b is
> "best practices for Using U+fffd", either way should be fine. We do have Unicode experts
> on the list, so maybe they can share their opinion on what is the "desired"/recommended
> behavior in this case, from the Standard's point of view?
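
To see what a given JDK actually reports for this case, a small probe like the following might help. It is only a sketch: the class name MalformedLengthProbe and the helper probe are made up for illustration, and the reported length depends on the JDK build being run, so I show no expected output.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

class MalformedLengthProbe {
    // Decode the bytes with a fresh UTF-8 decoder. A new decoder uses the
    // REPORT action, so malformed input comes back as a CoderResult.
    static CoderResult probe(byte[] bytes) {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder();
        return dec.decode(ByteBuffer.wrap(bytes), CharBuffer.allocate(16), true);
    }

    public static void main(String[] args) {
        // ED A0 80 is the 3-byte pattern for the surrogate U+D800.
        byte[] surrogate = { (byte) 0xED, (byte) 0xA0, (byte) 0x80 };
        CoderResult cr = probe(surrogate);
        if (cr.isMalformed()) {
            // length() is only valid for error results, hence the guard.
            System.out.println("malformed, length = " + cr.length());
        } else {
            System.out.println(cr);
        }
    }
}
```

With length 3, a decoder set to REPLACE would emit a single U+FFFD for the whole sequence. With length 1, the two trailing continuation bytes would presumably each be reported as malformed in turn, giving three.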
At line 102 you could insert:
// [E0] [A0..BF]
// [E1..EF] [80..BF]
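
Spelled out as code, the two comment lines correspond to a second-byte check roughly like the one below. This is a sketch under my own naming (ThreeByteRanges and isSecondByteInRange are hypothetical), not the code in the webrev:

```java
class ThreeByteRanges {
    // Hypothetical helper mirroring the suggested comment:
    //   [E0]     [A0..BF]
    //   [E1..EF] [80..BF]
    static boolean isSecondByteInRange(int b1, int b2) {
        if (b1 == 0xE0) {
            // 80..9F after E0 would be an overlong encoding.
            return b2 >= 0xA0 && b2 <= 0xBF;
        }
        if (b1 >= 0xE1 && b1 <= 0xEF) {
            return b2 >= 0x80 && b2 <= 0xBF;
        }
        return false; // b1 is not a 3-byte lead byte
    }

    public static void main(String[] args) {
        System.out.println(isSecondByteInRange(0xE0, 0x9F)); // prints "false"
        System.out.println(isSecondByteInRange(0xED, 0xA0)); // prints "true"
    }
}
```

Note that ED [A0..BF] passes this check, which is why the surrogate case still needs the separate Character.isSurrogate(c) test at lines 280/281.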
-Ulf