<i18n dev> UnicodeDecoder U+FFFE handling
Naoto Sato
naoto.sato at oracle.com
Thu Jan 3 21:06:47 UTC 2019
Sounds reasonable. Filed the following issue:
https://bugs.openjdk.java.net/browse/JDK-8216140
Naoto
On 1/1/19 10:06 PM, li.jiang at oracle.com wrote:
> This request sounds reasonable: since Unicode 7, U+FFFE in the middle
> of a stream should not be considered malformed.
>
> FAQ about private use characters and non-characters. [1]
> http://www.unicode.org/faq/private_use.html
>
> Q: Are noncharacters invalid in Unicode strings and UTFs?
> A: Absolutely not. Noncharacters do not cause a Unicode string to be
> ill-formed in any UTF.
>
> Q: So how should libraries and tools handle noncharacters?
> A: Library APIs, components, and tool applications (such as low-level
> text editors) which handle all Unicode strings should also handle
> noncharacters. Often this means simple pass-through, the same way such
> an API or tool would handle a reserved unassigned code point.
>
> Thanks
> Leo
>
> On 12/24/18 3:06 AM, Clément MATHIEU wrote:
>> Hi,
>>
>> I am wondering if UnicodeDecoder handling of U+FFFE is compliant with
>> the current Unicode specification. The suspicious code is:
>>
>>     if (c == REVERSED_MARK) {
>>         // A reversed BOM cannot occur within middle of stream
>>         return CoderResult.malformedForLength(2);
>>     }
>>
>> Up to Unicode 6.3, the Unicode specification said that U+FFFE is a
>> noncharacter and that noncharacters "should never be interchanged".
>> Returning CR_MALFORMED on U+FFFE appears to be correct for Java 8
>> (Unicode 6.2).
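
To check what a given JDK actually does here, a small probe along these lines may help. It is only a sketch: the class and method names (ReversedMarkProbe, probe) are made up for illustration, and it assumes the UTF-16BE decoder is the code path in question. It uses only the public java.nio.charset API. The output depends on the JDK version, so none is shown.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the JDK.
final class ReversedMarkProbe {

    /** Decodes "A", U+FFFE, "B" as UTF-16BE and describes what the decoder reported. */
    static String probe() {
        byte[] input = {0x00, 0x41, (byte) 0xFF, (byte) 0xFE, 0x00, 0x42};

        // REPORT makes the decoder return an error result instead of
        // silently substituting a replacement character.
        CharsetDecoder decoder = StandardCharsets.UTF_16BE.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);

        ByteBuffer in = ByteBuffer.wrap(input);
        CharBuffer out = CharBuffer.allocate(8);
        CoderResult result = decoder.decode(in, out, true);

        if (result.isMalformed()) {
            return "malformed, length " + result.length()
                    + ", at byte " + in.position();
        }
        out.flip();
        return "accepted " + out.remaining() + " chars";
    }

    public static void main(String[] args) {
        System.out.println(System.getProperty("java.version") + ": " + probe());
    }
}
```

A "malformed" result means the decoder still rejects U+FFFE in mid-stream. An "accepted" result means it passed the noncharacter through.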
>>
>> However, Unicode 7 changed that and now says:
>>
>> Applications are free to use any of these noncharacter code
>> points internally. They have no standard interpretation when
>> exchanged outside the context of internal use. However, they are
>> not illegal in interchange, nor does their presence cause Unicode
>> text to be ill-formed. [...] They are not prohibited from
>> occurring in valid Unicode strings which happen to be
>> interchanged. [...] If a noncharacter is received in open
>> interchange, an application is not required to interpret it in
>> any way. It is good practice, however, to recognize it as a
>> noncharacter and to take appropriate action, such as replacing it
>> with U+FFFD replacement character, to indicate the problem in
>> the text. It is not recommended to simply delete noncharacter
>> code points from such text, because of
>> the potential security issues caused by deleting uninterpreted
>> characters.
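
For an application that wants to follow the "replace with U+FFFD" practice quoted above after decoding, a minimal sketch could look like the following. The names (Noncharacters, isNoncharacter, replaceNoncharacters) are hypothetical and this is not what UnicodeDecoder does. It assumes the standard set of 66 noncharacters: U+FDD0..U+FDEF plus the last two code points of each of the 17 planes.

```java
// Hypothetical helper, not part of the JDK.
final class Noncharacters {

    /** True for U+FDD0..U+FDEF and for any code point ending in FFFE or FFFF. */
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    /** Returns a copy of s with every noncharacter replaced by U+FFFD. */
    static String replaceNoncharacters(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        // Iterate by code point so supplementary noncharacters
        // such as U+1FFFE are handled as a single unit.
        s.codePoints().forEach(cp ->
                sb.appendCodePoint(isNoncharacter(cp) ? 0xFFFD : cp));
        return sb.toString();
    }

    public static void main(String[] args) {
        String cleaned = replaceNoncharacters("A\uFFFEB");
        System.out.println(cleaned.equals("A\uFFFDB")); // prints "true"
    }
}
```

Replacing rather than deleting keeps the length in code points unchanged for BMP input, in line with the security concern raised in the quoted text.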
>>
>> See:
>> - http://www.unicode.org/versions/corrigendum9.html
>> - https://www.unicode.org/versions/Unicode11.0.0/ch23.pdf (23.7)
>>
>> Do you think that returning CR_MALFORMED is still OK?
>>
>> Regards,
>> Clément MATHIEU
>>