<i18n dev> UnicodeDecoder U+FFFE handling

Wed Jan 2 06:06:25 UTC 2019

This request sounds reasonable: since Unicode 7, U+FFFE in the middle of 
a stream should no longer be considered malformed.
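
To see what a given JDK does here, a small probe along these lines could 
help. It is only a sketch: the class name ReversedMarkProbe and the probe 
helper are made up for illustration, and the outcome depends on the JDK 
version in use. A decoder with the check quoted below would be expected 
to report malformed input after the first character, while one that 
passes U+FFFE through would decode all three characters.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class ReversedMarkProbe {
    /** Strictly decodes UTF-16BE bytes and describes the outcome (hypothetical helper). */
    static String probe(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_16BE.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        CharBuffer out = CharBuffer.allocate(bytes.length);
        // endOfInput = true: the whole stream is in the buffer.
        CoderResult result = decoder.decode(ByteBuffer.wrap(bytes), out, true);
        out.flip();
        return result.isError()
                ? "error after " + out.length() + " char(s): " + result
                : "decoded " + out.length() + " char(s)";
    }

    public static void main(String[] args) {
        // 'a', U+FFFE, 'b' in UTF-16BE: U+FFFE sits in the middle of the stream.
        byte[] input = {0x00, 0x61, (byte) 0xFF, (byte) 0xFE, 0x00, 0x62};
        System.out.println(probe(input));
    }
}
```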

From the Unicode FAQ on private-use characters and noncharacters [1]:

[1] http://www.unicode.org/faq/private_use.html

Q: Are noncharacters invalid in Unicode strings and UTFs?
A: Absolutely not. Noncharacters do not cause a Unicode string to be 
ill-formed in any UTF.

Q: So how should libraries and tools handle noncharacters?
A: Library APIs, components, and tool applications (such as low-level 
text editors) which handle all Unicode strings should also handle 
noncharacters. Often this means simple pass-through, the same way such 
an API or tool would handle a reserved unassigned code point.
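
As an illustration of the two options a library has, here is a minimal 
sketch, not JDK code. The class Noncharacters and both method names are 
hypothetical. It assumes the usual definition of the 66 noncharacters 
(U+FDD0..U+FDEF plus the last two code points of each of the 17 planes). 
Pass-through means leaving the string untouched; a tool that prefers to 
flag noncharacters can substitute U+FFFD instead of deleting them.

```java
public class Noncharacters {
    /**
     * True for U+FDD0..U+FDEF and for U+nFFFE / U+nFFFF on every plane.
     * Assumes cp is a valid code point (0..0x10FFFF).
     */
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    /** Replaces each noncharacter with U+FFFD; all other code points pass through. */
    static String replaceNoncharacters(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> sb.appendCodePoint(isNoncharacter(cp) ? 0xFFFD : cp));
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "a\uFFFEb";
        // Pass-through: the string is well-formed and can be handed on unchanged.
        System.out.println(text.length());
        // Flagging: substitute the replacement character rather than deleting.
        System.out.println(replaceNoncharacters(text).equals("a\uFFFDb"));
    }
}
```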

Thanks
Leo

On 12/24/18 3:06 AM, Clément MATHIEU wrote:
> Hi,
> 
> I am wondering if UnicodeDecoder's handling of U+FFFE is compliant with
> the current Unicode specification. The suspicious code is:
> 
>         if (c == REVERSED_MARK) {
>             // A reversed BOM cannot occur within middle of stream
>             return CoderResult.malformedForLength(2);
>         }
> 
> Up to Unicode 6.3, the Unicode specification said that U+FFFE is a
> noncharacter and that noncharacters "should never be interchanged".
> Returning CR_MALFORMED on U+FFFE appears to be correct for Java 8
> (Unicode 6.2).
> 
> However, Unicode 7 changed that and now says:
> 
>        Applications are free to use any of these noncharacter code
>        points internally. They have no standard interpretation when
>        exchanged outside the context of internal use. However, they are
>        not illegal in interchange, nor does their presence cause Unicode
>        text to be ill-formed. [...] They are not prohibited from
>        occurring in valid Unicode strings which happen to be
>        interchanged. [...]. If a noncharacter is received in open
>        interchange, an application is not required to interpret it in
>        any way. It is good practice, however, to recognize it as a
>        noncharacter and to take appropriate action, such as replacing it
>        with U+FFFD replacement character, to indicate the problem in
>        the text. It is not recommended to simply delete noncharacter
>        code points from such text, because of
>        the potential security issues caused by deleting uninterpreted
>        characters.
> 
> See:
>   - http://www.unicode.org/versions/corrigendum9.html
>   - https://www.unicode.org/versions/Unicode11.0.0/ch23.pdf (23.7)
> 
> Do you think that returning CR_MALFORMED is still OK?
> 
> Regards,
> Clément MATHIEU
>