UnicodeDecoder U+FFFE handling

Sun Dec 23 19:06:07 UTC 2018

Hi,

I am wondering whether UnicodeDecoder's handling of U+FFFE is compliant
with the current Unicode specification. The suspicious code is:

       if (c == REVERSED_MARK) {
           // A reversed BOM cannot occur within middle of stream
           return CoderResult.malformedForLength(2);
       }
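
For anyone who wants to observe this, here is a minimal sketch that
feeds U+FFFE in the middle of a UTF-16BE stream and returns the
resulting CoderResult. The class name, method name and input bytes are
my own illustration, not JDK code, and I am assuming the UTF-16BE
decoder goes through the branch quoted above. The result depends on
the JDK in use, so the sketch only reports it.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical probe class, for illustration only.
public class ReversedMarkProbe {

    // "A", U+FFFE, "B" encoded as UTF-16BE (illustrative input).
    static final byte[] INPUT = {0x00, 0x41, (byte) 0xFF, (byte) 0xFE, 0x00, 0x42};

    // Runs a single decode call with the default REPORT action, so a
    // malformed sequence is returned as a CoderResult and not replaced.
    static CoderResult probe() {
        CharsetDecoder decoder = StandardCharsets.UTF_16BE.newDecoder();
        CharBuffer out = CharBuffer.allocate(8);
        return decoder.decode(ByteBuffer.wrap(INPUT), out, true);
    }

    public static void main(String[] args) {
        // A malformed result of length 2 means U+FFFE was rejected;
        // an underflow result means the whole input was decoded.
        System.out.println(probe());
    }
}
```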

Up to Unicode 6.3, the Unicode specification said that U+FFFE is a
noncharacter and that noncharacters "should never be interchanged".
Returning CR_MALFORMED on U+FFFE therefore appears to be correct for
Java 8 (Unicode 6.2).

However, Unicode 7 changed that and now says: 

      Applications are free to use any of these noncharacter code
      points internally. They have no standard interpretation when
      exchanged outside the context of internal use. However, they are
      not illegal in interchange, nor does their presence cause Unicode
      text to be ill-formed. [...] They are not prohibited from
      occurring in valid Unicode strings which happen to be
      interchanged. [...]. If a noncharacter is received in open
      interchange, an application is not required to interpret it in
      any way. It is good practice, however, to recognize it as a
      noncharacter and to take appropriate action, such as replacing it
      with U+FFFD replacement character, to indicate the problem in the
      text. It is not recommended to simply delete noncharacter code
      points from such text, because of the potential security issues
      caused by deleting uninterpreted characters.
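
To make the "good practice" described in that passage concrete, here
is a rough sketch of recognizing noncharacters and replacing them with
U+FFFD instead of deleting them. The class and method names are
hypothetical and this is not something the JDK does today. It relies
on the noncharacter set being U+FDD0..U+FDEF plus the last two code
points of every plane.

```java
// Hypothetical helper, for illustration only.
public class NoncharacterFilter {

    // True for U+FDD0..U+FDEF and for the last two code points of each
    // plane (U+FFFE, U+FFFF, U+1FFFE, ..., U+10FFFF).
    // Assumes cp is a valid code point in 0..0x10FFFF.
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    // Replaces each noncharacter with U+FFFD; nothing is deleted.
    static String replaceNoncharacters(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints().forEach(cp ->
                sb.appendCodePoint(isNoncharacter(cp) ? 0xFFFD : cp));
        return sb.toString();
    }

    public static void main(String[] args) {
        String cleaned = replaceNoncharacters("A\uFFFEB");
        System.out.printf("U+%04X%n", (int) cleaned.charAt(1));
    }
}
```

Whether a decoder should do this itself, or leave U+FFFE untouched for
the application to deal with, is exactly the open question.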

See:
 - http://www.unicode.org/versions/corrigendum9.html
 - https://www.unicode.org/versions/Unicode11.0.0/ch23.pdf (23.7)

Do you think that returning CR_MALFORMED is still OK?

Regards,
Clément MATHIEU