UnicodeDecoder U+FFFE handling
Clément MATHIEU
clement at unportant.info
Sun Dec 23 19:06:07 UTC 2018
Hi,
I am wondering whether UnicodeDecoder's handling of U+FFFE is compliant
with the current Unicode specification. The suspicious code is:
    if (c == REVERSED_MARK) {
        // A reversed BOM cannot occur within middle of stream
        return CoderResult.malformedForLength(2);
    }
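For anyone who wants to observe this from user code, here is a minimal sketch. It assumes that the decoder returned by StandardCharsets.UTF_16.newDecoder() is the one backed by UnicodeDecoder; the class name ReversedMarkProbe and the probe() helper are only illustrative. The outcome may differ between JDK releases, so the program just prints whatever CoderResult it gets.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Illustrative helper, not part of the JDK.
class ReversedMarkProbe {

    // Decodes the given bytes as UTF-16 and returns the CoderResult of the
    // decode call, so that a malformed-input result is reported, not replaced.
    static CoderResult probe(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_16.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        // Each pair of input bytes yields at most one char, so this is enough room.
        CharBuffer out = CharBuffer.allocate(bytes.length);
        return decoder.decode(ByteBuffer.wrap(bytes), out, true);
    }

    public static void main(String[] args) {
        // BOM (FE FF), 'A' (00 41), U+FFFE in big-endian order (FF FE), 'B' (00 42)
        byte[] bytes = {
            (byte) 0xFE, (byte) 0xFF, 0x00, 0x41,
            (byte) 0xFF, (byte) 0xFE, 0x00, 0x42
        };
        System.out.println(probe(bytes));
    }
}
```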
Up to Unicode 6.3, the Unicode specification said that U+FFFE is a
noncharacter and that noncharacters "should never be interchanged".
Returning CR_MALFORMED on U+FFFE therefore appears to be correct for
Java 8 (Unicode 6.2).
However, Unicode 7 changed that and now says:
    Applications are free to use any of these noncharacter code
    points internally. They have no standard interpretation when
    exchanged outside the context of internal use. However, they are
    not illegal in interchange, nor does their presence cause Unicode
    text to be ill-formed. [...] They are not prohibited from
    occurring in valid Unicode strings which happen to be
    interchanged. [...] If a noncharacter is received in open
    interchange, an application is not required to interpret it in
    any way. It is good practice, however, to recognize it as a
    noncharacter and to take appropriate action, such as replacing it
    with U+FFFD replacement character, to indicate the problem in the
    text. It is not recommended to simply delete noncharacter code
    points from such text, because of the potential security issues
    caused by deleting uninterpreted characters.
See:
- http://www.unicode.org/versions/corrigendum9.html
- https://www.unicode.org/versions/Unicode11.0.0/ch23.pdf (23.7)
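
To make the "good practice" from the quoted text concrete (replace a noncharacter with U+FFFD rather than delete it), here is a rough sketch at the String level. The Noncharacters class and its two methods are hypothetical names of mine, not JDK APIs; the ranges used are the 66 noncharacters defined by the standard (U+FDD0..U+FDEF plus the last two code points of each plane).

```java
// Hypothetical helper illustrating replacement of noncharacters; not a JDK API.
class Noncharacters {

    // True for U+FDD0..U+FDEF and for U+nFFFE / U+nFFFF in every plane.
    static boolean isNoncharacter(int codePoint) {
        return (codePoint >= 0xFDD0 && codePoint <= 0xFDEF)
                || (codePoint & 0xFFFE) == 0xFFFE;
    }

    // Returns a copy of s in which each noncharacter is replaced by U+FFFD.
    // Nothing is deleted, so the number of code points is unchanged.
    static String replaceNoncharacters(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().forEach(cp ->
                sb.appendCodePoint(isNoncharacter(cp) ? 0xFFFD : cp));
        return sb.toString();
    }
}
```

This only illustrates the option the specification describes; it says nothing about what the decoder itself should do.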
Do you think that returning CR_MALFORMED is still OK?
Regards,
Clément MATHIEU