UnicodeDecoder U+FFFE handling
li.jiang at oracle.com
Wed Jan 2 06:06:25 UTC 2019
This request sounds reasonable: since Unicode 7, U+FFFE in the middle of
a stream should not be considered malformed.
From the FAQ on private-use characters and noncharacters [1]:
[1] http://www.unicode.org/faq/private_use.html
Q: Are noncharacters invalid in Unicode strings and UTFs?
A: Absolutely not. Noncharacters do not cause a Unicode string to be
ill-formed in any UTF.
Q: So how should libraries and tools handle noncharacters?
A: Library APIs, components, and tool applications (such as low-level
text editors) which handle all Unicode strings should also handle
noncharacters. Often this means simple pass-through, the same way such
an API or tool would handle a reserved unassigned code point.
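To check how a given JDK treats U+FFFE in the middle of a stream, a small probe along these lines may help. It is only a sketch: the class name FffeProbe and the method probe are made up for illustration, and the result depends on the JDK version in use, so no particular output is claimed here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class FffeProbe {
    // Decodes "A", U+FFFE, "B" (encoded as UTF-16BE) and reports the outcome.
    // FffeProbe and probe() are hypothetical names used only for this sketch.
    static String probe() {
        byte[] bytes = {0x00, 0x41, (byte) 0xFF, (byte) 0xFE, 0x00, 0x42};
        CharsetDecoder decoder = StandardCharsets.UTF_16BE.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            String s = decoder.decode(ByteBuffer.wrap(bytes)).toString();
            return "decoded " + s.length() + " chars";
        } catch (CharacterCodingException e) {
            // A decoder that still treats U+FFFE as malformed ends up here.
            return "rejected: " + e;
        }
    }

    public static void main(String[] args) {
        System.out.println(probe());
    }
}
```

A decoder that passes noncharacters through should take the "decoded" branch, while one that still returns a malformed result for U+FFFE should take the "rejected" branch.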
Thanks
Leo
On 12/24/18 3:06 AM, Clément MATHIEU wrote:
> Hi,
>
> I am wondering if UnicodeDecoder handling of U+FFFE is compliant with
> current Unicode specification. Suspicious code is:
>
> if (c == REVERSED_MARK) {
>     // A reversed BOM cannot occur within middle of stream
>     return CoderResult.malformedForLength(2);
> }
>
> Up to Unicode 6.3, the Unicode specification said that U+FFFE is a
> noncharacter and that noncharacters "should never be interchanged".
> Returning CR_MALFORMED on U+FFFE appears to be correct for Java 8
> (Unicode 6.2).
>
> However, Unicode 7 changed that and now says:
>
> Applications are free to use any of these noncharacter code
> points internally. They have no standard interpretation when
> exchanged outside the context of internal use. However, they are
> not illegal in interchange, nor does their presence cause Unicode
> text to be ill-formed. [...] They are not prohibited from
> occurring in valid Unicode strings which happen to be
> interchanged. [...] If a noncharacter is received in open
> interchange, an application is not required to interpret it in
> any way. It is good practice, however, to recognize it as a
> noncharacter and to take appropriate action, such as replacing it
> with U+FFFD replacement character, to indicate
> the problem in the text. It is not recommended to simply
> delete noncharacter code points from such text, because of
> the potential security issues caused by deleting uninterpreted
> characters.
>
> See:
> - http://www.unicode.org/versions/corrigendum9.html
> - https://www.unicode.org/versions/Unicode11.0.0/ch23.pdf (23.7)
>
> Do you think that returning CR_MALFORMED is still OK?
>
> Regards,
> Clément MATHIEU
>
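The good practice described in the quoted Unicode text (recognize a noncharacter and replace it with U+FFFD rather than deleting it) could be sketched roughly as follows. This is an illustration only, not what UnicodeDecoder does: the class NoncharacterFilter and both of its methods are hypothetical names. The check relies on the noncharacters being U+FDD0..U+FDEF plus the last two code points of every plane.

```java
public class NoncharacterFilter {
    // True for the 66 noncharacters: U+FDD0..U+FDEF and the last two
    // code points of each plane (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ...).
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    // Replaces every noncharacter with U+FFFD instead of dropping it,
    // so the position of the problem stays visible in the text.
    static String replaceNoncharacters(String s) {
        StringBuilder out = new StringBuilder(s.length());
        s.codePoints().forEach(cp ->
                out.appendCodePoint(isNoncharacter(cp) ? 0xFFFD : cp));
        return out.toString();
    }

    public static void main(String[] args) {
        String cleaned = replaceNoncharacters("A\uFFFEB");
        System.out.println(cleaned.equals("A\uFFFDB"));
    }
}
```

Whether to substitute or pass through is an application-level decision. A low-level decoder would more likely just pass the code point through and leave that choice to the caller.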
More information about the core-libs-dev
mailing list