<i18n dev> UnicodeDecoder U+FFFE handling

Thu Jan 3 21:06:47 UTC 2019

Sounds reasonable. Filed the following issue:

https://bugs.openjdk.java.net/browse/JDK-8216140

Naoto
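
For anyone who wants to check how a given JDK behaves before and after the issue above is resolved, here is a minimal probe. It is only a sketch: the class name `ReversedMarkProbe` and the method `probe` are made up for illustration, and it assumes the `UTF-16BE` charset goes through the decoder path quoted further down. The outcome depends on the JDK release, so no expected output is given.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the JDK.
final class ReversedMarkProbe {

    /** Decodes 'A', U+FFFE, 'B' as UTF-16BE and reports what the decoder did. */
    static String probe() {
        // U+FFFE sits in the middle of the stream, not at the start.
        byte[] bytes = { 0x00, 0x41, (byte) 0xFF, (byte) 0xFE, 0x00, 0x42 };
        try {
            String s = StandardCharsets.UTF_16BE.newDecoder()
                    // REPORT makes malformed input throw instead of being replaced.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
            return "decoded " + s.length() + " chars";
        } catch (CharacterCodingException e) {
            return "rejected: " + e;
        }
    }

    public static void main(String[] args) {
        System.out.println("java.version=" + System.getProperty("java.version"));
        System.out.println(probe());
    }
}
```

A "rejected" result means the decoder still treats a mid-stream U+FFFE as malformed; a "decoded" result means it passes the noncharacter through.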

On 1/1/19 10:06 PM, li.jiang at oracle.com wrote:
> This request sounds reasonable: since Unicode 7, a U+FFFE in the 
> middle of a stream should not be considered malformed.
> 
> See the Unicode FAQ on private-use characters and noncharacters: 
> http://www.unicode.org/faq/private_use.html
> 
> Q: Are noncharacters invalid in Unicode strings and UTFs?
> A: Absolutely not. Noncharacters do not cause a Unicode string to be 
> ill-formed in any UTF.
> 
> Q: So how should libraries and tools handle noncharacters?
> A: Library APIs, components, and tool applications (such as low-level 
> text editors) which handle all Unicode strings should also handle 
> noncharacters. Often this means simple pass-through, the same way such 
> an API or tool would handle a reserved unassigned code point.
> 
> Thanks
> Leo
> 
> On 12/24/18 3:06 AM, Clément MATHIEU wrote:
>> Hi,
>>
>> I am wondering if UnicodeDecoder's handling of U+FFFE is compliant with
>> the current Unicode specification. The suspicious code is:
>>
>>         if (c == REVERSED_MARK) {
>>              // A reversed BOM cannot occur within middle of stream
>>              return CoderResult.malformedForLength(2);
>>         }
>>
>> Up to Unicode 6.3, the Unicode specification said that U+FFFE is a
>> noncharacter and that noncharacters "should never be interchanged".
>> Returning CR_MALFORMED on U+FFFE appears to be correct for Java 8
>> (Unicode 6.2).
>>
>> However, Unicode 7 changed that and now says:
>>
>>        Applications are free to use any of these noncharacter code
>>        points internally. They have no standard interpretation when
>>        exchanged outside the context of internal use. However, they are
>>        not illegal in interchange, nor does their presence cause Unicode
>>        text to be ill-formed. [...] They are not prohibited from
>>        occurring in valid Unicode strings which happen to be
>>        interchanged. [...]. If a noncharacter is received in open
>>        interchange, an application is not required to interpret it in
>>        any way. It is good practice, however, to recognize it as a
>>        noncharacter and to take appropriate action, such as replacing it
>>        with U+FFFD replacement character, to indicate the problem in
>>        the text. It is not recommended to simply delete noncharacter
>>        code points from such text, because of
>>        the potential security issues caused by deleting uninterpreted
>>        characters.
>>
>> See:
>>   - http://www.unicode.org/versions/corrigendum9.html
>>   - https://www.unicode.org/versions/Unicode11.0.0/ch23.pdf (23.7)
>>
>> Do you think that returning CR_MALFORMED is still OK?
>>
>> Regards,
>> Clément MATHIEU
>>
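
The Unicode passage quoted above suggests recognizing a noncharacter and replacing it with U+FFFD rather than deleting it. A minimal sketch of that policy follows. The class `NoncharacterSketch` and its methods are hypothetical names, not JDK APIs, and nothing here is meant as the fix for the issue filed above. It relies on the standard definition of the 66 noncharacters: U+FDD0..U+FDEF plus the last two code points of every plane.

```java
// Hypothetical helper, not part of the JDK.
final class NoncharacterSketch {

    static final int REPLACEMENT = 0xFFFD;

    /** True for U+FDD0..U+FDEF and for U+nFFFE / U+nFFFF in any plane. */
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    /** Replaces each noncharacter with U+FFFD; nothing is deleted. */
    static String replaceNoncharacters(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().forEach(cp ->
                sb.appendCodePoint(isNoncharacter(cp) ? REPLACEMENT : cp));
        return sb.toString();
    }

    public static void main(String[] args) {
        String cleaned = replaceNoncharacters("a\uFFFEb");
        // Print code points so the result is readable on any console.
        cleaned.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

Simple pass-through, as the FAQ quoted earlier describes, would amount to skipping `replaceNoncharacters` entirely; which of the two is appropriate is a decision for the calling application, not for the decoder.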