<i18n dev> RL1.7 Code Points
Masayoshi Okutsu
masayoshi.okutsu at oracle.com
Sun Jan 23 20:38:40 PST 2011
Are you talking about unpaired surrogates or something else?
Thanks,
Masayoshi
On 1/24/2011 5:22 AM, Tom Christiansen wrote:
> I am somewhat uncertain, but I believe that Java
> *almost* meets this requirement.
>
> 1.7 Code Points
>
> A fundamental requirement is that Unicode text be interpreted
> semantically by code point, not code units.
>
> RL1.7 Supplementary Code Points
>
> To meet this requirement, an implementation shall handle the full
> range of Unicode code points, including values from U+FFFF to
> U+10FFFF. In particular, where UTF-16 is used, a sequence
> consisting of a leading surrogate followed by a trailing surrogate
> shall be handled as a single code point in matching.
>
> Java tries to make things work this way, and always does so on well-formed
> input. I say *almost* because the regex engine will sometimes match
> individual code units in ill-formed UTF-16 sequences. I believe this
> behaviour to be contrary to the fundamental requirement for Level 1
> compliance that Unicode text never be interpreted as code units.
>
> Fortunately, this does not seem too difficult to fix.
>
> --tom
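
For readers following the thread, here is a minimal sketch of the distinction being discussed, not a reproduction of the specific cases Tom has in mind. It assumes only the standard java.util.regex.Pattern and java.lang.Character APIs; the class name SurrogateCheck and its two helper methods are hypothetical names chosen for illustration. It shows that "." consumes a well-formed surrogate pair as one code point, and that a lone surrogate (ill-formed UTF-16) is also accepted by "." as if it were a character of its own.

```java
import java.util.regex.Pattern;

public class SurrogateCheck {

    // Hypothetical helper: returns true if the text contains a surrogate
    // code unit that is not part of a well-formed high+low pair.
    public static boolean hasUnpairedSurrogate(CharSequence text) {
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (Character.isHighSurrogate(c)) {
                // A high surrogate must be followed directly by a low surrogate.
                if (i + 1 < text.length()
                        && Character.isLowSurrogate(text.charAt(i + 1))) {
                    i += 2;
                    continue;
                }
                return true;
            }
            if (Character.isLowSurrogate(c)) {
                // A low surrogate with no preceding high surrogate.
                return true;
            }
            i++;
        }
        return false;
    }

    // Hypothetical helper: true if the regex "." matches the entire text,
    // i.e. the engine treats the text as exactly one matchable unit.
    public static boolean matchesAsSingleDot(String text) {
        return Pattern.matches(".", text);
    }

    public static void main(String[] args) {
        String pair = "\uD83D\uDE00"; // U+1F600 as a well-formed surrogate pair
        String lone = "\uD83D";       // a high surrogate on its own (ill-formed)

        System.out.println("pair: char length = " + pair.length()
                + ", code points = " + pair.codePointCount(0, pair.length()));
        System.out.println("pair matches \".\":  " + matchesAsSingleDot(pair));
        System.out.println("pair matches \"..\": " + Pattern.matches("..", pair));
        System.out.println("lone matches \".\":  " + matchesAsSingleDot(lone));
        System.out.println("lone is ill-formed: " + hasUnpairedSurrogate(lone));
    }
}
```

A caller that wants matching to be strictly by code point could run a check like hasUnpairedSurrogate on the input first and reject or repair ill-formed text before handing it to the regex engine. That is one possible workaround, not a statement of how the engine itself should be fixed.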