<i18n dev> RL1.7 Code Points

Sun Jan 23 20:38:40 PST 2011

Are you talking about unpaired surrogates or something else?

Thanks,
Masayoshi

On 1/24/2011 5:22 AM, Tom Christiansen wrote:
> I am somewhat uncertain, but I believe that Java
> *almost* meets this requirement.
>
>      1.7 Code Points
>
>      A fundamental requirement is that Unicode text be interpreted
>      semantically by code point, not code units.
>
>      RL1.7  Supplementary Code Points
>
>          To meet this requirement, an implementation shall handle the full
>          range of Unicode code points, including values from U+FFFF to
>          U+10FFFF. In particular, where UTF-16 is used, a sequence
>          consisting of a leading surrogate followed by a trailing surrogate
>          shall be handled as a single code point in matching.
>
> Java tries to make things work this way, and always does so on well-formed
> input.  I say "almost" because the regex engine will sometimes match
> individual code units in ill-formed UTF-16 sequences.  I believe this
> behaviour to be contrary to the fundamental requirement for Level 1
> compliance that Unicode text never be interpreted as code units.
>
> Fortunately, this does not seem too difficult to fix.
>
> --tom
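
For concreteness, here is a minimal sketch of one possible reading of the behaviour under discussion. The thread does not say which case is meant, so this is an assumption, not necessarily the one Tom has in mind. The class name SurrogateDotDemo and the helper countDotMatches are made up for illustration; only java.util.regex.Pattern and Matcher are real APIs. It counts how many times "." matches in a well-formed surrogate pair and in the same two code units reversed, which leaves two unpaired surrogates.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SurrogateDotDemo {
    // "." matches any single character other than a line terminator.
    private static final Pattern DOT = Pattern.compile(".");

    // Hypothetical helper: counts successive matches of "." in the input.
    public static int countDotMatches(String input) {
        Matcher m = DOT.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF as a well-formed surrogate pair.
        String wellFormed = "\uD834\uDD1E";
        // The same two code units in reverse order: two unpaired surrogates.
        String illFormed = "\uDD1E\uD834";

        // Number of UTF-16 code units in the well-formed string.
        System.out.println(wellFormed.length());
        // Number of code points in the well-formed string.
        System.out.println(wellFormed.codePointCount(0, wellFormed.length()));
        // Number of "." matches in each string.
        System.out.println(countDotMatches(wellFormed));
        System.out.println(countDotMatches(illFormed));
    }
}
```

As far as I can tell this should print 2, 1, 1, 2: the pair is matched as a single code point, while each unpaired surrogate in the ill-formed string is matched on its own. Whether that second result counts as "matching code units" in the RL1.7 sense is exactly the question raised above.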