<i18n dev> RL1.7 Code Points

Sun Jan 23 12:22:23 PST 2011

I am somewhat uncertain, but I believe that Java 
*almost* meets this requirement.

    1.7 Code Points

    A fundamental requirement is that Unicode text be interpreted
    semantically by code point, not code units.

    RL1.7	Supplementary Code Points

        To meet this requirement, an implementation shall handle the full
        range of Unicode code points, including values from U+FFFF to
        U+10FFFF. In particular, where UTF-16 is used, a sequence
        consisting of a leading surrogate followed by a trailing surrogate
        shall be handled as a single code point in matching.

Java tries to make things work this way, and always does so on well-formed
input.  I say "almost" because the regex engine will sometimes match
individual code units in ill-formed UTF-16 sequences (that is, sequences
containing unpaired surrogates).  I believe this behaviour to be contrary
to the fundamental requirement for Level 1 compliance that Unicode text
never be interpreted as code units.
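
As a rough illustration of the distinction, here is a minimal sketch using
java.util.regex.  The class name CodePointDemo and its helper are
hypothetical, and the ill-formed inputs are only one way the code-unit
behaviour can show up; they are not necessarily the exact cases meant above.

```java
import java.util.regex.Pattern;

public class CodePointDemo {
    // U+1F600 encoded as a well-formed surrogate pair (lead, then trail).
    public static final String PAIR = "\uD83D\uDE00";
    // Ill-formed: the same two code units in the wrong order.
    public static final String REVERSED = "\uDE00\uD83D";
    // Ill-formed: a lead surrogate with no trail surrogate after it.
    public static final String LONE_LEAD = "\uD83D";

    // True if the whole input matches the regex.
    public static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Well-formed input: two code units, one code point.
        System.out.println(PAIR.length());                          // 2
        System.out.println(PAIR.codePointCount(0, PAIR.length()));  // 1
        System.out.println(matches(".", PAIR));                     // true
        System.out.println(matches("..", PAIR));                    // false

        // Ill-formed input: each unpaired surrogate is matched on its own,
        // which amounts to matching an individual code unit.
        System.out.println(matches(".", LONE_LEAD));                // true
        System.out.println(matches("..", REVERSED));                // true
    }
}
```

On the well-formed pair, "." consumes both code units at once, so ".."
cannot match.  On the ill-formed strings, each "." consumes a single
unpaired surrogate.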

Fortunately, this does not seem too difficult to fix.

--tom