<i18n dev> RL1.1 Hex Notation

Tom Christiansen tchrist at perl.com
Thu Jan 27 15:46:15 PST 2011


Sherman wrote:

> The difference is at

>         test("\uD800\uDF3C", "^[\\x{D800}\\x{DF3C}]+$");
>         test("\uD800\uDF3C", "^[\\x{DF3C}\\x{D800}]+$");

> You can have unpaired surrogate in Java String, but
> if you have a paired one you can't say I want them to
> be two separated "unpaired" surrogates.

I was being wicked. :)  I knew that would happen: I was taking (unfair?)
advantage of Java's regexes' smarts compared to its strings' lameness.

As I know you know, it's all because although Java *regexes* correctly
deal in true logical Unicode code points qua characters (the regex
compiler's first step is to copy all of the pattern's code points into
an int array, effectively making regexes UTF-32ish in character),
Java's native *strings* are forever stuck with all the unfortunate
restrictions inherent to serialized UTF-16.
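
A minimal sketch of that split, assuming Java 7 or later (where the
\x{...} notation from this thread is available). The class and method
names are made up for illustration; only String.length and
Pattern.matches are real library calls:

```java
import java.util.regex.Pattern;

class RegexVersusString {
    // U+1033C, stored by Java as the surrogate pair D800 DF3C.
    static final String S = "\uD800\uDF3C";

    // The string API counts UTF-16 code units.
    static int codeUnits() {
        return S.length();
    }

    // The regex engine treats the pair as one character, so a single dot
    // covers the whole string.
    static boolean dotMatchesWhole() {
        return Pattern.matches("^.$", S);
    }

    // The same code point written in hex notation (Java 7+ syntax).
    static boolean hexMatches() {
        return Pattern.matches("^\\x{1033C}$", S);
    }

    public static void main(String[] args) {
        System.out.println(codeUnits());        // 2
        System.out.println(dotMatchesWhole());  // true
        System.out.println(hexMatches());       // true
    }
}
```

The string reports two units while the regex sees a single character,
which is the mismatch the two test patterns above were poking at.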

It has always struck me as a terribly unfortunate consequence of the 
UCS-2 => UTF-16 hack that Java should make the even more unfortunate 
programmer think constantly of annoying serialization issues instead 
of logical code points.  This decision has led to many unfortunate
paradoxes, including these two:

 *  A Java "char"/"Character" data type cannot hold every Unicode
    character.  More simply put, a Java "Character" cannot always
    hold a Unicode "character" -- because Java's native unit is the
    16-bit UTF-16 code unit, not the Unicode code point, so anything
    beyond U+FFFF needs two of them.

 *  Given strings A and B, and a LENGTH function returning the
    number of code points in its string argument, neither of
    these fundamental logical guarantees can be made:

	LENGTH(A + B) == LENGTH(A) + LENGTH(B)
	LENGTH(A + B) == LENGTH(B + A)
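
Both paradoxes can be reproduced with the two surrogates from the test
strings above.  This is only a sketch: the class name and the length
helper are hypothetical stand-ins for the LENGTH function, built on the
real String.codePointCount and Character.charCount methods.

```java
class SurrogateLengthDemo {
    // LENGTH from the text: the number of code points in s.
    // An unpaired surrogate counts as one code point on its own.
    static int length(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String a = "\uD800"; // lone high surrogate
        String b = "\uDF3C"; // lone low surrogate

        // First paradox: U+1033C needs two chars, so no single char holds it.
        System.out.println(Character.charCount(0x1033C)); // 2

        // Second paradox: concatenation fuses the two halves into one
        // code point, but only in one order.
        System.out.println(length(a) + length(b)); // 2
        System.out.println(length(a + b));         // 1
        System.out.println(length(b + a));         // 2
    }
}
```

Here A + B forms the valid pair for U+1033C and collapses to a single
code point, while B + A stays two unpaired surrogates, so both
equalities fail at once.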

I don't know which of those two paradoxes bothers me more; both 
make my head spin and eyes water.  They are... *unfortunate*.

I dearly, desperately wish Java strings were logical sequences of code
points instead of UTF-16 of all awful things!  If only that had been
nipped in the bud.  If only if only if only.  I also know that that
longing shall remain forever unrequited.  That doesn't stop me from
wishing it were otherwise.

Unfortunately. :(

Thank you, Sherman, for all your hard work!

--tom

