<i18n dev> RL1.1 Hex Notation
Tom Christiansen
tchrist at perl.com
Thu Jan 27 15:46:15 PST 2011
Sherman wrote:
> The difference is at
> test("\uD800\uDF3C", "^[\\x{D800}\\x{DF3C}]+$");
> test("\uD800\uDF3C", "^[\\x{DF3C}\\x{D800}]+$");
> You can have unpaired surrogate in Java String, but
> if you have a paired one you can't say I want them to
> be two separated "unpaired" surrogates.
I was being wicked. :) I knew that would happen: I was taking (unfair?)
advantage of Java's regexes' smarts compared to its strings' lameness.
As I know you know, it's all because although Java *regexes* correctly
deal in true logical Unicode code points qua characters (by virtue of
copying all the code points for the pattern into an int array as the
first thing it does, effectively making regexes UTF-32ish in character),
Java's native *strings* are forever stuck with all the unfortunate
restrictions inherent to serialized UTF-16.
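To make the code point versus code unit distinction concrete, here is a minimal sketch of copying a string's code points into an int array. It only illustrates the kind of copy described above; it is not the actual java.util.regex code, and the class and helper names (CodePointView, toCodePoints) are made up for this example.

```java
public class CodePointView {
    // Hypothetical helper: copy the code points of s into an int array,
    // so that each array slot holds one whole code point.
    static int[] toCodePoints(String s) {
        int[] cps = new int[s.codePointCount(0, s.length())];
        int i = 0;
        for (int off = 0; off < s.length(); ) {
            int cp = s.codePointAt(off);
            cps[i++] = cp;
            off += Character.charCount(cp); // advance 1 or 2 chars
        }
        return cps;
    }

    public static void main(String[] args) {
        String s = "\uD800\uDF3C"; // the surrogate pair for U+1033C
        System.out.println(s.length());             // 2 UTF-16 code units
        System.out.println(toCodePoints(s).length); // 1 code point
        System.out.printf("U+%04X%n", toCodePoints(s)[0]); // U+1033C
    }
}
```

The string reports two chars while the array holds a single element, which is the mismatch the test strings in the quoted message exercise.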
It has always struck me as a terribly unfortunate consequence of the
UCS-2 => UTF-16 hack that Java should make the even more unfortunate
programmer think constantly of annoying serialization issues instead
of logical code points. This decision has led to many unfortunate
paradoxes, including these two:
  * A Java "char"/"Character" data type cannot hold a Unicode
    character. More simply put, a Java "Character" cannot
    hold a Unicode "character" -- because Java does not use
    Unicode as its native character set: it uses UTF-16.

  * Given strings A and B, and a LENGTH function returning the
    number of code points in its string argument, neither of
    these fundamental logical guarantees can be made:

        LENGTH(A + B) == LENGTH(A) + LENGTH(B)
        LENGTH(A + B) == LENGTH(B + A)
I don't know which of those two paradoxes bothers me more; both
make my head spin and eyes water. They are... *unfortunate*.
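Both paradoxes can be reproduced in a few lines. This is only a sketch: LENGTH is assumed to be String.codePointCount over the whole string, and the class name SurrogateParadoxes and its length helper are invented for the illustration. A is a lone high surrogate and B a lone low surrogate, so joining them in one order fuses them into a single code point and in the other order does not.

```java
public class SurrogateParadoxes {
    // LENGTH from the text, assumed here to mean the number of
    // code points in s (an unpaired surrogate counts as one).
    static int length(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // Paradox 1: a code point above U+FFFF needs two chars.
        int cp = 0x1033C;
        System.out.println(Character.charCount(cp));      // 2
        System.out.println(Character.toChars(cp).length); // 2

        // Paradox 2: A is a lone high surrogate, B a lone low one.
        String a = "\uD800";
        String b = "\uDF3C";
        System.out.println(length(a) + length(b)); // 2
        System.out.println(length(a + b));         // 1 (they fuse into a pair)
        System.out.println(length(b + a));         // 2 (low then high stays unpaired)
    }
}
```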
I dearly, desperately wish Java strings were logical sequences of code
points instead of UTF-16 of all awful things! If only that had been
nipped in the bud. If only if only if only. I also know that that
longing shall remain forever unrequited. That doesn't stop me from
wishing it were otherwise.
Unfortunately. :(
Thank you, Sherman, for all your hard work!
--tom