<i18n dev> regex rewriting code (part 1 of 3)

Tue Jan 25 10:52:41 PST 2011

Sherman, referring to Java's ASCII-only senses of \w and \s,
and of \p{alpha} and \p{space}, wrote:

> (does Perl 5 work in this way as well?)

No, not for a very, very long time.  For most of Perl's life,
charclass escapes like \w have been Unicode-aware.

However, it did take us some time to separate out the POSIX
names from the Unicode properties.  As I've mentioned, we
eventually solved this by prepending "POSIX" to the name of the
property.  So \p{POSIX_Alpha} gets you exactly what POSIX says--
which in Perl is locale-aware; I don't believe this is true of
Java, though.  Whereas \p{Alpha} gets you \p{Alphabetic} as
RL1.2a requires under both recommendations.

As for \w and such, Perl defines \w, \d, \s, and \b -- and their
uppercase complements -- to work exactly as the definitions given in
Annex C of tr18's RL1.2a say they should under the Standard
Recommendation.

For \d, there is some flexibility in that the POSIX Compatible version
allows it to match only [0-9] instead of all of \p{Decimal_Number}.

To meet the requirements of RL1.2a, one must state whether one is using
the Standard Recommendation or the POSIX Compatible version.  Only for \d
does Java use one of the allowable senses.  For the others it chooses
its own definitions, which are out of compliance with RL1.2a.  (And
Java does not support Annex C's \X at all.  I know that that one
is on your own personal wish-list, Sherman.)
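
For anyone who wants to check the ASCII-only senses described above
on their own JDK, here is a minimal sketch.  The class and method
names are made up for illustration; only java.util.regex.Pattern is
real API.  The patterns are compiled with no flags, so this probes
the default senses only.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: probes Java's default (no-flag) charclass escapes.
class AsciiOnlySenses {

    // True if the whole of s is matched by regex.
    static boolean matches(String regex, String s) {
        return Pattern.compile(regex).matcher(s).matches();
    }

    public static void main(String[] args) {
        String eAcute  = "\u00e9";  // LATIN SMALL LETTER E WITH ACUTE
        String emSpace = "\u2003";  // EM SPACE, which has White_Space
        String arabic3 = "\u0663";  // ARABIC-INDIC DIGIT THREE, gc=Nd

        // The escapes and POSIX-style names under discussion.
        System.out.println("\\w        e-acute:  " + matches("\\w", eAcute));
        System.out.println("\\p{Alpha} e-acute:  " + matches("\\p{Alpha}", eAcute));
        System.out.println("\\s        em space: " + matches("\\s", emSpace));
        System.out.println("\\d        arabic 3: " + matches("\\d", arabic3));

        // General-category properties, for comparison.
        System.out.println("\\p{L}     e-acute:  " + matches("\\p{L}", eAcute));
        System.out.println("\\p{Nd}    arabic 3: " + matches("\\p{Nd}", arabic3));
    }
}
```

With default flags, the first four lines should report false and the
last two true: the escapes see only ASCII, while the general-category
properties see the whole repertoire.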

All of this is what first motivated me to write a drop-in
replacement that preprocesses Pattern strings to allow them to
work properly (read: per RL1.2a and others) on Unicode strings.

And it was because of that code that you first became known to
me, and vice versa.  So I think I should discuss it a bit.  
That's what parts 2 and 3 will be about.

> This is by design and I don't agree "this is a mess" conclusion.

Sherman, you're right: the mere fact that things like \w and \s,
or \p{alpha} or \p{space}, do not meet the requirements of RL1.2a
does not justify the conclusion that "this is a mess."  That would
be grossly overstating matters.  Taken alone, it is simply
non-conformant, not a mess.

What I meant was a mess was the mismatch between \w and \b.
It is this mismatch that makes possible nonsense results like
the one I wrote about here:

    One fundamental bug is that Java has misunderstood the connection
    between \b and \w regexes, so that now a string like "élève" is not
    matched by the pattern "\b\w+\b" at any point in the string.

It turns out that because of this, Java complies with none of
the permissible senses of \b and \w given in tr18.  I will
demonstrate that in part 3 of this letter, as well as provide code
demonstrating a remedy.
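
To make the connection between \b and \w concrete: a word boundary
is meant to be nothing more than a position with a \w on one side
and not on the other.  The sketch below spells that out with
lookarounds over a single word class, so the two cannot disagree.
This is not the code from part 3.  The class name, the helper, and
the word class [\p{L}\p{M}\p{Nd}\p{Pc}] (a rough general-category
stand-in for the RL1.2a sense of \w) are assumptions made purely
for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: derive the boundary from the very same word class.
class ConsistentBoundary {

    // Approximation of a Unicode-aware word character (an assumption,
    // not the exact RL1.2a property list).
    static final String WORD = "[\\p{L}\\p{M}\\p{Nd}\\p{Pc}]";

    // A boundary is a position with WORD on exactly one side.
    static final String BOUNDARY =
        "(?:(?<=" + WORD + ")(?!" + WORD + ")"
        + "|(?<!" + WORD + ")(?=" + WORD + "))";

    static final Pattern WHOLE_WORD =
        Pattern.compile(BOUNDARY + WORD + "+" + BOUNDARY);

    // Returns the first whole word found in text, or null if there is none.
    static String firstWord(String text) {
        Matcher m = WHOLE_WORD.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String eleve = "\u00e9l\u00e8ve";  // the string from the quoted report
        System.out.println("derived boundary: " + firstWord(eleve));

        // The built-in escapes, for comparison; what this prints
        // depends on the JDK in use.
        System.out.println("built-in escapes: "
            + Pattern.compile("\\b\\w+\\b").matcher(eleve).find());
    }
}
```

Because the boundary is built from the same class as the word
characters, the whole of "élève" comes back as one word.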

> While there are developers over there might like these properties to
> evolve to be the Unicode properties, I am pretty much sure there might
> be the same amount of developers there would prefer these properties
> be kept as the "original" POSIX properties.

My experience suggests that you are indeed correct that there are
many developers who want one thing and also many who want the other.
We faced this very thing in Perl, and you wouldn't believe how many 
messages and threads the issue spawned.  There are passionate views
on both sides of this issue.

The flaw in providing only the ASCII-only definitions as primitives is that
one cannot derive the full Unicode definitions from them, whereas if one
had the full Unicode definitions available as primitives, one could trivially
derive the ASCII-only definitions.  It's a matter of the choice of primitives.
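
A small sketch of what "trivially derive" can look like in practice:
intersecting a Unicode-aware class with \p{ASCII} yields the
ASCII-only sense, whereas there is nothing one can intersect [0-9]
with to get \p{Nd} back.  The class and constant names are
hypothetical, and the word class is only a general-category
approximation; the [..&&[..]] intersection syntax and \p{ASCII}
are standard java.util.regex.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: ASCII-only classes derived from Unicode primitives.
class DerivedAsciiClasses {

    // The Unicode primitive: any decimal digit.
    static final Pattern UNI_DIGIT = Pattern.compile("\\p{Nd}");

    // Derived by intersection: decimal digits that are also ASCII.
    static final Pattern ASCII_DIGIT =
        Pattern.compile("[\\p{Nd}&&[\\p{ASCII}]]");

    // Same trick for an approximate word class.
    static final Pattern ASCII_WORD =
        Pattern.compile("[[\\p{L}\\p{M}\\p{Nd}\\p{Pc}]&&[\\p{ASCII}]]");

    // True if the whole of s is matched by p.
    static boolean is(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String arabic3 = "\u0663";  // ARABIC-INDIC DIGIT THREE
        System.out.println("Unicode digit, arabic 3: " + is(UNI_DIGIT, arabic3));
        System.out.println("ASCII digit,   arabic 3: " + is(ASCII_DIGIT, arabic3));
        System.out.println("ASCII digit,   7:        " + is(ASCII_DIGIT, "7"));
        System.out.println("ASCII word,    e-acute:  " + is(ASCII_WORD, "\u00e9"));
    }
}
```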

Choosing ASCII-only as the bare primitive locks one into the 7-bit past in
what is even now very much a 21-bit world, and shall be even more so in
future.  You are sacrificing Unicode by choosing ASCII as the primitive.

But if you chose Unicode as the primitive, you would *not* be
sacrificing ASCII.  That makes it an unequal tradeoff between
the two sets of developers.  Favoring ASCII blocks Unicode, but
favoring Unicode does not block ASCII.  That isn't really fair.

I believe the only just thing is to provide both.  That's the only
way to make everyone happy.  That's what we finally arrived at in
Perl, at least, and it has largely worked out.  There are still
grumblers about "reasonable" defaults, but no one is locked out,
as they currently are in Java.

However, I understand that there are two separate issues here:

  * One is that you have used Unicode property names to mean
    something other than what the spec says they should mean.

  * The other is that the charclass aliases are either ASCII-only
    (\w \s \d) or broken (\b \B).

I have several different ideas about how to fix these in a backwards
compatible fashion, ideas I can discuss later in a separate letter.

Meanwhile, parts 2 and 3 of the current letter will discuss what
my rewrite code does and how it satisfies almost all of the unmet
requirements for Level 1 compliance, plus several others.

--tom