<i18n dev> regex rewriting code (part 1 of 3)

Tue Jan 25 11:50:08 PST 2011

Tom,

Given that these POSIX/ASCII-only versions of the properties/constructs
have been there for years ("compatibility"), and that "most" developers
appear to be happy with them (habit, performance...), I don't think we
can, or want to, switch to the Unicode versions simply for conformance.
Java takes compatibility and performance very seriously, especially
compatibility.  Lots of applications out there don't want any surprises
when migrated to a new Java runtime version.  The namespace conflict is
really not a big issue (for me anyway); a possible solution is to have
an "Is" prefix for all Unicode binary properties, for example "IsAlpha"
and "IsLowerCase".  The problem we have here is how to provide
TR#18-compatible versions of those listed properties, if we want to
continue to claim tr#18 Level 1.

The disconnect between \b and \w is a headache; we need to figure out a
solution (to be a better regex; not required by tr18, yet).

-Sherman

On 01/25/2011 10:52 AM, Tom Christiansen wrote:
> Sherman, referring to Java's ASCII-only senses of \w and \s,
> and of \p{alpha} and \p{space}, wrote:
>
>> (does Perl 5 work in this way as well?)
> No, not for a very, very long time.  For most of Perl's life,
> charclass escapes like \w have been Unicode-aware.
>
> However, it did take us some time to separate out the POSIX
> names from the Unicode properties.  As I've mentioned, we
> eventually solved this by prepending "POSIX" to the name of the
> property.  So \p{POSIX_Alpha} gets you exactly what POSIX says--
> which in Perl is locale-aware; I don't believe this is true of
> Java, though.  Whereas \p{Alpha} gets you \p{Alphabetic} as
> RL1.2a requires under both recommendations.
>
> As for \w and such, Perl defines \w, \d, \s, and \b -- and their
> uppercase complements -- to work exactly as the definitions given
> in Annex C of tr18 (for RL1.2a) say they should under the Standard
> Recommendation.
>
> For \d, there is some flexibility in that the POSIX Compatible version
> allows it to match only [0-9] instead of all of \p{Decimal_Number}.
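
To make the two allowable senses of \d concrete, here is a minimal Java
sketch.  It assumes the default java.util.regex behaviour, with no flags
set, in which \d is ASCII-only; the class and method names are
hypothetical and exist only for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class DigitSenses {
    // POSIX Compatible sense: Java's default \d, which matches only [0-9].
    static boolean defaultDigit(String s) {
        return Pattern.matches("\\d", s);
    }

    // Standard Recommendation sense: every character with gc=Decimal_Number.
    static boolean decimalNumber(String s) {
        return Pattern.matches("\\p{Nd}", s);
    }

    public static void main(String[] args) {
        String ascii = "7";
        String arabicIndic = "\u0663"; // ARABIC-INDIC DIGIT THREE, gc=Nd
        System.out.println(defaultDigit(ascii));
        System.out.println(defaultDigit(arabicIndic));
        System.out.println(decimalNumber(ascii));
        System.out.println(decimalNumber(arabicIndic));
    }
}
```

Both senses accept "7"; only the \p{Nd} form accepts the Arabic-Indic
digit.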
>
> To meet the requirements of RL1.2a, one must state whether one is using
> Standard Recommendation or the POSIX Compatible version.  Only for \d
> does Java use either of the allowable senses.  The others all choose
> their own definitions which are out of compliance with RL1.2a.  (And
> Java does not support Annex C's \X at all.  I know that that one
> is on your own personal wish-list, Sherman.)
>
> All of this is what first motivated me to write a drop-in
> replacement that preprocesses Pattern strings to allow them to
> work properly (read: per RL1.2a and others) on Unicode strings.
>
> And it was because of that code that you first became known to
> me, and vice versa.  So I think I should discuss it a bit.
> That's what parts 2 and 3 will be about.
>
>> This is by design and I don't agree "this is a mess" conclusion.
> Sherman, you're right that the mere fact that things like \w and \s,
> or \p{alpha} or \p{space}, do not meet the requirements of RL1.2a
> should not lead one to conclude that "this is a mess."  That would
> be grossly overstating matters.  On its own, that is simply
> non-conformant, not a mess.
>
> What I meant was a mess was the mismatch between \w and \b.
> It is this mismatch that makes possible nonsense results like
> the one I wrote about here:
>
>      One fundamental bug is that Java has misunderstood the connection
>      between \b and \w regexes, so that now a string like "élève" is not
>      matched by the pattern "\b\w+\b" at any point in the string.
>
> It turns out that because of this, Java is out of compliance with
> any of the permissible senses of \b and \w given in tr18.  I will
> demonstrate that in part 3 of this letter, as well as provide code
> demonstrating a remedy.
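
As a rough illustration of what a consistent \w and \b pairing could
look like, the following Java sketch spells both out by hand, using
only general-category properties and lookarounds.  It is not the
rewrite code described in parts 2 and 3.  The names W, B and firstWord
are hypothetical, and W is only an approximation of the Annex C
definition of \w (it uses \p{L} where the annex uses Alphabetic).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: a word class and a word boundary built on the same set.
final class UnicodeWordSketch {
    // Approximate Unicode word character: letters, marks, decimal digits,
    // and connector punctuation (which includes the underscore).
    static final String W = "[\\p{L}\\p{M}\\p{Nd}\\p{Pc}]";

    // A boundary is any position with a word character on exactly one side.
    static final String B =
        "(?:(?<=" + W + ")(?!" + W + ")|(?<!" + W + ")(?=" + W + "))";

    // Equivalent in spirit to \b\w+\b, but with both halves using W.
    static final Pattern WORD = Pattern.compile(B + W + "+" + B);

    // Returns the first word found in the input, or null if there is none.
    static String firstWord(String input) {
        Matcher m = WORD.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // "élève", written with escapes to avoid source-encoding issues.
        System.out.println(firstWord("\u00e9l\u00e8ve"));
    }
}
```

Because the boundary is defined in terms of the same set as the word
class, the whole of "élève" is matched as a single word.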
>
>> While there are developers over there might like these properties to
>> evolve to be the Unicode properties, I am pretty much sure there might
>> be the same amount of developers there would prefer these properties
>> be kept as the "original" POSIX properties.
> My experience suggests that you are indeed correct that there are
> many developers who want one thing and also many who want the other.
> We faced this very thing in Perl, and you wouldn't believe how many
> messages and threads the issue spawned.  There are passionate views
> on both sides of this issue.
>
> The flaw in providing only the ASCII-only definitions as primitives is that
> one cannot derive the full Unicode definitions from them, whereas if one
> had the full Unicode definitions available as primitives, one could trivially
> derive the ASCII-only definitions.  It's a matter of the choice of primitives.
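
The derivation in that direction can be sketched in Java with
character-class intersection (&&).  This is a minimal illustration with
hypothetical names; it uses \p{L} and \p{Nd} as the Unicode primitives
and intersects each with \p{ASCII} to recover the ASCII-only sense.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: ASCII-only classes derived from Unicode ones.
final class AsciiFromUnicode {
    // Unicode letters restricted to ASCII, i.e. [A-Za-z].
    static final Pattern ASCII_LETTERS =
        Pattern.compile("[\\p{L}&&\\p{ASCII}]+");

    // Unicode decimal digits restricted to ASCII, i.e. [0-9].
    static final Pattern ASCII_DIGITS =
        Pattern.compile("[\\p{Nd}&&\\p{ASCII}]+");

    static boolean isAsciiLetters(String s) {
        return ASCII_LETTERS.matcher(s).matches();
    }

    static boolean isAsciiDigits(String s) {
        return ASCII_DIGITS.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAsciiLetters("eleve"));
        System.out.println(isAsciiLetters("\u00e9l\u00e8ve")); // "élève"
        System.out.println(isAsciiDigits("2011"));
        System.out.println(isAsciiDigits("\u0663")); // ARABIC-INDIC DIGIT THREE
    }
}
```

No comparable one-line construction goes the other way: starting from
[A-Za-z] or [0-9], the Unicode sets cannot be recovered.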
>
> Choosing ASCII-only as the bare primitive locks one into the 7-bit past in
> what is even now very much a 21-bit world, and shall be even more so in
> future.  You are sacrificing Unicode by choosing ASCII as the primitive.
>
> But if you chose Unicode as the primitive, you would *not* be
> sacrificing ASCII.  That makes it an unequal tradeoff between
> the two sets of developers.  Favoring ASCII blocks Unicode, but
> favoring Unicode does not block ASCII.  That isn't really fair.
>
> I believe the only just thing is to provide both.  That's the only
> way to make everyone happy.  That's what we finally arrived at in
> Perl, at least, and it has largely worked out.  There are still
> grumblers about "reasonable" defaults, but no one is locked out,
> as they currently are in Java.
>
> However, I understand that there are two separate issues here:
>
>    * One is that you have used Unicode property names to mean
>      something other than what the spec says they should mean.
>
>    * The other is that the charclass aliases are either ASCII-only
>      (\w \s \d) or broken (\b \B).
>
> I have several different ideas about how to fix these in a backwards
> compatible fashion, ideas I can discuss later in a separate letter.
>
> Meanwhile, parts 2 and 3 of the current letter will discuss what
> my rewrite code does and how it satisfies almost all of the unmet
> requirements for Level 1 compliance, plus several others.
>
> --tom