<i18n dev> Now what?

Tom Christiansen tchrist at perl.com
Wed Jan 26 06:09:57 PST 2011


Sherman wrote:

> The CR# so far I have are

> 7014645: Support Perl style Unicode hex notation \x{...}
> 7014633: Support loose matching for both abbreviated and longer names of Unicode property
> 7014640: Add meta character for line ending '\R'

> It might take a couple days(?) for these CR# to show up on the website.

So it appears; they aren't there yet.  However, I see now that some of the
bugs I submitted last December *did* make it into the database.  The first
one shows that they've accepted that the \b vs \w thing is a bug.  I can't 
see how to fix that without bringing one into alignment with the other, but
maybe there's a way I'm not thinking of.
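
For anyone who wants to see the mismatch for themselves, here is a small
sketch.  It assumes a JDK whose Pattern class has the UNICODE_CHARACTER_CLASS
flag (Java 7 or later); the class and helper names are made up for
illustration.  It deliberately prints results instead of stating them,
since what the default \b does with a non-ASCII letter depends on the
release you run it on.

```java
import java.util.regex.Pattern;

public class WordBoundaryDemo {
    // True if the regex matches anywhere in the input under the given flags.
    // (Hypothetical helper, not part of java.util.regex.)
    static boolean found(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String s = "\u00e9";  // LATIN SMALL LETTER E WITH ACUTE, a letter outside ASCII

        // Default mode: \w is ASCII-only, so compare it with what \b reports.
        System.out.println("\\w default: " + found("\\w", 0, s));
        System.out.println("\\b default: " + found("\\b", 0, s));

        // With the flag (assumed available), both are meant to use Unicode definitions.
        int u = Pattern.UNICODE_CHARACTER_CLASS;
        System.out.println("\\w unicode: " + found("\\w", u, s));
        System.out.println("\\b unicode: " + found("\\b", u, s));
    }
}
```

If the two "default" lines disagree, that is the broken connection the
bug title is talking about: a boundary is reported next to a character
that \w itself refuses to match.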

7006289: java.util.regex yields nonsense by breaking the connection between \b and \w

    Category    java:classes_util
    State       1-Dispatched, bug 
    Priority:   4-Low 
    Submit Date 12-DEC-2010

7006291: Java claims to support Unicode properties, but does not 

    Category    java:classes_util
    State       1-Dispatched, bug
    Priority:   4-Low
    Submit Date 12-DEC-2010

> Still need some time to scope/categorize those Unicode properties
> support issues, will post/send you the CR# when I have them and we can
> then discuss what we can do to address those issues going forward.

Perhaps there could be two RFEs, one for implementing the list of 
properties required for RL1.2, and the other for implementing the
remaining properties defined in the various UCD *.txt files that
you don't currently consider.

However, I do not know that a partial solution will work well for these.

For one thing, some of the properties you need for the first rely on
other underlying properties.  For another, to implement \X in the
required(ish) sense of an Extended Grapheme Cluster rather than a
Legacy Grapheme Cluster, you need access to the properties that come
out of the HangulSyllableType.txt UCD file.
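
To make the \X point concrete, here is a rough sketch of the kind of check
I have in mind.  It assumes a java.util.regex that accepts \X at all
(Java 9 or later, as far as I know; earlier releases reject the escape),
and the class and method names are hypothetical.  It counts how many
clusters \X finds in a base letter plus combining mark, and in a run of
conjoining Hangul jamo, which is where the Hangul syllable-type data
comes into play.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GraphemeDemo {
    // Return the successive substrings that \X matches in the input.
    // (Hypothetical helper; assumes the running JDK supports \X.)
    static List<String> clusters(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("\\X").matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // "e" followed by COMBINING ACUTE ACCENT: two code points, one user-visible character.
        String combining = "e\u0301";
        // Leading consonant, vowel, and trailing consonant jamo (L, V, T).
        String jamo = "\u1100\u1161\u11a8";

        System.out.println("combining clusters: " + clusters(combining).size());
        System.out.println("jamo clusters:      " + clusters(jamo).size());
    }
}
```

An implementation that follows the grapheme cluster rules should report a
single cluster for each string; one that lacks the Hangul data would split
the jamo sequence apart.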

If you have access to a "recent" source build of Perl (5.12 or better),
you can see how the logic for \X is carried out during regex execution
by reading regexec.c from around line 3873 onward, which begins

        case CLUMP: /* Match \X: logical Unicode character.  This is defined as

Hope this helps.

--tom

