<i18n dev> Now what?
Tom Christiansen
tchrist at perl.com
Wed Jan 26 06:09:57 PST 2011
Sherman wrote:
> The CR# so far I have are
> 7014645: Support Perl style Unicode hex notation \x{...}
> 7014633: Support loose matching for both abbreviated and longer names of Unicode property
> 7014640: Add meta character for line ending '\R'
> It might take a couple days(?) for these CR# to show up on the website.
So it appears; they aren't there yet. However, I see now that some of the
bugs I submitted last December *did* make it into the database. The first
one shows that they've accepted that the \b vs \w thing is a bug. I can't
see how to fix that without bringing one into alignment with the other, but
maybe there's a way I'm not thinking of.
7006289: java.util.regex yields nonsense by breaking the connection between \b and \w
Category java:classes_util
State 1-Dispatched, bug
Priority: 4-Low
Submit Date 12-DEC-2010
7006291: Java claims to support Unicode properties, but does not
Category java:classes_util
State 1-Dispatched, bug
Priority: 4-Low
Submit Date 12-DEC-2010
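To make the \b versus \w complaint in 7006289 concrete, here is a minimal sketch of a consistency check. It assumes the usual definition that a word boundary sits exactly where one neighbouring character matches \w and the other does not, and it reports every position where the engine's \b disagrees with that. The class and method names are made up for illustration, and the sketch looks at single UTF-16 code units only, so it is not meant for supplementary characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any JDK API.
class BoundaryConsistencySketch {
    private static final Pattern WORD = Pattern.compile("\\w");
    private static final Pattern BOUNDARY = Pattern.compile("\\b");

    // True if the char at index i exists and matches \w on its own.
    static boolean isWordAt(String s, int i) {
        if (i < 0 || i >= s.length()) {
            return false;
        }
        return WORD.matcher(s.substring(i, i + 1)).matches();
    }

    // True if the regex engine reports a \b match exactly at position p.
    static boolean engineBoundaryAt(String s, int p) {
        Matcher m = BOUNDARY.matcher(s);
        return m.find(p) && m.start() == p;
    }

    // Positions where \b disagrees with the boundary derived from \w.
    static List<Integer> inconsistencies(String s) {
        List<Integer> bad = new ArrayList<>();
        for (int p = 0; p <= s.length(); p++) {
            boolean expected = isWordAt(s, p - 1) != isWordAt(s, p);
            if (expected != engineBoundaryAt(s, p)) {
                bad.add(p);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // Pure ASCII input: \b and \w should agree everywhere.
        System.out.println(inconsistencies("hello world"));
        // Non-ASCII input: the result depends on the JDK in use.
        System.out.println(inconsistencies("\u00e9lan vital"));
    }
}
```

On pure ASCII input the list should come back empty. Any position reported for non-ASCII input is a place where the two escapes have come apart, which is the behaviour the bug describes; what gets reported depends on the JDK version.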
> Still need some time to scope/categorize those Unicode properties
> support issues, will post/send you the CR# when I have them and we can
> then discuss what we can do to address those issues going forward.
Perhaps there could be two RFEs, one for implementing the list of
properties required for RL1.2, and the other for implementing the
remaining properties defined in the various UCD *.txt files that
you don't currently consider.
However, I do not know that a partial solution will work well for these.
For one thing, some of the properties you need for the first rely on
other underlying properties. For another, to implement \X in the
required(ish) sense of an Extended Grapheme Cluster rather than a
Legacy Grapheme Cluster, you need access to the properties that come
out of the HangulSyllableType.txt UCD file.
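To show why \X needs that data, here is a rough sketch of only the Hangul part of grapheme cluster segmentation. It hard-codes the Hangul_Syllable_Type ranges as I understand them from HangulSyllableType.txt (a real implementation would load them from the UCD) and applies the three Hangul no-break rules from UAX #29: L is not split from a following L, V, LV or LVT; LV or V is not split from a following V or T; LVT or T is not split from a following T. The class and method names are invented for illustration, and every other cluster rule (combining marks, CR LF, and so on) is deliberately left out.

```java
// Hypothetical sketch: Hangul-only slice of grapheme cluster segmentation.
class HangulClusterSketch {
    enum Hst { L, V, T, LV, LVT, NA }

    // Hangul_Syllable_Type of a code point; ranges hard-coded as an assumption.
    static Hst syllableType(int cp) {
        if ((cp >= 0x1100 && cp <= 0x115F) || (cp >= 0xA960 && cp <= 0xA97C)) {
            return Hst.L;   // leading consonants
        }
        if ((cp >= 0x1160 && cp <= 0x11A7) || (cp >= 0xD7B0 && cp <= 0xD7C6)) {
            return Hst.V;   // vowels
        }
        if ((cp >= 0x11A8 && cp <= 0x11FF) || (cp >= 0xD7CB && cp <= 0xD7FB)) {
            return Hst.T;   // trailing consonants
        }
        if (cp >= 0xAC00 && cp <= 0xD7A3) {
            // Precomposed syllables: every 28th one has no trailing consonant.
            return ((cp - 0xAC00) % 28 == 0) ? Hst.LV : Hst.LVT;
        }
        return Hst.NA;
    }

    // True if the Hangul rules forbid a cluster break between the two code points.
    static boolean noBreakBetween(int before, int after) {
        Hst a = syllableType(before);
        Hst b = syllableType(after);
        switch (a) {
            case L:
                return b == Hst.L || b == Hst.V || b == Hst.LV || b == Hst.LVT;
            case LV:
            case V:
                return b == Hst.V || b == Hst.T;
            case LVT:
            case T:
                return b == Hst.T;
            default:
                return false;
        }
    }

    // Length in chars of the Hangul cluster starting at a valid index 'start'.
    static int clusterLength(String s, int start) {
        int end = start + 1;
        while (end < s.length() && noBreakBetween(s.charAt(end - 1), s.charAt(end))) {
            end++;
        }
        return end - start;
    }

    public static void main(String[] args) {
        // Conjoining jamo L + V + T followed by an ASCII letter.
        System.out.println(clusterLength("\u1100\u1161\u11A8x", 0));
    }
}
```

Without the syllable types there is no way to tell that a run of conjoining jamo forms one user-perceived character, which is why implementing the first set of properties alone does not get you a correct \X.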
If you have access to a "recent" source build of Perl (5.12 or later),
you can see how the logic for \X is carried out during regex execution
by looking at regexec.c from around line 3873 onward, which reads
case CLUMP: /* Match \X: logical Unicode character. This is defined as
Hope this helps.
--tom