<i18n dev> RL1.4 Simple Word Boundaries

Mon Jan 24 13:41:30 PST 2011

Sherman wrote:

> Regarding RL1.4.(1), the U+200C and  U+2000 are obviously a bug that
> the Java regex failed to update the implementation to sync with the
> tr#18 update, it appears these two don't "exists" in RL1.4/v9,
> neither does RL1.2a, the compatibility properties.

> The words for 1.4(1) actually are actually little confusing for me

Yes, for me, too.

>     The class of <word_character> includes all the Alphabetic values
>     from the Unicode character database, from UnicodeData.txt
>     [UData]...See also Annex C: Compatibility Properties.

> The property "Alphabetic" is defined as Lu + Ll + Lt + Lm + Lo + Nl +
> Other_Alphabetic, or Letter + Nl + Other_Alphabetic, the current java
> regex actually uses Letter + Nd , so if we interpret the "Alphabetic
> values" in 1.4(1) as code point with Alphabetic property, it appears
> we will miss Nd, those digits as a <word_character>...does Perl
> include Nd in latest version, or only Letter + Nl?

Perl uses for its idea of a word (\w) the suggestion in RL1.2a's list 
of compatibility properties, which does include the Decimal_Numbers:

    \p{alpha}
    \p{gc=Mark}
    \p{digit}
    \p{gc=Connector_Punctuation}

(That \p{alpha} is short for \p{Alphabetic}.)  Perl uses this expanded
definition of a word (that is, as a program identifier) in its current
interpretation of \b, per the recommendation in RL1.2a:

    If there is a requirement that \b align with \w, 
    then it would use the approximation above instead.

This is what we do because it is historical and fast.  Given the \w
definition above, that makes a \b in Perl exactly equal to

     (?:(?<=\w)(?!\w)|(?<!\w)(?=\w))

and a \B is exactly equal to 

    (?:(?<=\w)(?=\w)|(?<!\w)(?!\w))

This is not a new definition; it's what \b and \B meant all the way back in
Perl 1.0 (which is where they first appeared, BTW).  It's just the \w that
got updated for Unicode according to the RL1.2a suggestion, and the rest
followed from that implicitly.
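
For anyone who wants to poke at this from the Java side, here is a minimal
sketch of the same construction.  It is only an illustration under stated
assumptions: the class name WordBoundarySketch and its helpers are made up
for this example, and the character class W is my own transcription of the
RL1.2a word list into java.util.regex syntax, assuming a Java release that
accepts \p{IsAlphabetic} (Java 7 or later).  It deliberately does not use
Java's own \w or \b, so it shows the lookaround equivalence and nothing
about what Java's built-in \b does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordBoundarySketch {
    // Assumed transcription of the RL1.2a word list into Java syntax:
    // Alphabetic + gc=Mark + gc=Decimal_Number + gc=Connector_Punctuation.
    static final String W = "[\\p{IsAlphabetic}\\p{M}\\p{Nd}\\p{Pc}]";

    // Boundary: a word character on exactly one side of the position.
    static final Pattern BOUNDARY = Pattern.compile(
        "(?:(?<=" + W + ")(?!" + W + ")|(?<!" + W + ")(?=" + W + "))");

    // Non-boundary: word characters on both sides, or on neither side.
    static final Pattern NON_BOUNDARY = Pattern.compile(
        "(?:(?<=" + W + ")(?=" + W + ")|(?<!" + W + ")(?!" + W + "))");

    // Collects the char offsets at which a zero-width pattern matches.
    static List<Integer> positions(Pattern p, String s) {
        List<Integer> out = new ArrayList<>();
        Matcher m = p.matcher(s);
        while (m.find()) {
            out.add(m.start());
        }
        return out;
    }

    public static void main(String[] args) {
        // "cafe naive" spelled with e-acute and i-diaeresis
        String s = "caf\u00e9 na\u00efve";
        System.out.println(positions(BOUNDARY, s));      // [0, 4, 5, 10]
        System.out.println(positions(NON_BOUNDARY, s));  // [1, 2, 3, 6, 7, 8, 9]
    }
}
```

Every offset from 0 through the string length lands in exactly one of the
two lists, which is the sense in which \B is the complement of \b.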

While this can be enough for Level 1, we are currently looking into
what it takes for Level 2 and Level 3 word boundaries.  This would likely
involve some of the syntax discussed in RL3.12 "Extended Grapheme
Clusters", such as \b{g} for a grapheme boundary and \b{w} for a 
word boundary.

Note that as of 5.12, Perl's \X *does* match an extended grapheme cluster,
not just the legacy grapheme clusters it matched previously.  Unicode added
the extended version since \X came around, and we updated \X.  My rewrite
library comes as close as you can to implementing the extended sense for \X
without having access to Hangul Syllable Type properties.
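
To make concrete what matching by grapheme cluster buys you, here is a
minimal sketch.  It rests on an assumption that does not hold for the Java
release under discussion in this thread: it needs a java.util.regex.Pattern
that documents \X as an extended grapheme cluster, which as far as I know
means Java 9 or later.  The class name GraphemeSketch and its helper are
made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GraphemeSketch {
    // Assumption: a Java 9+ runtime, where Pattern supports the X escape
    // for "any Unicode extended grapheme cluster".
    static final Pattern CLUSTER = Pattern.compile("\\X");

    // Splits a string into the substrings matched by successive clusters.
    static List<String> clusters(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = CLUSTER.matcher(s);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // 'e' followed by U+0301 COMBINING ACUTE ACCENT, then 'a'
        String s = "e\u0301a";
        System.out.println(s.length());          // 3 UTF-16 code units
        System.out.println(clusters(s).size());  // 2 grapheme clusters
    }
}
```

The point is only that the base letter and its combining mark travel
together as one match instead of being split between two.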

> Also, seems like it also "purposely" suggests "from UnicodeData.txt",
> so no Other_Alphabetic, for 1.4(1) and leave those Mn for 1.4(2)?

I *think* that because they use "Alphabetic" with a capital letter 
instead of "alphabetic" with a small one, that they do mean the 
specific Unicode property named "Alphabetic" instead of the general
English word "alphabetic".  

I've looked back at older versions of tr18, and my sense is that this is
one of those places where the English descriptions didn't get completely
updated to match the new layout.  I believe that at some historical
point there was the notion that the Unicode Character Database comprised
only UnicodeData.txt.  Today we think of the UCD as being all the files in
the directory, so also PropList.txt and everything else.

That's the only way I can make sense of this, and it does seem to follow
from the revision dates.

The trouble of property names conflicting between various non-Unicode
"namespaces" is one I'm familiar with.  Perl has gone through some serious
discomfort because of historically conflating POSIX names or builtin Perl
names with Unicode names.  Unicode defines things like space, alpha, etc in
ways that don't line up with the POSIX senses.  The way we've finally
resolved that is by using POSIX_* and Perl_* prefixes as necessary to 
avoid conflict with Unicode names.

In contrast, the ICU UCharacter class tends to do things like
UCharacter.isUUppercase for the Unicode \p{Uppercase} property,
UCharacter.isUWhiteSpace for the Unicode \p{WhiteSpace} property, etc.
That leaves the old non-Unicode senses of Character.isWhitespace etc
intact.  That means you have to do something extra to get the Unicode
versions of the same name; in Perl we decided you had to do something extra
to get the non-Unicode versions.  It really doesn't matter which way you go
so long as both remain accessible.  Contributing factors in deciding which
way to go on this likely include the sometimes conflicting:

    * meeting expectations: the two principles of (1) least
      surprise and of (2) reasonable defaults

    * existing code 

    * backwards compatibility

    * future convenience

On the other hand, for j.l.Character properties one can always use
\p{javaXXX}, so that isn't ambiguous.  (With loose matching, that would also
work with \p{Java_XXX} and \p{Java XXX}.)
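
Here is a small sketch of why keeping both senses reachable matters, using
whitespace as the example.  The assumptions: the class name
WhitespaceSenses and its helpers are made up, \p{javaWhitespace} is the
regex spelling of Character.isWhitespace, and \p{IsWhite_Space} is
available as the Unicode binary property (Java 7 or later, to the best of
my knowledge).

```java
import java.util.regex.Pattern;

final class WhitespaceSenses {
    // The java.lang.Character sense, reached through the "java" prefix.
    static final Pattern JAVA_WS = Pattern.compile("\\p{javaWhitespace}");
    // The Unicode White_Space binary property, reached through the "Is" prefix.
    static final Pattern UNICODE_WS = Pattern.compile("\\p{IsWhite_Space}");

    static boolean javaSense(int cp) {
        return JAVA_WS.matcher(new String(Character.toChars(cp))).matches();
    }

    static boolean unicodeSense(int cp) {
        return UNICODE_WS.matcher(new String(Character.toChars(cp))).matches();
    }

    public static void main(String[] args) {
        // SPACE, NO-BREAK SPACE, and the FILE SEPARATOR control character
        int[] samples = { 0x20, 0xA0, 0x1C };
        for (int cp : samples) {
            System.out.printf("U+%04X java=%b unicode=%b%n",
                              cp, javaSense(cp), unicodeSense(cp));
        }
    }
}
```

The two senses agree on U+0020 but disagree in both directions on the other
two: NO-BREAK SPACE is whitespace only in the Unicode sense, and the FILE
SEPARATOR control is whitespace only in the java.lang.Character sense.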

Regarding RL1.2a, I notice that ICU seems to have gone more in the Perl
direction.  From

    http://icu-project.org/apiref/icu4j/com/ibm/icu/lang/UCharacter.html

    API access for C/POSIX character classes is as follows: 

     - alpha:     isUAlphabetic(c) or hasBinaryProperty(c, UProperty.ALPHABETIC)
     - lower:     isULowercase(c) or hasBinaryProperty(c, UProperty.LOWERCASE)
     - upper:     isUUppercase(c) or hasBinaryProperty(c, UProperty.UPPERCASE)
     - punct:     ((1<<getType(c)) & ((1<<DASH_PUNCTUATION)|(1<<START_PUNCTUATION)|
                  (1<<END_PUNCTUATION)|(1<<CONNECTOR_PUNCTUATION)|(1<<OTHER_PUNCTUATION)|
                  (1<<INITIAL_PUNCTUATION)|(1<<FINAL_PUNCTUATION)))!=0
     - digit:     isDigit(c) or getType(c)==DECIMAL_DIGIT_NUMBER
     - xdigit:    hasBinaryProperty(c, UProperty.POSIX_XDIGIT)
     - alnum:     hasBinaryProperty(c, UProperty.POSIX_ALNUM)
     - space:     isUWhiteSpace(c) or hasBinaryProperty(c, UProperty.WHITE_SPACE)
     - blank:     hasBinaryProperty(c, UProperty.POSIX_BLANK)
     - cntrl:     getType(c)==CONTROL
     - graph:     hasBinaryProperty(c, UProperty.POSIX_GRAPH)
     - print:     hasBinaryProperty(c, UProperty.POSIX_PRINT)
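
The general-category lines in that list (punct, digit, cntrl) can be
approximated without ICU at all.  What follows is a minimal sketch that
swaps in java.lang.Character for ICU's UCharacter, so it is not ICU's
implementation and may track a different Unicode version.  The class name
PosixCompatSketch is made up, and INITIAL_QUOTE_PUNCTUATION and
FINAL_QUOTE_PUNCTUATION are java.lang.Character's names for the Pi and Pf
categories that the ICU list calls INITIAL_PUNCTUATION and
FINAL_PUNCTUATION.

```java
final class PosixCompatSketch {
    // One bit per punctuation general category, in the same shape as
    // the ICU punct formula quoted above.
    static final int PUNCT_MASK =
          (1 << Character.DASH_PUNCTUATION)
        | (1 << Character.START_PUNCTUATION)
        | (1 << Character.END_PUNCTUATION)
        | (1 << Character.CONNECTOR_PUNCTUATION)
        | (1 << Character.OTHER_PUNCTUATION)
        | (1 << Character.INITIAL_QUOTE_PUNCTUATION)
        | (1 << Character.FINAL_QUOTE_PUNCTUATION);

    // punct: the general category is one of the seven P* categories.
    static boolean isPunct(int cp) {
        return ((1 << Character.getType(cp)) & PUNCT_MASK) != 0;
    }

    // digit: the general category is Decimal_Number (Nd).
    static boolean isDigit(int cp) {
        return Character.getType(cp) == Character.DECIMAL_DIGIT_NUMBER;
    }

    // cntrl: the general category is Control (Cc).
    static boolean isCntrl(int cp) {
        return Character.getType(cp) == Character.CONTROL;
    }

    public static void main(String[] args) {
        System.out.println(isPunct('!'));     // true:  Other_Punctuation
        System.out.println(isPunct('$'));     // false: Currency_Symbol
        System.out.println(isDigit(0x0663));  // true:  ARABIC-INDIC DIGIT THREE
        System.out.println(isCntrl('\t'));    // true:  Control
    }
}
```

Note that under this definition "$" is not punct, because symbols sit in
the S* categories rather than the P* ones.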

And here you'll notice that ICU does have *all* the modern
Unicode properties covered:

    http://icu-project.org/apiref/icu4j/com/ibm/icu/lang/UProperty.html
    http://userguide.icu-project.org/strings/properties

Those definitions from ICU UCharacter and UProperty are what I had always
expected j.l.Character itself to do, including full Unicode properties and
POSIX compat properties in j.u.regex.  But that expectation might be
because of my own background and experience.  

You might be able to use some of their approaches
for parsing the UCD into properties.  

> Again, thanks for the long writing, I will go through them in
> details, file corresponding bug/rfe into our database and then
> follow up from there.

I really hope it can be of some help to everyone.  I feel this can make a
big difference to anyone doing modern text processing (by which I mean
Unicode) not just on Java, but on everything else that compiles down to JVM
code that calls the j.l.Character and j.u.r.Pattern classes.

I'm *really* happy that I can bring these sorts of issues to the people
that can make a difference.  I don't mean in the "fix it now" way.  I mean
in the deliberated, long-term way.  Getting them properly registered in the
"bugs&wishes" database with all the right language now will allow them to stand
a chance of someday getting adequate resources allocated toward them for
some future release.

I do have 3-5 different possible ideas for how best to maintain backwards
compatibility yet still provide for maximum usefulness and room for future
expansion.  But we can talk about that part later.

Again Sherman, thank you very very much for taking the time to consider my
overly long writings about getting better Unicode support into Java.

--tom