<i18n dev> RL1.4 Simple Word Boundaries

Tom Christiansen tchrist at perl.com
Sun Jan 23 11:44:48 PST 2011


Java does not meet this requirement.  Specifically, it 
does not offer a mechanism for stipulation #1 cited below:

    RL1.4       Simple Word Boundaries

    To meet this requirement, an implementation shall extend the
    word boundary mechanism so that:

    (1) The class of <word_character> includes all the Alphabetic
        values from the Unicode character database, from
        UnicodeData.txt [UData], plus the U+200C ZERO WIDTH NON-
        JOINER and U+200D ZERO WIDTH JOINER. See also Annex C:
        Compatibility Properties.

    (2) Nonspacing marks are never divided from their base
        characters, and otherwise ignored in locating boundaries.
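
As a rough sketch of what stipulation (1) asks for, and nothing more, the
word-character test could be written as below.  The class and method names
are hypothetical, the code relies on Character.isAlphabetic (available from
Java 7 on), and it makes no attempt at stipulation (2).

```java
// Sketch of RL1.4 stipulation (1) only; names here are hypothetical.
final class Rl14WordChars {
    private Rl14WordChars() {}

    /** True if the code point is a <word_character> per stipulation (1) as quoted. */
    static boolean isWordCharacter(int codePoint) {
        return Character.isAlphabetic(codePoint)   // Unicode Alphabetic property
            || codePoint == 0x200C                 // ZERO WIDTH NON-JOINER
            || codePoint == 0x200D;                // ZERO WIDTH JOINER
    }

    public static void main(String[] args) {
        // A few sample code points: two letters, the two joiners, and a space.
        int[] samples = { 'e', 0x00E9, 0x200C, 0x200D, ' ' };
        for (int cp : samples) {
            System.out.printf("U+%04X -> %b%n", cp, isWordCharacter(cp));
        }
    }
}
```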

What Java *does* do is rather underdocumented, for you cannot learn 
what it does by reading 

    http://download.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html

alone.  That says only 

    \b      A word boundary
    \B      A non-word boundary

and

    The string literal "\b", for example, matches a single
    backspace character when interpreted as a regular expression,
    while "\\b" matches a word boundary.
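
The distinction that passage draws can be checked with a small sketch like
the following.  The class and helper names are made up for illustration;
only java.util.regex.Pattern itself is assumed.

```java
import java.util.regex.Pattern;

// Illustrative only: contrasts the Java string literals "\b" and "\\b".
final class BackslashB {
    private BackslashB() {}

    /** True if the regex matches the entire input. */
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    /** True if the regex matches somewhere in the input. */
    static boolean findsAnywhere(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // "\b" is a one-character string holding U+0008 BACKSPACE.
        System.out.println(matchesWhole("\b", "\b"));
        // "\\b" reaches the regex compiler as \b, the zero-width boundary.
        System.out.println(findsAnywhere("\\b", "a"));
    }
}
```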

\b has historically been defined in terms of \w, and indeed, the text
above gives one no reason to believe that it means anything other than
a boundary defined by Java's word class, \w.  And in RL1.2a we showed
that Java does not correctly implement \w according to those requirements.

However, the situation is much more complicated than that.

From the source in jdk1.7.0/java/util/regex/Pattern.java we learn that:

    /**
     * Handles word boundaries. Includes a field to allow this one class to
     * deal with the different types of word boundaries we can match. The word
     * characters include underscores, letters, and digits. Non spacing marks
     * can are also part of a word if they have a base character, otherwise
     * they are ignored for purposes of finding word boundaries.
     */

This is a problem for multiple reasons.  

 A. It again mistakenly uses \pL when the standard requires that
    the true \p{Alphabetic} property be used instead.
 
 B. It neglects two of the specifically required characters:

	U+200C ZERO WIDTH NON-JOINER 
	U+200D ZERO WIDTH JOINER

 C. It breaks the historical connection between word characters 
    and word boundaries.
    
The first two of those three each disqualify Java from meeting RL1.4.
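
To make (A) and (B) concrete, here is a small probe comparing a "letters,
digits, and underscores" test against the Alphabetic property.  It is a
sketch under one loud assumption: that the comment's "letters and digits"
corresponds to Character.isLetterOrDigit.  The class and method names are
hypothetical.

```java
// Probe only; assumes "letters and digits" means Character.isLetterOrDigit.
final class LetterVsAlphabetic {
    private LetterVsAlphabetic() {}

    /** The word test as the quoted source comment describes it (assumed). */
    static boolean letterDigitOrUnderscore(int cp) {
        return cp == '_' || Character.isLetterOrDigit(cp);
    }

    /** The Unicode Alphabetic property, as exposed by Java 7 and later. */
    static boolean alphabetic(int cp) {
        return Character.isAlphabetic(cp);
    }

    public static void main(String[] args) {
        int[] samples = {
            0x00E9,  // LATIN SMALL LETTER E WITH ACUTE
            0x2160,  // ROMAN NUMERAL ONE: a letter number (Nl), not a letter
            0x200C,  // ZERO WIDTH NON-JOINER
            0x200D   // ZERO WIDTH JOINER
        };
        for (int cp : samples) {
            System.out.printf("U+%04X letterDigitOrUnderscore=%b alphabetic=%b%n",
                    cp, letterDigitOrUnderscore(cp), alphabetic(cp));
        }
    }
}
```

The two joiners fail both tests, which is why RL1.4 has to name them
explicitly in addition to Alphabetic.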

A very strict reading of RL1.2a suggests that (C) does not necessarily
disqualify Java from compliance, because it appears that the standard 
does not strictly *require* that \b be defined in terms of \w.

However, Java has broken this connection for no sound reason.  Java does
not implement RL1.4, and it does not implement the more sophisticated
boundaries discussed in Level 2, which extend the \w-based definition
more than they replace it with something unrelated.  In fact, not only
does it not implement them, it does not allow the user to craft his own,
because Java does not support the Word_Break property, which is
indispensable for such higher-level constructs.

Furthermore, it makes Java's regexes incompatible with those of every
other language I have ever heard of.  I know of no other language than Java
that permits the oxymoronic failure of the pattern \b\w+\b when there
are word characters in a string such as "élève".  Java fails to allow
that pattern to match anywhere whatsoever, let alone the entire string.
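
Anyone wishing to check this against a particular JDK could use a probe
along these lines.  The class and method names are hypothetical, and the
result for the accented string is deliberately not stated in the code,
since it depends on how the JDK being run defines \b and \w.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Probe only: reports what the running JDK does with \b\w+\b.
final class WordBoundaryProbe {
    private WordBoundaryProbe() {}

    /** Returns the first substring matched by \b\w+\b, or null if there is none. */
    static String firstWord(String input) {
        Matcher m = Pattern.compile("\\b\\w+\\b").matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Plain ASCII control case.
        System.out.println("eleve -> " + firstWord("eleve"));
        // The example from the text, with escapes to avoid source-encoding trouble.
        System.out.println("\u00e9l\u00e8ve -> " + firstWord("\u00e9l\u00e8ve"));
    }
}
```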

Finally, I believe there is no reading of tr18 that permits this completely
counter-intuitive failure to match.  Even if \b is implemented in a way
that does not depend directly on \w, all other ways that the standard
mentions *DO* properly allow "élève" to be matched by the \b\w+\b pattern.

I believe this is a terribly unfortunate bug that must be fixed for Java's
regular expressions to be useful, to work the way people expect them to work,
and indeed to meet any possible reading of tr18.

--tom

