<i18n dev> RL1.4 Simple Word Boundaries
Xueming Shen
xueming.shen at oracle.com
Sun Jan 23 22:37:06 PST 2011
Tom,
Thanks for the detailed and excellent "reality check". While I'm still
going through all the details, it appears that many of the issues raised
stem from the fact that the current Java Unicode property data does not
include the properties defined in PropList.txt (the current
implementation reads property data only from UnicodeData.txt, Scripts,
Blocks and SpecialCasing.txt). This means the property data for
Other_Alphabetic/Lowercase/Uppercase and White_Space is not available to
j.u.regex and j.l.Character. j.u.regex currently tries to use the
"closest" possible set for alphabetic and lower/uppercase.
I will file an RFE to track this issue.
Regarding RL1.4(1), the missing U+200C and U+200D are obviously a bug:
the Java regex implementation was never updated to stay in sync with the
tr#18 update. It appears these two don't "exist" in RL1.4 as of v9, and
neither does RL1.2a, the compatibility properties.
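As a rough illustration of why these two code points fall outside a
letter-or-digit style test, here is a minimal sketch that uses only
java.lang.Character. The class and method names are made up for this
example, and the "underscore, letters, digits" test is an assumption
based on the Pattern source comment quoted in Tom's message, not a copy
of the real implementation.

```java
// Hypothetical demo class; the names are illustrative, not from the JDK.
final class JoinControlCheck {

    // Assumed approximation of the word test described in the Pattern
    // source comment: underscore, letters, and digits.
    static boolean isLetterDigitOrUnderscore(int cp) {
        return cp == '_' || Character.isLetterOrDigit(cp);
    }

    // RL1.4(1) names these two join controls explicitly.
    static boolean isJoinControl(int cp) {
        return cp == 0x200C || cp == 0x200D;
    }

    public static void main(String[] args) {
        int[] samples = { 0x200C, 0x200D };
        for (int cp : samples) {
            // Both code points have general category Cf (format), so a
            // letter-or-digit test does not accept them.
            System.out.printf("U+%04X format=%b letterDigitOrUnderscore=%b%n",
                    cp,
                    Character.getType(cp) == Character.FORMAT,
                    isLetterDigitOrUnderscore(cp));
        }
    }
}
```

Any word test built only from general categories would need an explicit
check like isJoinControl to cover these two.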
The wording of 1.4(1) is actually a little confusing to me:
    The class of <word_character> includes all the Alphabetic
    values from the Unicode character database, from
    UnicodeData.txt [UData]...See also Annex C: Compatibility Properties.
The property "Alphabetic" is defined as Lu + Ll + Lt + Lm + Lo + Nl +
Other_Alphabetic, or Letter + Nl + Other_Alphabetic. The current Java
regex actually uses Letter + Nd, so if we interpret the "Alphabetic
values" in 1.4(1) as code points with the Alphabetic property, it appears
we will miss Nd, the digits, as a <word_character>. Does Perl include Nd
in its latest version, or only Letter + Nl?
Also, it seems to "purposely" say "from UnicodeData.txt", so no
Other_Alphabetic for 1.4(1), leaving those Mn for 1.4(2)?
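To make the difference between the two sets concrete, here is a small
sketch comparing "Letter + Nd" against the Alphabetic property for a few
sample code points. It assumes a release where Character.isAlphabetic is
available (Java 7 or later); the class name and the choice of samples are
only for illustration.

```java
// Hypothetical comparison class; names are illustrative only.
final class WordClassCompare {

    // The set described above as what the regex code uses: Letter + Nd.
    static boolean isLetterOrNd(int cp) {
        return Character.isLetter(cp)
                || Character.getType(cp) == Character.DECIMAL_DIGIT_NUMBER;
    }

    // The Alphabetic property: Letter + Nl + Other_Alphabetic.
    // Character.isAlphabetic requires Java 7 or later.
    static boolean isAlphabetic(int cp) {
        return Character.isAlphabetic(cp);
    }

    public static void main(String[] args) {
        int[] samples = {
            'A',     // Lu
            '5',     // Nd
            0x2160,  // Nl, ROMAN NUMERAL ONE
            0x0345   // Mn with Other_Alphabetic, COMBINING GREEK YPOGEGRAMMENI
        };
        for (int cp : samples) {
            System.out.printf("U+%04X letterOrNd=%b alphabetic=%b%n",
                    cp, isLetterOrNd(cp), isAlphabetic(cp));
        }
    }
}
```

The digit is in the first set only, while the Nl and Other_Alphabetic
samples are in the second set only, which is the gap discussed above.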
Again, thanks for the long write-up. I will go through it in detail,
file the corresponding bugs/RFEs in our database, and then follow up
from there.
-Sherman
On 1-23-2011 11:44 AM, Tom Christiansen wrote:
> Java does not meet this requirement. Specifically, it
> does not offer a mechanism for stipulation #1 cited below:
>
> RL1.4 Simple Word Boundaries
>
> To meet this requirement, an implementation shall extend the
> word boundary mechanism so that:
>
> (1) The class of <word_character> includes all the Alphabetic
> values from the Unicode character database, from
> UnicodeData.txt [UData], plus the U+200C ZERO WIDTH NON-
> JOINER and U+200D ZERO WIDTH JOINER. See also Annex C:
> Compatibility Properties.
>
> (2) Nonspacing marks are never divided from their base
> characters, and otherwise ignored in locating boundaries.
>
> What Java *does* do is rather underdocumented, for you cannot learn
> what it does by reading
>
> http://download.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
>
> alone. That says only
>
> \b A word boundary
> \B A non-word boundary
>
> and
>
> The string literal "\b", for example, matches a single
> backspace character when interpreted as a regular expression,
> while "\\b" matches a word boundary.
>
> \b has historically been defined in terms of \w, and indeed, in the text
> above one is given no reason to believe this means anything
> other than Java's word class, \w. And in RL1.2a we showed that Java does
> not correctly implement \w according to those requirements.
>
> However, the situation is much more complicated than that.
>
> From jdk1.7.0/java/util/regex/Pattern.html we learn that:
>
> /**
> * Handles word boundaries. Includes a field to allow this one class to
> * deal with the different types of word boundaries we can match. The word
> * characters include underscores, letters, and digits. Non spacing marks
> * can are also part of a word if they have a base character, otherwise
> * they are ignored for purposes of finding word boundaries.
> */
>
> This is a problem for multiple reasons.
>
> A. It again mistakenly uses \pL when the standard requires that
> the true \p{Alphabetic} property be used instead.
>
> B. It neglects two of the specifically required characters:
>
> U+200C ZERO WIDTH NON-JOINER
> U+200D ZERO WIDTH JOINER
>
> C. It breaks the historical connection between word characters
> and word boundaries.
>
> The first two of those three both disqualify Java from meeting RL1.4.
>
> A very strict reading of RL1.2a suggests that (C) does not necessarily
> disqualify Java from compliance, because it appears that the standard
> does not strictly *require* that \b be defined in terms of \w.
>
> However, Java has broken this connection for no sound reason. Java does
> not implement RL1.3, and it does not implement the more sophisticated
> boundaries discussed in Level 2--which are a superset of \w dependency
> more so than an unlike alternative. In fact, not only does it not
> implement them, it does not allow the user to craft his own because Java
> does not support the Word_Break property which is indispensable for such
> higher-level constructs.
>
> Furthermore, it makes Java's regexes completely incompatible with anyone
> else's I have ever heard of. I know of no other language than Java
> that permits the oxymoronic failure of the pattern \b\w+\b when there
> are word characters in a string such as "élève". Java fails to allow
> that pattern to match anywhere whatsoever, let alone the entire string.
>
> Finally, I believe there is no reading of tr18 that permits this completely
> counter-intuitive failure to match. Even if \b is implemented in a way
> that does not depend directly on \w, all other ways that the standard
> mentions *DO* properly allow "élève" to be matched by the \b\w+\b pattern.
>
> I believe this is a terribly unfortunate bug that must be fixed for Java's
> regular expressions to be useful, to work the way people expect them to work,
> and indeed to meet any possible reading of tr18.
>
> --tom
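For anyone who wants to check the \b\w+\b case on their own JDK, here is
a minimal sketch. The class and method names are made up for this
example. The default-mode result is only printed, since it depends on how
the running release defines \w and \b. The second method assumes Java 7
or later, where the Pattern.UNICODE_CHARACTER_CLASS flag exists and makes
both \w and \b use Unicode properties.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
final class WordBoundaryDemo {

    // "élève", written with escapes so source encoding does not matter.
    static final String WORD = "\u00e9l\u00e8ve";

    // Default mode: the outcome depends on the JDK release in use.
    static boolean findsDefault(String s) {
        return Pattern.compile("\\b\\w+\\b").matcher(s).find();
    }

    // Unicode mode (Java 7+): \w and \b are both based on Unicode
    // properties, so the whole string is expected to match.
    static boolean matchesWholeUnicode(String s) {
        return Pattern.compile("\\b\\w+\\b", Pattern.UNICODE_CHARACTER_CLASS)
                .matcher(s)
                .matches();
    }

    public static void main(String[] args) {
        System.out.println("default find:    " + findsDefault(WORD));
        System.out.println("unicode matches: " + matchesWholeUnicode(WORD));
    }
}
```

Comparing the two printed values on a given release shows whether the
default \b and \w agree with each other for non-ASCII letters.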