Codereview Request: 7039066 j.u.regex does not match TR#18 RL1.4 Simple Word Boundaries and RL1.2 Properties

23 Apr 2011

Hi,

This proposal tries to address

(1) j.u.regex does not meet the Simple Word Boundaries requirement [1] of
Unicode regex (TR#18), as Tom pointed out in his email on the i18n-dev
list [2]. Basically we have three problems here.

     a. The j.u.regex word boundary constructs \b and \B use Unicode
         \p{letter} + \p{digit} as the "word" definition, when the standard
         requires that the true Unicode \p{Alphabetic} property be used
         instead. They also neglect two of the specifically required
         characters:
         U+200C ZERO WIDTH NON-JOINER
         U+200D ZERO WIDTH JOINER
         (or the "word" could be \p{alphabetic} + \p{gc=Mark} + \p{digit} +
         \p{gc=Connector_Punctuation}, if we follow Annex C).
     b. j.u.regex's word constructs \w and \W are ASCII-only versions.
     c. It breaks the historical connection between word characters and
         word boundaries (because of a) and b)). For example, "élève" is
         NOT matched by the \b\w+\b pattern.

(2) j.u.regex does not meet the Properties requirement of Unicode regex
[3][5][6][7]. The main issues are

     a. Alphabetic: totally missing from the platform, not only from regex.
     b. Lowercase, Uppercase and White_Space: the Java implementation (via
         \p{javaMethod}) differs from the Unicode Standard definition.
     c. j.u.regex's POSIX character classes are ASCII-only, when the
         standard has a Unicode version defined in TR#18 Annex C [3].
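
The gaps listed in (1) and (2) can be observed from plain Java. The
following is a minimal sketch, not part of the webrev: the class name
DefaultRegexGaps and its method names are made up for illustration, and the
non-ASCII characters are written as Unicode escapes so the result does not
depend on the source encoding. It relies only on default (flag-free)
j.u.regex behaviour and on j.l.Character.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the proposed change.
class DefaultRegexGaps {
    // The word "eleve" with its two accented letters, written as escapes.
    static final String ELEVE = "\u00e9l\u00e8ve";

    // (1) b/c: \w is ASCII-only by default, so \b\w+\b cannot cover the
    // whole word once it contains a non-ASCII letter.
    static boolean wholeWordMatches(String s) {
        return Pattern.matches("\\b\\w+\\b", s);
    }

    // (2) c: the POSIX class \p{Lower} is ASCII-only by default.
    static boolean posixLower(String s) {
        return Pattern.matches("\\p{Lower}", s);
    }

    // (2) b: \p{javaWhitespace} delegates to Character.isWhitespace, which
    // excludes NO-BREAK SPACE although Unicode White_Space includes it.
    static boolean javaWhitespace(String s) {
        return Pattern.matches("\\p{javaWhitespace}", s);
    }

    public static void main(String[] args) {
        System.out.println(wholeWordMatches(ELEVE));          // false
        System.out.println(Character.isLetter('\u00e9'));     // true
        System.out.println(posixLower("\u00e9"));             // false
        System.out.println(Character.isLowerCase('\u00e9'));  // true
        System.out.println(javaWhitespace("\u00a0"));         // false
    }
}
```

The regex constructs reject characters that j.l.Character (or the Unicode
Standard) classifies as letters, lowercase letters or white space, which is
the inconsistency described above.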

As the solution, I propose to

(1) add a flag UNICODE_UNICODE to
     a) flip the ASCII-only predefined character classes (\b \B \w \W \d
         \D \s \S) and POSIX character classes (\p{alpha}, \p{lower},
         \p{upper}...) to their Unicode versions
     b) enable UNICODE_CASE (anything Unicode)

     Ideally we would like to just evolve/upgrade Java regex from the
     aged "ASCII-only" to Unicode (maybe adding an OLD_ASCII_ONLY_POSIX
     flag as a fallback :-)), like what Perl did. But given Java's
     "compatibility" spirit (and the performance concern as well), this
     is unlikely to happen.

(2) add \p{IsBinaryProperty} to explicitly support some important Unicode
     binary properties, such as \p{IsAlphabetic}, \p{IsIdeographic},
     \p{IsPunctuation}... With this, j.u.regex can easily access some
     properties that are either not provided by j.l.Character directly or
     for which j.l.Character has a different version (for example,
     White_Space).
     (The missing Alphabetic and the different Uppercase/Lowercase issues
     have been/are being addressed at Cr#7037261 [4], any reviewer?)
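
To make the two proposals concrete, here is a hedged sketch of how they
could be exercised. It assumes a JDK whose Pattern class already ships this
behaviour. The constant Pattern.UNICODE_CHARACTER_CLASS is the name found in
released JDKs (7 and later); it is used here on the assumption that it
corresponds to the UNICODE_UNICODE flag proposed above. The class and method
names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are for illustration only.
class UnicodeRegexSketch {
    // Assumption: this released-JDK constant plays the role of the
    // flag proposed in (1).
    static final int UNICODE = Pattern.UNICODE_CHARACTER_CLASS;

    // (1) a: predefined and POSIX classes switch to their Unicode versions.
    static boolean matchesWithFlag(String regex, String input) {
        return Pattern.compile(regex, UNICODE).matcher(input).matches();
    }

    // (1) b: the flag also implies UNICODE_CASE, which matters once
    // CASE_INSENSITIVE is requested.
    static boolean caseInsensitiveMatch(String regex, String input, int extraFlags) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE | extraFlags)
                      .matcher(input).matches();
    }

    // (2): binary properties are reachable as \p{IsName} without any flag.
    static boolean hasProperty(String property, int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return Pattern.matches("\\p{Is" + property + "}", s);
    }

    public static void main(String[] args) {
        String eleve = "\u00e9l\u00e8ve";
        System.out.println(matchesWithFlag("\\b\\w+\\b", eleve));                   // true
        System.out.println(matchesWithFlag("\\p{Lower}", "\u00e9"));                // true
        System.out.println(matchesWithFlag("\\d", "\u0663"));                       // true
        System.out.println(caseInsensitiveMatch("\u00e9", "\u00c9", 0));            // false
        System.out.println(caseInsensitiveMatch("\u00e9", "\u00c9", UNICODE));      // true
        System.out.println(hasProperty("Alphabetic", 0x00E9));                      // true
        System.out.println(hasProperty("White_Space", 0x00A0));                     // true
    }
}
```

Keeping the Unicode behaviour behind an explicit flag leaves existing
patterns untouched, while the \p{Is...} properties stay available to
patterns that do not opt in.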

The webrev is at
http://cr.openjdk.java.net/~sherman/7039066/webrev/

The corresponding updated j.u.regex.Pattern API doc is at
http://cr.openjdk.java.net/~sherman/7039066/Pattern.html

Specdiff result is at
http://cr.openjdk.java.net/~sherman/7039066/specdiff/diff.html

I will file the CCC request if the API change proposal in the webrev is
approved. This is coming in very late, so it may be held back until
Java 8 if it cannot make the cutoff for jdk7.

-Sherman

[1] http://www.unicode.org/reports/tr18/#Simple_Word_Boundaries
[2] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000256.html
[3] http://www.unicode.org/reports/tr18/#Compatibility_Properties
[4] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-April/000370.html
[5] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000249.html
[6] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000253.html
[7] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000254.html

Xueming Shen