Codereview Request: 7039066 j.u.regex does not match TR#18 RL1.4 Simple Word Boundaries and RL1.2 Properties
Xueming Shen
xueming.shen at oracle.com
Sat Apr 23 08:00:08 UTC 2011
Hi
This proposal tries to address two issues:
(1) j.u.regex does not meet the Simple Word Boundaries [1] requirement
of Unicode regex, as Tom pointed out in his email on the i18n-dev
list [2]. Basically we have 3 problems here.
a. The j.u.regex word boundary constructs \b and \B use Unicode
\p{letter} + \p{digit} as the "word" definition, when the standard
requires that the true Unicode \p{Alphabetic} property be used instead.
They also neglect two of the specifically required characters:
U+200C ZERO WIDTH NON-JOINER
U+200D ZERO WIDTH JOINER
(or the "word" could be \p{alphabetic} + \p{gc=Mark} + \p{digit}
+ \p{gc=Connector_Punctuation}, if we follow Annex C).
b. j.u.regex's word constructs \w and \W are ASCII-only.
c. Because of a) and b), the historical connection between word
characters and word boundaries is broken. For example, "élève" is
NOT matched by the pattern \b\w+\b.
(2) j.u.regex does not meet the Properties requirement of Unicode regex
[3][5][6][7]. The main issues are:
a. Alphabetic: totally missing from the platform, not only from regex.
b. Lowercase, Uppercase and White_Space: the Java implementation (via
\p{javaMethod}) differs from the Unicode Standard definition.
c. j.u.regex's POSIX character classes are ASCII-only, whereas the
standard defines a Unicode version in tr#18 Annex C [3].
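To make the problems above concrete, here is a small sketch that exercises the default (no flags) behavior. The class and method names are made up for illustration; only java.util.regex.Pattern is real API. It uses matches() on the whole input, since that result does not depend on how \b itself is defined.

```java
import java.util.regex.Pattern;

class DefaultModeGaps {
    // The word "eleve" with its two accented letters, written with
    // Unicode escapes so the source file encoding does not matter.
    static final String ELEVE = "\u00e9l\u00e8ve";

    // True if the whole input is matched by \b\w+\b with no flags set.
    static boolean wholeWordDefault(String s) {
        return Pattern.compile("\\b\\w+\\b").matcher(s).matches();
    }

    // True if the whole input is matched by the given class with no flags set.
    static boolean inClassDefault(String charClass, String s) {
        return Pattern.compile(charClass).matcher(s).matches();
    }

    public static void main(String[] args) {
        // Problem (1): \w is ASCII-only, so an accented word is not a "word".
        System.out.println(wholeWordDefault("eleve")); // true
        System.out.println(wholeWordDefault(ELEVE));   // false

        // Problem (2)c: the POSIX class \p{Lower} is ASCII-only.
        System.out.println(inClassDefault("\\p{Lower}", "\u00e9")); // false

        // Problem (2)b: Character.isWhitespace excludes NO-BREAK SPACE,
        // which has the Unicode White_Space property.
        System.out.println(inClassDefault("\\p{javaWhitespace}", "\u00a0")); // false
    }
}
```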
As the solution, I propose to
(1) add a flag UNICODE_UNICODE to
a) flip the ASCII-only predefined character classes (\b \B \w \W \d
\D \s \S) and POSIX character classes (\p{alpha}, \p{lower},
\p{upper}...) to their Unicode versions
b) enable UNICODE_CASE (anything Unicode)
Ideally we would just evolve/upgrade Java regex from the aged
"ascii-only" behavior to Unicode (maybe adding an OLD_ASCII_ONLY_POSIX
flag as a fallback :-)), as Perl did. But given Java's "compatibility"
spirit (and the performance concern as well), this is unlikely to
happen.
(2) add \p{IsBinaryProperty} to explicitly support some important
Unicode binary properties, such as \p{IsAlphabetic},
\p{IsIdeographic}, \p{IsPunctuation}... With this, j.u.regex can
easily access properties that are either not provided by
j.l.Character directly or for which j.l.Character has a different
version (for example White_Space).
(The missing Alphabetic property and the differing
uppercase/lowercase definitions have been/are being addressed in
Cr#7037261 [4]; any reviewer?)
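As a rough illustration of how the two proposed pieces would look from user code, the sketch below uses Pattern.UNICODE_CHARACTER_CLASS, which is the flag name in the released java.util.regex.Pattern API (JDK 7 and later). Treating it as the equivalent of the UNICODE_UNICODE flag proposed here is an assumption on my part. The class and helper names are made up for illustration.

```java
import java.util.regex.Pattern;

class UnicodeModeSketch {
    // True if the whole input is matched by the regex under the given flags.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Assumption: this released flag plays the role of the proposed UNICODE_UNICODE.
        int unicode = Pattern.UNICODE_CHARACTER_CLASS;
        String eleve = "\u00e9l\u00e8ve"; // the accented word from the example above

        // (1) The flag flips the predefined and POSIX classes to Unicode.
        System.out.println(matches("\\b\\w+\\b", 0, eleve));              // false
        System.out.println(matches("\\b\\w+\\b", unicode, eleve));        // true
        System.out.println(matches("\\d", unicode, "\u0663"));            // true, ARABIC-INDIC DIGIT THREE
        System.out.println(matches("\\p{Lower}", unicode, "\u00e9"));     // true

        // (2) Binary properties are available without any flag.
        // U+2160 ROMAN NUMERAL ONE is Alphabetic but not a letter (gc=Nl).
        System.out.println(matches("\\p{IsAlphabetic}", 0, "\u2160"));    // true
        System.out.println(matches("\\p{L}", 0, "\u2160"));               // false
        // U+00A0 NO-BREAK SPACE has White_Space but fails Character.isWhitespace.
        System.out.println(matches("\\p{IsWhite_Space}", 0, "\u00a0"));   // true
    }
}
```

The flag can also be turned on inside the pattern itself with the embedded flag expression (?U) in the released API.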
The webrev is at
http://cr.openjdk.java.net/~sherman/7039066/webrev/
The corresponding updated j.u.regex.Pattern API doc is at
http://cr.openjdk.java.net/~sherman/7039066/Pattern.html
Specdiff result is at
http://cr.openjdk.java.net/~sherman/7039066/specdiff/diff.html
I will file the CCC request if the API change proposal in the webrev is
approved. This is coming in very late, so it may be held back until
Java 8 if it cannot make the cutoff for jdk7.
-Sherman
[1] http://www.unicode.org/reports/tr18/#Simple_Word_Boundaries
[2] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000256.html
[3] http://www.unicode.org/reports/tr18/#Compatibility_Properties
[4] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-April/000370.html
[5] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000249.html
[6] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000253.html
[7] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000254.html