Codereview Request: 7039066 j.u.regex does not match TR#18 RL1.4 Simple Word Boundaries and RL1.2 Properties
Xueming Shen
xueming.shen at oracle.com
Sat Apr 23 08:12:10 UTC 2011
The flag this request proposes to add is
UNICODE_CHARSET
not the "UNICODE_UNICODE" in my last email.
My apologies for the typo.
Any suggestions for a better name? It was UNICODE_CHARACTERCLASS, but then it
became UNICODE_CHARSET, since the flag also enables UNICODE_CASE.
-Sherman
On 4/23/2011 1:00 AM, Xueming Shen wrote:
> Hi
>
> This proposal tries to address
>
> (1) j.u.regex does not meet the Unicode regex Simple Word Boundaries
> [1] requirement, as Tom pointed
> out in his email on the i18n-dev list [2]. Basically we have 3 problems here.
>
> a. The j.u.regex word boundary constructs \b and \B use Unicode
> \p{letter} + \p{digit} as the "word"
> definition, whereas the standard requires the true Unicode
> \p{Alphabetic} property to be used instead.
> They also neglect two of the specifically required characters:
> U+200C ZERO WIDTH NON-JOINER
> U+200D ZERO WIDTH JOINER
> (or the "word" could be \p{alphabetic} + \p{gc=Mark} +
> \p{digit} + \p{gc=Connector_Punctuation}, if
> we follow Annex C).
> b. The j.u.regex word constructs \w and \W are ASCII-only.
> c. Together, a) and b) break the historical connection between
> word characters and word boundaries.
> For example, "élève" is NOT matched by the \b\w+\b
> pattern.
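A minimal Java sketch of problem (1), assuming a JDK 7 or later runtime (needed for the \p{IsAlphabetic} syntax). The class name WordBoundaryDemo and the hand-written ANNEX_C_WORD character class are illustrative only and are not taken from the webrev. The sketch uses matches() (a whole-input match) so the result does not depend on how a particular JDK release defines \b.

```java
import java.util.regex.Pattern;

public class WordBoundaryDemo {
    // The French word from the example above, spelled with Unicode escapes
    // so that the source file encoding does not matter.
    static final String WORD = "\u00e9l\u00e8ve";

    // Default mode: \w is ASCII-only ([a-zA-Z_0-9]), so it cannot consume
    // the accented letters and the whole-input match fails.
    static boolean defaultWordMatch(String s) {
        return Pattern.compile("\\b\\w+\\b").matcher(s).matches();
    }

    // Hand-written approximation of the TR#18 Annex C "word" definition:
    // Alphabetic + Mark + decimal digit + Connector_Punctuation + ZWNJ/ZWJ.
    static final Pattern ANNEX_C_WORD = Pattern.compile(
        "[\\p{IsAlphabetic}\\p{M}\\p{Nd}\\p{Pc}\\u200c\\u200d]+");

    static boolean annexCWordMatch(String s) {
        return ANNEX_C_WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(defaultWordMatch("eleve")); // true
        System.out.println(defaultWordMatch(WORD));    // false
        System.out.println(annexCWordMatch(WORD));     // true
    }
}
```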
>
> (2) j.u.regex does not meet the Unicode regex Properties requirement
> [3][5][6][7]. The main issues are
>
> a. Alphabetic: totally missing from the platform, not only regex
> b. Lowercase, Uppercase and White_Space: the Java implementation (via
> \p{javaMethod}) is different
> from the Unicode Standard definition.
> c. j.u.regex's POSIX character classes are ASCII-only, while the
> standard has a Unicode version defined
> at tr#18 Annex C [3]
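Issues (2)b and (2)c can be observed with the existing API alone. The sketch below is illustrative (PropertyGapDemo and its helper names are made up for this note); it only uses Pattern.matches and the long-standing \p{Lower} and \p{javaWhitespace} classes. U+00A0 NO-BREAK SPACE is used because Unicode lists it as White_Space while Character.isWhitespace() documents it as excluded.

```java
import java.util.regex.Pattern;

public class PropertyGapDemo {
    // (2)c: POSIX classes such as \p{Lower} are ASCII-only by default.
    static boolean posixLower(String s) {
        return Pattern.matches("\\p{Lower}", s);
    }

    // (2)b: \p{javaWhitespace} delegates to Character.isWhitespace(), which
    // excludes NO-BREAK SPACE even though Unicode lists it as White_Space.
    static boolean javaWhitespace(String s) {
        return Pattern.matches("\\p{javaWhitespace}", s);
    }

    public static void main(String[] args) {
        System.out.println(posixLower("e"));          // true
        System.out.println(posixLower("\u00e9"));     // false: U+00E9 is not ASCII
        System.out.println(javaWhitespace(" "));      // true
        System.out.println(javaWhitespace("\u00a0")); // false: U+00A0 NO-BREAK SPACE
    }
}
```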
>
> As the solution, I propose to
>
> (1) add a flag UNICODE_UNICODE to
> a) flip the ASCII-only predefined character classes (\b \B \w \W
> \d \D \s \S) and POSIX character
> classes (\p{alpha}, \p{lower}, \p{upper}...) to their Unicode versions
> b) enable UNICODE_CASE (anything Unicode)
>
> Ideally we would just evolve/upgrade Java regex
> from the aged "ascii-only"
> to Unicode (maybe adding an OLD_ASCII_ONLY_POSIX as a fallback :-)),
> as Perl did. But
> given Java's "compatibility" spirit (and the performance
> concern as well), this is unlikely to
> happen.
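For reference, the flag that later shipped in JDK 7 is named Pattern.UNICODE_CHARACTER_CLASS (embedded form (?U)), not any of the names floated in this thread. The sketch below assumes that released flag corresponds to proposal (1); the class name UnicodeFlagDemo and its helper are illustrative only.

```java
import java.util.regex.Pattern;

public class UnicodeFlagDemo {
    // Compiles the regex with the given flags and tests the whole input.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        int u = Pattern.UNICODE_CHARACTER_CLASS;
        String word = "\u00e9l\u00e8ve";

        // (1)a: \b and \w switch from ASCII-only to the Unicode definition.
        System.out.println(matches("\\b\\w+\\b", 0, word)); // false
        System.out.println(matches("\\b\\w+\\b", u, word)); // true
        System.out.println(matches("(?U)\\w+", 0, word));   // true

        // U+0663 ARABIC-INDIC DIGIT THREE is a decimal digit outside ASCII.
        System.out.println(matches("\\d", 0, "\u0663"));    // false
        System.out.println(matches("\\d", u, "\u0663"));    // true

        // (1)b: the flag implies UNICODE_CASE, so U+00C9 matches U+00E9
        // case-insensitively only when the flag is set.
        int ci = Pattern.CASE_INSENSITIVE;
        System.out.println(matches("\u00c9", ci, "\u00e9"));     // false
        System.out.println(matches("\u00c9", ci | u, "\u00e9")); // true
    }
}
```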
>
> (2) add \p{IsBinaryProperty} to explicitly support some important
> Unicode binary properties, such
> as \p{IsAlphabetic}, \p{IsIdeographic}, \p{IsPunctuation}... With
> this, j.u.regex can easily access
> properties that are either not provided by j.l.Character
> directly or that j.l.Character defines
> differently (for example White_Space).
> (The missing Alphabetic and the differing Uppercase/Lowercase issues have
> been/are being addressed at
> Cr#7037261 [4], any reviewer?)
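A short sketch of the \p{IsBinaryProperty} syntax, assuming a JDK 7 or later runtime where the Pattern documentation lists Alphabetic, Ideographic, Punctuation and White_Space among the supported binary properties. The class name BinaryPropertyDemo and the is() helper are illustrative only.

```java
import java.util.regex.Pattern;

public class BinaryPropertyDemo {
    // Builds \p{Is<property>} and tests it against the whole input.
    static boolean is(String property, String input) {
        return Pattern.matches("\\p{Is" + property + "}", input);
    }

    public static void main(String[] args) {
        System.out.println(is("Alphabetic", "\u00e9"));  // true: U+00E9
        System.out.println(is("Ideographic", "\u4e2d")); // true: U+4E2D, a CJK ideograph
        System.out.println(is("Punctuation", "!"));      // true
        // U+00A0 NO-BREAK SPACE: White_Space in Unicode, but rejected by
        // Character.isWhitespace() and therefore by \p{javaWhitespace}.
        System.out.println(is("White_Space", "\u00a0")); // true
    }
}
```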
>
> The webrev is at
> http://cr.openjdk.java.net/~sherman/7039066/webrev/
>
> The corresponding updated j.u.regex.Pattern API doc is at
> http://cr.openjdk.java.net/~sherman/7039066/Pattern.html
>
> Specdiff result is at
> http://cr.openjdk.java.net/~sherman/7039066/specdiff/diff.html
>
> I will file the CCC request if the API change proposal in the webrev is
> approved. This is coming in very late,
> so it may be held back until Java 8 if it cannot
> make the cutoff for jdk7.
>
> -Sherman
>
>
> [1] http://www.unicode.org/reports/tr18/#Simple_Word_Boundaries
> [2] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000256.html
> [3] http://www.unicode.org/reports/tr18/#Compatibility_Properties
> [4] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-April/000370.html
> [5] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000249.html
> [6] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000253.html
> [7] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000254.html