Codereview Request: 7039066 j.u.regex does not match TR#18 RL1.4 Simple Word Boundaries and RL1.2 Properties

Sat Apr 23 08:12:10 UTC 2011

  The flag this request proposes to add is

  UNICODE_CHARSET

not the "UNICODE_UNICODE" in my last email.

My apologies for the typo.

Any suggestions for a better name? It was UNICODE_CHARACTERCLASS, but then it
became UNICODE_CHARSET, considering that it also covers UNICODE_CASE.

-Sherman

On 4/23/2011 1:00 AM, Xueming Shen wrote:
>  Hi
>
> This proposal tries to address
>
> (1) j.u.regex does not meet the Unicode regex Simple Word Boundaries [1]
> requirement, as Tom pointed out in his email on the i18n-dev list [2].
> Basically we have three problems here.
>
>     a. j.u.regex's word boundary constructs \b and \B use Unicode
>         \p{letter} + \p{digit} as the "word" definition, when the standard
>         requires that the true Unicode \p{Alphabetic} property be used
>         instead. It also neglects two of the specifically required
>         characters:
>         U+200C ZERO WIDTH NON-JOINER
>         U+200D ZERO WIDTH JOINER
>         (or the "word" could be \p{alphabetic} + \p{gc=Mark} + \p{digit} +
>         \p{gc=Connector_Punctuation}, if we follow Annex C).
>     b. j.u.regex's word constructs \w and \W are ASCII-only.
>     c. It breaks the historical connection between word characters and
>         word boundaries (because of a. and b.). For example, "élève" is
>         NOT matched by the \b\w+\b pattern.
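
To make point 1.c concrete, here is a minimal sketch. It assumes a JDK that has the flag under the name it eventually shipped with in JDK 7, Pattern.UNICODE_CHARACTER_CLASS (not the name proposed in this thread). The class and method names are made up for illustration and are not part of the webrev.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the webrev.
public class WordBoundaryDemo {

    // True if the whole input is a single "word" according to \b\w+\b.
    static boolean isSingleWord(String input, int flags) {
        return Pattern.compile("\\b\\w+\\b", flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // The French word from the example, written with escapes so the
        // source file encoding does not matter.
        String word = "\u00e9l\u00e8ve";

        // Default flags: \w is ASCII-only, so the accented letters fail.
        System.out.println(isSingleWord(word, 0));
        // Unicode flag: \w and \b both use the Unicode definitions.
        System.out.println(isSingleWord(word, Pattern.UNICODE_CHARACTER_CLASS));
    }
}
```

With default flags the first call should report false, and with the flag the second should report true. An all-ASCII word such as "eleve" passes either way.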
>
> (2) j.u.regex does not meet the Unicode regex Properties requirement
> [3][5][6][7]. The main issues are
>
>     a. Alphabetic: totally missing from the platform, not only from regex
>     b. Lowercase, Uppercase and White_Space: the Java implementation (via
>         \p{javaMethod}) differs from the Unicode Standard definition.
>     c. j.u.regex's POSIX character classes are ASCII-only, while the
>         standard has a Unicode version defined at tr#18 Annex C [3]
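
A small sketch of point 2.c, for illustration only. It assumes a JDK where the flag exists as Pattern.UNICODE_CHARACTER_CLASS (the name it shipped with in JDK 7, not the one proposed here); PosixClassDemo and its helper are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the webrev.
public class PosixClassDemo {

    // True if the whole input matches the given regex under the given flags.
    static boolean matches(String regex, String input, int flags) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // LATIN SMALL LETTER E WITH ACUTE
        int unicode = Pattern.UNICODE_CHARACTER_CLASS;

        // POSIX classes are ASCII-only by default.
        System.out.println(matches("\\p{Alpha}", eAcute, 0));
        System.out.println(matches("\\p{Lower}", eAcute, 0));
        // The java.lang.Character-backed class already treats it as lowercase.
        System.out.println(matches("\\p{javaLowerCase}", eAcute, 0));
        // With the flag, the POSIX classes switch to the Unicode definitions.
        System.out.println(matches("\\p{Alpha}", eAcute, unicode));
        System.out.println(matches("\\p{Lower}", eAcute, unicode));
    }
}
```

The first two checks should come out false and the remaining three true, which is the inconsistency between the POSIX classes and the rest of the platform described above.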
>
> As a solution, I propose to
>
> (1) add a flag UNICODE_UNICODE to
>     a) flip the ASCII-only predefined character classes (\b \B \w \W
>         \d \D \s \S) and POSIX character classes (\p{alpha}, \p{lower},
>         \p{upper}...) to their Unicode versions
>     b) enable UNICODE_CASE (anything Unicode)
>
>     Ideally we would just evolve/upgrade Java regex from the aged
>     "ASCII-only" behavior to Unicode (maybe adding an OLD_ASCII_ONLY_POSIX
>     flag as a fallback :-)), as Perl did. But given Java's
>     "compatibility" spirit (and the performance concern as well), this is
>     unlikely to happen.
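
The two effects of the flag, (1)a and (1)b, can be sketched as follows. This assumes the flag as it shipped in JDK 7, Pattern.UNICODE_CHARACTER_CLASS, rather than the name proposed in this thread; UnicodeFlagDemo and its helper are hypothetical names chosen for the example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the webrev.
public class UnicodeFlagDemo {

    // True if the whole input matches the given regex under the given flags.
    static boolean matches(String regex, String input, int flags) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        int unicode = Pattern.UNICODE_CHARACTER_CLASS;
        String arabicIndicThree = "\u0663"; // ARABIC-INDIC DIGIT THREE
        String noBreakSpace = "\u00a0";     // NO-BREAK SPACE

        // (1)a: predefined classes are ASCII-only unless the flag is set.
        System.out.println(matches("\\d", arabicIndicThree, 0));
        System.out.println(matches("\\d", arabicIndicThree, unicode));
        System.out.println(matches("\\s", noBreakSpace, 0));
        System.out.println(matches("\\s", noBreakSpace, unicode));

        // (1)b: the flag implies UNICODE_CASE, so case-insensitive matching
        // folds non-ASCII letters as well (E WITH ACUTE, upper vs lower).
        int ci = Pattern.CASE_INSENSITIVE;
        System.out.println(matches("\u00c9", "\u00e9", ci));
        System.out.println(matches("\u00c9", "\u00e9", ci | unicode));
    }
}
```

Each pair should print false for the default behavior and true once the flag is added. Keeping the Unicode behavior behind a flag is what preserves compatibility for existing patterns.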
>
> (2) add \p{IsBinaryProperty} to explicitly support some important
>     Unicode binary properties, such as \p{IsAlphabetic},
>     \p{IsIdeographic}, \p{IsPunctuation}... With this, j.u.regex can
>     easily access some properties that are either not provided by
>     j.l.Character directly or for which j.l.Character has a different
>     version (for example White_Space).
>     (The missing Alphabetic and the differing Uppercase/Lowercase issues
>     have been/are being addressed in CR#7037261 [4]; any reviewer?)
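
As an illustration of (2), the sketch below uses the Is-prefixed binary properties without any flag. It assumes a JDK in which these properties are available (they are in JDK 7 and later); BinaryPropertyDemo and its helper are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the webrev.
public class BinaryPropertyDemo {

    // True if the whole input matches the given regex with default flags.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // COMBINING GREEK YPOGEGRAMMENI: a nonspacing mark, not a letter,
        // yet it carries the Unicode Alphabetic property.
        String ypogegrammeni = "\u0345";
        System.out.println(matches("\\p{L}", ypogegrammeni));
        System.out.println(matches("\\p{IsAlphabetic}", ypogegrammeni));

        // A CJK unified ideograph.
        System.out.println(matches("\\p{IsIdeographic}", "\u4e2d"));

        // NO-BREAK SPACE: Character.isWhitespace excludes it, while the
        // Unicode White_Space property includes it.
        String noBreakSpace = "\u00a0";
        System.out.println(matches("\\p{javaWhitespace}", noBreakSpace));
        System.out.println(matches("\\p{IsWhite_Space}", noBreakSpace));
    }
}
```

The \p{L} and \p{javaWhitespace} checks should come out false while the three Is-prefixed checks come out true, showing both a property j.l.Character did not expose and one where its definition differs.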
>
> The webrev is at
> http://cr.openjdk.java.net/~sherman/7039066/webrev/
>
> The corresponding updated j.u.regex.Pattern API doc is at
> http://cr.openjdk.java.net/~sherman/7039066/Pattern.html
>
> Specdiff result is at
> http://cr.openjdk.java.net/~sherman/7039066/specdiff/diff.html
>
> I will file the CCC request if the API change proposal in the webrev is
> approved. This is coming in very late, so it may be held back until
> Java 8 if it cannot make the cutoff for JDK 7.
>
> -Sherman
>
>
> [1] http://www.unicode.org/reports/tr18/#Simple_Word_Boundaries
> [2] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000256.html
> [3] http://www.unicode.org/reports/tr18/#Compatibility_Properties
> [4] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-April/000370.html
> [5] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000249.html
> [6] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000253.html
> [7] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000254.html