<i18n dev> Codereview Request: 7039066 j.u.regex does not match TR#18 RL1.4 Simple Word Boundaries and RL1.2 Properties

Sat Apr 23 09:09:55 PDT 2011

The changes sound good. The flag UNICODE_CHARSET will be misleading, since
all of Java uses the Unicode Charset (= encoding). How about:

UNICODE_SPEC

or something that gives that flavor.

Mark

*— Il meglio è l’inimico del bene —*

On Sat, Apr 23, 2011 at 01:12, Xueming Shen <xueming.shen at oracle.com> wrote:

>  The flag this request proposed to add is
>
>  UNICODE_CHARSET
>
> not the "UNICODE_UNICODE" in the last email.
>
> My apologies for the typo.
>
> Any suggestion for a better name? It was UNICODE_CHARACTERCLASS, but then it
> became UNICODE_CHARSET, considering the UNICODE_CASE.
>
> -Sherman
>
>
> On 4/23/2011 1:00 AM, Xueming Shen wrote:
>
>>  Hi
>>
>> This proposal tries to address
>>
>> (1)  j.u.regex does not meet Unicode regex's Simple Word Boundaries [1]
>> requirement, as Tom pointed out in his email on the i18n-dev list [2].
>> Basically we have 3 problems here.
>>
>>    a. j.u.regex's word boundary constructs \b and \B use Unicode
>>        \p{letter} + \p{digit} as the "word" definition, when the standard
>>        requires that the true Unicode \p{Alphabetic} property be used
>>        instead. It also neglects two of the specifically required characters:
>>        U+200C ZERO WIDTH NON-JOINER
>>        U+200D ZERO WIDTH JOINER
>>        (or the "word" could be \p{alphabetic} + \p{gc=Mark} + \p{digit} +
>>        \p{gc=Connector_Punctuation}, if we follow Annex C).
>>    b. j.u.regex's word constructs \w and \W are ASCII-only versions.
>>    c. It breaks the historical connection between word characters and word
>>        boundaries (because of a. and b.; for example, "élève" is NOT
>>        matched by the \b\w+\b pattern).
>>
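The mismatch described in (1) can be sketched as below. This is only an illustration, not code from the webrev. The flag name was still under discussion in this thread; the sketch assumes the name that later shipped in JDK 7, Pattern.UNICODE_CHARACTER_CLASS. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class WordBoundaryDemo {
    // "eleve" with accents, written with escapes so the source encoding does not matter.
    static final String WORD = "\u00e9l\u00e8ve";

    // Default mode: \w is ASCII-only, so it cannot consume the accented letters
    // and the whole-string match fails.
    static boolean matchesDefault(String s) {
        return Pattern.compile("\\b\\w+\\b").matcher(s).matches();
    }

    // Assumption: the Unicode flag as it shipped in JDK 7 (not the name proposed here).
    // With it, \w and \b both use the Unicode word definition.
    static boolean matchesUnicode(String s) {
        return Pattern.compile("\\b\\w+\\b", Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(s)
                      .matches();
    }

    public static void main(String[] args) {
        System.out.println("default: " + matchesDefault(WORD));
        System.out.println("unicode: " + matchesUnicode(WORD));
    }
}
```

Only matches() is checked here, because the default behavior of \b on its own has differed between JDK releases.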
>> (2) j.u.regex does not meet Unicode regex's Properties requirement
>> [3][5][6][7]. The main issues are
>>
>>    a. Alphabetic: totally missing from the platform, not only regex
>>    b. Lowercase, Uppercase and White_Space: the Java implementation (via
>>        \p{javaMethod}) differs from the Unicode Standard definition.
>>    c. j.u.regex's POSIX character classes are ASCII only, when the standard
>>        has a Unicode version defined in tr#18 Annex C [3]
>>
>> As the solution, I propose to
>>
>> (1) add a flag UNICODE_UNICODE to
>>    a) flip the ASCII-only predefined character classes (\b \B \w \W \d \D
>>        \s \S) and POSIX character classes (\p{alpha}, \p{lower},
>>        \p{upper}...) to their Unicode versions
>>    b) enable UNICODE_CASE (anything Unicode)
>>
>>    Ideally we would just evolve/upgrade the Java regex from the aged
>>    "ascii-only" to Unicode (maybe adding an OLD_ASCII_ONLY_POSIX as a
>>    fallback :-)), like what Perl did. But given Java's "compatibility"
>>    spirit (and the performance concern as well), this is unlikely to
>>    happen.
>>
>> (2) add \p{IsBinaryProperty} to explicitly support some important Unicode
>>    binary properties, such as \p{IsAlphabetic}, \p{IsIdeographic},
>>    \p{IsPunctuation}... With this, j.u.regex can easily access some
>>    properties that are either not provided by j.l.Character directly, or
>>    for which j.l.Character has a different version (for example
>>    White_Space).
>>    (The missing Alphabetic and the differing Uppercase/Lowercase issues
>>    have been/are being addressed in Cr#7037261 [4]. Any reviewer?)
>>
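A minimal sketch of what the \p{IsBinaryProperty} syntax from (2) buys, assuming the property names as they shipped in JDK 7 (IsAlphabetic, IsWhite_Space). The helper name `is` and the sample code points are illustrative choices, not taken from the webrev.

```java
import java.util.regex.Pattern;

public class BinaryPropertyDemo {
    // Hypothetical helper: does the single code point match \p{property}?
    static boolean is(String property, int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return Pattern.matches("\\p{" + property + "}", s);
    }

    public static void main(String[] args) {
        int romanOne = 0x2160; // ROMAN NUMERAL ONE: general category Nl, Alphabetic=yes
        int nbsp = 0x00A0;     // NO-BREAK SPACE: White_Space=yes, but not Java whitespace

        // Alphabetic is wider than the Letter general category.
        System.out.println(is("IsAlphabetic", romanOne)); // true
        System.out.println(is("L", romanOne));            // false

        // Unicode White_Space versus Character.isWhitespace (via \p{javaWhitespace}).
        System.out.println(is("IsWhite_Space", nbsp));    // true
        System.out.println(is("javaWhitespace", nbsp));   // false
    }
}
```

The two pairs show the two cases named above: a property the Letter category does not cover, and a property where j.l.Character's version differs from the Unicode one.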
>> The webrev is at
>> http://cr.openjdk.java.net/~sherman/7039066/webrev/
>>
>> The corresponding updated j.u.regex.Pattern API doc is at
>> http://cr.openjdk.java.net/~sherman/7039066/Pattern.html
>>
>> Specdiff result is at
>> http://cr.openjdk.java.net/~sherman/7039066/specdiff/diff.html
>>
>> I will file the CCC request if the API change proposal in the webrev is
>> approved. This is coming in very late, so it may be held back until
>> Java 8 if it cannot make the cutoff for jdk7.
>>
>> -Sherman
>>
>>
>> [1] http://www.unicode.org/reports/tr18/#Simple_Word_Boundaries
>> [2] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000256.html
>> [3] http://www.unicode.org/reports/tr18/#Compatibility_Properties
>> [4] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-April/000370.html
>> [5] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000249.html
>> [6] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000253.html
>> [7] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000254.html
>>
>
>