Codereview request for 7014640: To add a metachar \R for line ending and character classes for vertical/horizontal ws \v \V \h \H
Xueming Shen
xueming.shen at oracle.com
Tue May 1 18:06:55 UTC 2012
Hi,
I just noticed that the webrev URL was pointing to the blenderrev. The webrev
is at
http://cr.openjdk.java.net/~sherman/7014640/webrev
Btw, this one has been approved by CCC.
thanks,
-Sherman
On 04/21/2012 12:56 AM, Xueming Shen wrote:
> Hi
>
> Here are the webrev and blenderrev for the proposed change to add 5
> new regex constructs: \R \v \V \h \H.
>
> \R: recommended by Unicode Regex TR#18 for matching all line ending
> characters and sequences, is equivalent to
> ( \u000D\u000A | [\u000A\u000B\u000C\u000D\u0085\u2028\u2029] )
>
> \h, \v, \H and \V:
> match any character that is (for \h and \v) or is not (for \H and \V)
> horizontal/vertical whitespace.
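
To illustrate the equivalence given for \R above, here is a minimal sketch, assuming a JDK in which these constructs are available (Java 8 or later). The class and method names are made up for illustration and are not taken from the webrev.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class; the names here are illustrative only.
final class LineEndingSketch {
    // The explicit expression quoted in the message as equivalent to \R:
    // CRLF as one unit, or any single line-ending character.
    static final Pattern EXPLICIT = Pattern.compile(
        "\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]");

    // The proposed shorthand.
    static final Pattern SHORTHAND = Pattern.compile("\\R");

    static String[] splitExplicit(String text) {
        return EXPLICIT.split(text, -1);
    }

    static String[] splitShorthand(String text) {
        return SHORTHAND.split(text, -1);
    }

    public static void main(String[] args) {
        // Mixed line endings: CRLF, LF, and LINE SEPARATOR.
        String text = "one\r\ntwo\nthree\u2028four";
        System.out.println(String.join("|", splitShorthand(text)));
        System.out.println(Arrays.equals(splitExplicit(text), splitShorthand(text)));
    }
}
```

Both patterns should split the sample text at the same places, with the CRLF pair treated as a single separator rather than two.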
>
> Webrev:
> http://cr.openjdk.java.net/~sherman/7014640/blenderrev.html
>
> Blenderrev:
> http://cr.openjdk.java.net/~sherman/7014640/blenderrev.html
>
> new Pattern api
> http://cr.openjdk.java.net/~sherman/7014640/Pattern.html
>
> Here are a couple of notes regarding the spec/implementation.
>
> (1) \v was implemented as \u000B ('\013') but was never documented (it did
> not appear in our API doc as a supported construct, the way \t \r \n do).
> Redefining \v as a "general" construct for all vertical whitespace
> characters might raise some compatibility concerns, since \v now matches
> more characters. But given that this was a never-documented implementation
> detail and \u000B is still matched by \v, I consider this an acceptable
> behavior change.
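
A small sketch of what the widened \v means in practice, assuming Java 8 or later. The helper name isVertical is hypothetical. Under the old undocumented behavior only VT would have matched; with the proposed definition the other vertical whitespace characters match as well.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names here are illustrative only.
final class VerticalWhitespaceSketch {
    static final Pattern VERTICAL = Pattern.compile("\\v");

    // True if the single character c is matched by \v.
    static boolean isVertical(char c) {
        return VERTICAL.matcher(String.valueOf(c)).matches();
    }

    public static void main(String[] args) {
        // VT is still matched, as before the change.
        System.out.println("VT:    " + isVertical('\u000B'));
        // Characters that the general definition adds.
        System.out.println("LF:    " + isVertical('\n'));
        System.out.println("FF:    " + isVertical('\f'));
        System.out.println("CR:    " + isVertical('\r'));
        // Horizontal whitespace is not vertical whitespace.
        System.out.println("space: " + isVertical(' '));
    }
}
```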
>
> (2) A predefined character class can appear inside another character
> class, for example [...\v...]. However, since it represents a "class" of
> characters, it can't be the start or end code point of a range inside a
> class: you can have [a-b], but you can't have [\h-...] or [...-\h] (an
> exception will be thrown). But since \v was implemented as \u000B (VT),
> you were able to use it as the start or end value of a range. I feel it
> is better to keep that working the way it did before, so [\v-\v] works
> and matches VT in this implementation.
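
The two uses of \v inside a character class described in note (2) can be sketched as follows, assuming Java 8 or later. The class name and the compiles helper are hypothetical. The sketch only probes whether a range with \h as an endpoint compiles; it does not assume the answer, since that is worth checking against the JDK actually in use.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class; the names here are illustrative only.
final class ClassMembershipSketch {
    // \v as a member of an enclosing class: 'x' or any vertical whitespace.
    static final Pattern IN_CLASS = Pattern.compile("[x\\v]");

    // \v as both endpoints of a range, kept working for compatibility.
    static final Pattern VT_RANGE = Pattern.compile("[\\v-\\v]");

    // Reports whether the regex compiles instead of letting the exception escape.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(IN_CLASS.matcher("\n").matches());
        System.out.println(VT_RANGE.matcher("\u000B").matches());
        // Probe a range that ends in \h; the result is printed, not assumed.
        System.out.println(compiles("[a-\\h]"));
    }
}
```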
>
> (3) The newly added \h\v\H\V constructs are all "Unicode versions" of
> character classes, while the rest of the "predefined character class"
> family (\d\D\s\S\w\W) is ASCII-only; you have to turn on the
> UNICODE_CHARACTER_CLASS flag to get the Unicode version of those
> constructs. This is kind of inconsistent. Perl's corresponding constructs
> work in a similar way: all of Perl's \d\D\s\S\w\W\v\V\h\H work in Unicode
> mode, and there is an /a modifier to turn \d\D\s\S\w\W back to ASCII mode,
> but not \h\v\H\V. We had the discussion back in JDK7 regarding ASCII vs
> Unicode for these constructs; the decision then was to keep the predefined
> character classes (and POSIX character classes) ASCII by default and to
> have the UNICODE_CHARACTER_CLASS flag turn them into the Unicode version.
> Given that there is NOT an ASCII version in Perl, and we didn't have an
> ASCII version in Java regex that would raise a compatibility concern, I
> feel it is better to just have a simple Unicode version of \h\v\H\V.
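
The inconsistency described in note (3) can be seen with a short sketch, assuming Java 8 or later. The class name and the matches helper are hypothetical; Pattern.UNICODE_CHARACTER_CLASS is the existing JDK 7 flag mentioned above. EM SPACE (U+2003) and ARABIC-INDIC DIGIT THREE (U+0663) serve as non-ASCII test characters.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names here are illustrative only.
final class UnicodeClassSketch {
    // Compiles regex with the given flags and tests the whole input against it.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String emSpace = "\u2003";
        String arabicDigit = "\u0663";
        int unicode = Pattern.UNICODE_CHARACTER_CLASS;

        // \s and \d are ASCII-only unless the flag is turned on.
        System.out.println(matches("\\s", 0, emSpace));
        System.out.println(matches("\\s", unicode, emSpace));
        System.out.println(matches("\\d", 0, arabicDigit));
        System.out.println(matches("\\d", unicode, arabicDigit));

        // \h is Unicode-aware without any flag.
        System.out.println(matches("\\h", 0, emSpace));
    }
}
```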
>
> (4) \R is not a character class, since it also matches the two-character sequence \r\n.
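
A sketch of why \R cannot be a character class, assuming Java 8 or later. The class name and helpers are hypothetical. A character class consumes exactly one character, so it can never match the CRLF pair as a whole, whereas \R can. The sketch also probes whether \R is accepted inside brackets and prints the result rather than assuming it.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class; the names here are illustrative only.
final class LinebreakNotAClassSketch {
    // Tests the whole input against the regex.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // Reports whether the regex compiles instead of letting the exception escape.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String crlf = "\r\n";
        // \R takes CRLF as one unit; a single-character class cannot.
        System.out.println(matchesWhole("\\R", crlf));
        System.out.println(matchesWhole("\\v", crlf));
        System.out.println(matchesWhole("[\\v]", crlf));
        // Probe whether \R is allowed inside a character class.
        System.out.println(compiles("[\\R]"));
    }
}
```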
>
> This one will need to go through the CCC process.
>
> Thanks,
> -Sherman