Codereview Request: 7039066 j.u.rgex does not match TR#18 RL1.4 Simple Word Boundaries and RL1.2 Properties
Hi,

This proposal tries to address

(1) j.u.regex does not meet the Unicode regex Simple Word Boundaries [1] requirement, as Tom pointed out in his email on the i18n-dev list [2]. Basically we have 3 problems here.

a. The j.u.regex word boundary constructs \b and \B use Unicode \p{letter} + \p{digit} as the "word" definition, when the standard requires that the true Unicode \p{Alphabetic} property be used instead. It also neglects two of the specifically required characters:
   U+200C ZERO WIDTH NON-JOINER
   U+200D ZERO WIDTH JOINER
   (or the "word" could be \p{alphabetic} + \p{gc=Mark} + \p{digit} + \p{gc=Connector_Punctuation}, if we follow Annex C).
b. j.u.regex's word constructs \w and \W are ASCII-only.
c. It breaks the historical connection between word characters and word boundaries (because of a and b). For example, "élève" is NOT matched by the \b\w+\b pattern.

(2) j.u.regex does not meet the Unicode regex Properties requirement [3][5][6][7]. The main issues are

a. Alphabetic: totally missing from the platform, not only from regex.
b. Lowercase, Uppercase and White_Space: the Java implementation (via \p{javaMethod}) differs from the Unicode Standard definition.
c. j.u.regex's POSIX character classes are ASCII-only, when the standard has a Unicode version defined in tr#18 Annex C [3].

As the solution, I propose to

(1) add a flag UNICODE_UNICODE to
a) flip the ASCII-only predefined character classes (\b \B \w \W \d \D \s \S) and POSIX character classes (\p{alpha}, \p{lower}, \p{upper}...) to their Unicode versions
b) enable UNICODE_CASE (anything Unicode)

Ideally we would just evolve/upgrade the Java regex from the aged "ascii-only" to Unicode (maybe adding an OLD_ASCII_ONLY_POSIX as a fallback :-)), as Perl did. But given Java's "compatibility" spirit (and the performance concern as well), this is unlikely to happen.

(2) add \p{IsBinaryProperty} to explicitly support some important Unicode binary properties, such as \p{IsAlphabetic}, \p{IsIdeographic}, \p{IsPunctuation}... With this, j.u.regex can easily access some properties that are either not provided by j.l.Character directly or for which j.l.Character has a different version (for example White_Space). (The missing Alphabetic and the differing uppercase/lowercase issue has been/is being addressed in CR#7037261 [4]; any reviewer?)

The webrev is at http://cr.openjdk.java.net/~sherman/7039066/webrev/

The corresponding updated j.u.regex.Pattern API doc is at http://cr.openjdk.java.net/~sherman/7039066/Pattern.html

The specdiff result is at http://cr.openjdk.java.net/~sherman/7039066/specdiff/diff.html

I will file the CCC request if the API change proposal in the webrev is approved. This is coming in very late, so it may be held back until Java 8 if it cannot make the cutoff for jdk7.

-Sherman

[1] http://www.unicode.org/reports/tr18/#Simple_Word_Boundaries
[2] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000256.html
[3] http://www.unicode.org/reports/tr18/#Compatibility_Properties
[4] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-April/000370.html
[5] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000249.html
[6] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000253.html
[7] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000254.html
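The two problem areas in the proposal can be reproduced with a short program. This is only a sketch, not code from the webrev: the flag name was still undecided in this thread, so the example assumes Pattern.UNICODE_CHARACTER_CLASS, the name under which the flag is available in JDK 7 and later. The class and method names (UnicodeWordDemo, matchesDefault, matchesUnicode, matches) are made up for illustration.

```java
import java.util.regex.Pattern;

final class UnicodeWordDemo {
    // "élève", spelled with escapes so the source encoding does not matter.
    static final String WORD = "\u00e9l\u00e8ve";

    // Default mode: \w is ASCII-only, so the accented letters are not word characters.
    static boolean matchesDefault(String s) {
        return Pattern.compile("\\b\\w+\\b").matcher(s).matches();
    }

    // Unicode mode (assumed flag name, see lead-in): \w and \b use the Unicode definitions.
    static boolean matchesUnicode(String s) {
        return Pattern.compile("\\b\\w+\\b", Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(s).matches();
    }

    // Whole-input match of a single regex, used for the property checks below.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matchesDefault(WORD));   // false
        System.out.println(matchesUnicode(WORD));   // true

        // U+2160 ROMAN NUMERAL ONE has general category Nl: Alphabetic, but not a Letter.
        System.out.println(matches("\\p{L}", "\u2160"));             // false
        System.out.println(matches("\\p{IsAlphabetic}", "\u2160"));  // true

        // U+00A0 NO-BREAK SPACE has White_Space, but the default \s does not match it.
        System.out.println(matches("\\s", "\u00a0"));                // false
        System.out.println(matches("\\p{IsWhite_Space}", "\u00a0")); // true
    }
}
```

The checks use matches() rather than find() on purpose: a whole-input match makes the ASCII-only behaviour of \w visible regardless of how a given JDK release defines \b.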
The flag this request proposes to add is

UNICODE_CHARSET

not the "UNICODE_UNICODE" in my last email. My apologies for the typo.

Any suggestions for a better name? It was UNICODE_CHARACTERCLASS, but then it became UNICODE_CHARSET, considering the unicode_case.

-Sherman

(In reply to Xueming Shen's message of 4/23/2011 1:00 AM.)
The changes sound good.

The flag UNICODE_CHARSET will be misleading, since all of Java uses the Unicode Charset (= encoding). How about:

UNICODE_SPEC

or something that gives that flavor.

Mark
*— Il meglio è l’inimico del bene —*

(In reply to Xueming Shen <xueming.shen@oracle.com>, Sat, Apr 23, 2011 at 01:12.)
Mark Davis ☕ <mark@macchiato.com> wrote on Sat, 23 Apr 2011 09:09:55 PDT:
> The changes sound good.
They sure do, don't they? I'm quite happy about this. I think it is more important to get this in the queue than that it (necessarily) be done for JDK7. That said, having a good tr18 RL1 story for JDK7's Unicode 6.0 debut makes it attractive now. But if not now, then soon is good enough.
> The flag UNICODE_CHARSET will be misleading, since all of Java uses the Unicode Charset (= encoding). How about:
>
> UNICODE_SPEC
>
> or something that gives that flavor.
I hadn't thought of that, but I do see what you mean. The idea is that the semantics of \w etc. change to match the Unicode spec in tr18. I worry that UNICODE_SPEC, or even UNICODE_SEMANTICS, might be too broad a brush.

What then happens when, as I imagine it someday shall, Java gets full support for RL2.3 boundaries, the kind one uses (?w) or UREGEX_UWORD for in ICU? Wouldn't calling something UNICODE_SPEC be too broad? Or should UNICODE_SPEC automatically include not just existing Unicode flags like UNICODE_CASE, but also any UREGEX_UWORD that comes along? If it does, you have a back-compat issue, and if it doesn't, you have a misnaming issue. Seems like a bit of a Catch-22.

The reason I'd suggested UNICODE_CHARSET was because of my own background with the names we use for this within the Perl regex source code (which is itself written in C). I believe that Java doesn't have the same situation as gave rise to it in Perl, and perhaps something else would be clearer. Here's some background for why we felt we had to go that way.

To control the behavior of \w and such, a compiled Perl regex gets exactly one of these states:

    REGEX_UNICODE_CHARSET
    REGEX_LOCALE_CHARSET
    REGEX_ASCII_RESTRICTED_CHARSET
    REGEX_DEPENDS_CHARSET

That state it normally inherits from the surrounding lexical scope, although this can be overridden with /u and /a, or (?u) and (?a), either within the pattern or as a separate pattern-compilation flag.

REGEX_UNICODE_CHARSET corresponds to our (?u), so \w and such all get the full RL1.2a definitions. Because Perl always does Unicode casemapping -- and full casemapping, too, not just simple -- we didn't need (?u) for what Java uses it for, which is just as an extra flavor of (?i); it doesn't do all that much.

(BTW, the old default is *not* some sort of non-Unicode charset semantics, it's the ugly REGEX_DEPENDS_CHARSET, which is Unicode for code points > 255 and "maybe" so in the 128-255 range.)
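For readers less familiar with the Java side of that comparison, here is a small sketch of what Java's existing (?u) / UNICODE_CASE does: it only affects case-insensitive matching and leaves \w and friends alone. The class and helper names are hypothetical and used only for this illustration.

```java
import java.util.regex.Pattern;

final class UnicodeCaseDemo {
    // Compile with the given flags and require the whole input to match.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String lower = "\u00e9"; // é
        String upper = "\u00c9"; // É

        // (?i) alone folds case for US-ASCII characters only.
        System.out.println(matches(lower, Pattern.CASE_INSENSITIVE, upper)); // false

        // Adding UNICODE_CASE makes the case folding Unicode-aware.
        System.out.println(matches(lower,
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE, upper));    // true

        // The same thing expressed with embedded flags.
        System.out.println(matches("(?iu)" + lower, 0, upper));              // true

        // UNICODE_CASE by itself does not change what \w matches.
        System.out.println(matches("\\w", Pattern.UNICODE_CASE, lower));     // false
    }
}
```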
What we did certainly isn't perfect, but it allows for both backwards compat and future growth. This was because people want(ed) to be able to use regexes on byte arrays yet also on character strings. Me, I think it's nuts to support this at all: if you want an input stream in (say) CP1251 or ISO 8859-2, you simply set that stream's encoding and be done with it; everything turns into characters internally. But there's old byte and locale code out there whose semantics we are loth to change out from under people. Java has the same kind of issue.

The reason we ever support anything else is because we got (IMHO nasty) POSIX locales before we got Unicode support, which didn't happen till toward the end of the last millennium. So we're stuck supporting code well more than a decade old, perhaps indefinitely. It's messy, but it is very hard to do anything about that. I think Java shares in that perspective.

This corresponds, I think, to Java needing to support pre-Unicode regex semantics on \w and related escapes. If Java had started out with \w always meaning the real thing, the way ICU did, it wouldn't need both.

I wish there were a pragma to control this on a per-lexical-scope basis, but I don't know enough about the Java compiler's internals to begin to know how to go about implementing something like that, even as a -XX:+UseUnicodeSemantics CLI switch for that compilation unit.

One reason you want this is because the Java String class has these "convenience" methods like matches, replaceAll, etc., that take regexes but do not provide an API that admits Pattern compile flags. If there is no way to embed a (?U) directive or some such, nor any way to pass in a Pattern.UNICODE_something flag, those methods can never get the Unicode semantics. The Java String API could also be broadened through method signature overloading, but for now, you can't do that.

No matter what the UNICODE_something gets called, I think there needs to be a corresponding embeddable (?X)-style flag as well.
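The String convenience-method case can be sketched as follows. The thread had not yet settled on a letter for the embedded flag; this example assumes (?U), which is the embedded form of UNICODE_CHARACTER_CLASS in JDK 7 and later. The class name StringRegexDemo is hypothetical.

```java
final class StringRegexDemo {
    // "élève", spelled with escapes so the source encoding does not matter.
    static final String WORD = "\u00e9l\u00e8ve";

    public static void main(String[] args) {
        // String.matches takes only a regex string; there is no flags parameter.
        System.out.println(WORD.matches("\\w+"));      // false: ASCII-only \w

        // An embedded flag is the only way to ask for Unicode semantics here.
        System.out.println(WORD.matches("(?U)\\w+"));  // true

        // The same applies to replaceAll and the other regex-taking String methods.
        System.out.println(("x " + WORD).replaceAll("(?U)\\w+", "#")); // "# #"
    }
}
```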
Even if String were broadened, you'd want people to be able to specify *within the regex* that that regex should have full Unicode semantics. After all, they might read the pattern in from a file. That's why (most) Pattern.compile flags need to be able to be embedded, too.

But you knew that already. :)

--tom
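The "pattern read from a file" scenario might look like the sketch below. The file contents are simulated with an in-memory string, the loader (compileAll) is a hypothetical helper, and (?U) is again assumed to be the embedded flag as it exists in JDK 7 and later. The point is that the loading code compiles every line the same way, so only the pattern text can select Unicode semantics.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class LoadedPatternDemo {
    // Stand-in for a pattern file with one regex per line (hypothetical content).
    static final String PATTERN_FILE = "\\w+\n(?U)\\w+\n";

    // Generic loader: it knows nothing about the patterns and passes no flags.
    static List<Pattern> compileAll(String fileContents) {
        List<Pattern> patterns = new ArrayList<>();
        for (String line : fileContents.split("\n")) {
            if (!line.isEmpty()) {
                patterns.add(Pattern.compile(line));
            }
        }
        return patterns;
    }

    public static void main(String[] args) {
        String word = "\u00e9l\u00e8ve"; // "élève"
        for (Pattern p : compileAll(PATTERN_FILE)) {
            // Only the line that embeds (?U) matches the accented word.
            System.out.println(p.pattern() + " -> " + p.matcher(word).matches());
        }
    }
}
```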
participants (3): Mark Davis ☕, Tom Christiansen, Xueming Shen