Hi

This proposal tries to address

(1) j.u.regex does not meet Unicode regex's Simple Word Boundaries [1]
    requirement, as Tom pointed out in his email on the i18n-dev list [2].
    Basically we have 3 problems here.

    a. j.u.regex's word boundary constructs \b and \B use Unicode
       \p{letter} + \p{digit} as the "word" definition, when the standard
       requires the true Unicode \p{Alphabetic} property be used instead.
       They also neglect two of the specifically required characters:
           U+200C ZERO WIDTH NON-JOINER
           U+200D ZERO WIDTH JOINER
       (or the "word" could be \p{alphabetic} + \p{gc=Mark} + \p{digit} +
       \p{gc=Connector_Punctuation}, if we follow Annex C).

    b. j.u.regex's word constructs \w and \W are ASCII-only.

    c. It breaks the historical connection between word characters and
       word boundaries (because of a. and b.). For example, "élève" is
       NOT matched by the \b\w+\b pattern.

(2) j.u.regex does not meet Unicode regex's Properties requirement
    [3][5][6][7]. The main issues are

    a. Alphabetic: totally missing from the platform, not only from regex.

    b. Lowercase, Uppercase and White_Space: the Java implementation (via
       \p{javaMethod}) differs from the Unicode Standard definition.

    c. j.u.regex's POSIX character classes are ASCII-only, when the
       standard has a Unicode version defined in tr#18 Annex C [3].

As the solution, I propose to

(1) add a flag UNICODE_UNICODE to

    a) flip the ASCII-only predefined character classes (\b \B \w \W \d \D
       \s \S) and POSIX character classes (\p{alpha}, \p{lower},
       \p{upper}...) to their Unicode versions
    b) enable UNICODE_CASE (anything Unicode)

    Ideally we would just evolve/upgrade the Java regex from the aged
    "ascii-only" to Unicode (maybe adding an OLD_ASCII_ONLY_POSIX as a
    fallback :-)), like what Perl did. But given Java's "compatibility"
    spirit (and the performance concern as well), this is unlikely to
    happen.
(2) add \p{IsBinaryProperty} to explicitly support some important Unicode
    binary properties, such as \p{IsAlphabetic}, \p{IsIdeographic},
    \p{IsPunctuation}... With this, j.u.regex can easily access some
    properties that are either not provided by j.l.Character directly, or
    for which j.l.Character has a different version (for example
    White_Space).

    (The missing Alphabetic and the different uppercase/lowercase issues
    have been/are being addressed at Cr#7037261 [4], any reviewer?)

The webrev is at
http://cr.openjdk.java.net/~sherman/7039066/webrev/

The corresponding updated j.u.regex.Pattern API doc is at
http://cr.openjdk.java.net/~sherman/7039066/Pattern.html

Specdiff result is at
http://cr.openjdk.java.net/~sherman/7039066/specdiff/diff.html

I will file the CCC request if the API change proposal in the webrev is
approved.

This is coming in very late, so it is possible that it may be held back
until Java 8 if it cannot make the cutoff for jdk7.

-Sherman

[1] http://www.unicode.org/reports/tr18/#Simple_Word_Boundaries
[2] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000256.html
[3] http://www.unicode.org/reports/tr18/#Compatibility_Properties
[4] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-April/000370.html
[5] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000249.html
[6] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000253.html
[7] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000254.html
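
For illustration, the ASCII-only behaviour described in problems (1) b./c. and (2) c. can be reproduced with a small program. This is only a sketch: it uses constructs that j.u.regex already has, the class name RegexUnicodeGaps and its helper are made up for the example, and the [\p{L}\p{M}\p{Nd}\p{Pc}] class is just a rough approximation of the Annex C "word" (it substitutes \p{L} for the Alphabetic property, so it is not the proposed definition).

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the webrev.
class RegexUnicodeGaps {
    // The example word from problem (1) c., written with Unicode escapes
    // so the result does not depend on the source file encoding.
    static final String ELEVE = "\u00e9l\u00e8ve";

    // True if the entire input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // \w is ASCII-only, so the accented letters are not word characters.
        System.out.println(matches("\\b\\w+\\b", ELEVE));                   // false

        // Approximation of a Unicode "word" using general categories only.
        // Assumption: \p{L} stands in for Alphabetic here.
        System.out.println(matches("[\\p{L}\\p{M}\\p{Nd}\\p{Pc}]+", ELEVE)); // true

        // The POSIX class is ASCII-only, while the j.l.Character-backed
        // property accepts the same lowercase letter.
        System.out.println(matches("\\p{Lower}", "\u00e9"));                // false
        System.out.println(matches("\\p{javaLowerCase}", "\u00e9"));        // true
    }
}
```

The sketch uses whole-input matching rather than find() on purpose, so that the result depends only on what \w accepts and not on where \b happens to report a boundary.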