Unicode script support in Regex and Character class
Martin Buchholz
martinrb at google.com
Sat Apr 24 18:21:20 UTC 2010
Providing script support is obvious and non-controversial,
because other regex programming environments provide it.
Check that the behavior and syntax of the extension are
consistent with e.g. ICU, Python, and especially Perl
(5.12 was just released!)
http://perldoc.perl.org/perlunicode.html
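To make the comparison concrete, here is a minimal sketch of the property=value forms as they can be exercised from Java. It assumes the syntax accepted by java.util.regex.Pattern in Java 7 and later, which may differ in detail from the webrev under review; the class name PropertySyntaxSketch and the helper matches are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: exercises the \p{prop=value} forms against
// java.util.regex.Pattern as shipped in Java 7+, not against the webrev.
class PropertySyntaxSketch {

    // True if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String mongolianA = "\u1820"; // MONGOLIAN LETTER A

        // Script, block and general category, each in property=value form.
        System.out.println(matches("\\p{script=Mongolian}", mongolianA));
        System.out.println(matches("\\p{block=Mongolian}", mongolianA));
        System.out.println(matches("\\p{general_category=Lu}", "A"));

        // The older block form mentioned in this thread.
        System.out.println(matches("\\p{InMongolian}", mongolianA));

        // Negated form: 'A' is Latin, so this does not match.
        System.out.println(matches("\\P{script=Latin}", "A"));
    }
}
```

The same inputs can then be run through Perl or ICU to compare results property by property.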
I would add some documentation for the three special script values;
their meaning is not obvious.
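As an illustration of why they need documenting, here is a small sketch. It assumes the three special values are COMMON, INHERITED and UNKNOWN, as in the java.lang.Character.UnicodeScript enum that shipped in Java 7; the class name SpecialScriptsSketch and the helper scriptOf are hypothetical.

```java
// Hypothetical sketch: shows which code points land in the special
// script values, assuming the Java 7+ Character.UnicodeScript API.
class SpecialScriptsSketch {

    // Thin wrapper so the lookups below read uniformly.
    static Character.UnicodeScript scriptOf(int codePoint) {
        return Character.UnicodeScript.of(codePoint);
    }

    public static void main(String[] args) {
        // An ordinary letter belongs to a "real" script.
        System.out.println(scriptOf('A'));

        // A space is shared by many scripts.
        System.out.println(scriptOf(' '));

        // U+0301 COMBINING ACUTE ACCENT takes the script of its base character.
        System.out.println(scriptOf(0x0301));

        // U+FFFF is a noncharacter with no script assignment.
        System.out.println(scriptOf(0xFFFF));
    }
}
```

A sentence per value in the javadoc along these lines (shared, inherited from the base character, not assigned) would probably be enough.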
For implementation, the character matching problem is in general
equivalent to the problem of compiling a switch statement, which is
known to be non-trivial. Guava contains a CharMatcher class that
tries to solve related problems.
http://guava-libraries.googlecode.com/svn/trunk/javadoc/com/google/common/base/CharMatcher.html
I'm thinking scripts and blocks should know which code point ranges they contain.
In particular, \p{BlockName} should not need binary search at
regex compile time or runtime.
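A minimal sketch of what I mean, for the block case only. The class BlockRangeSketch is hypothetical and is not how Character.UnicodeBlock is implemented; the only real data here is the Mongolian block range U+1800..U+18AF.

```java
// Hypothetical sketch: a block that carries its own range, so that
// membership is two comparisons and needs no binary search.
class BlockRangeSketch {
    final String name;
    final int start; // first code point in the block, inclusive
    final int end;   // last code point in the block, inclusive

    BlockRangeSketch(String name, int start, int end) {
        this.name = name;
        this.start = start;
        this.end = end;
    }

    // A block is one contiguous range, so this is all a matcher needs.
    boolean contains(int codePoint) {
        return codePoint >= start && codePoint <= end;
    }

    static final BlockRangeSketch MONGOLIAN =
        new BlockRangeSketch("Mongolian", 0x1800, 0x18AF);

    public static void main(String[] args) {
        System.out.println(MONGOLIAN.contains(0x1820)); // MONGOLIAN LETTER A
        System.out.println(MONGOLIAN.contains('A'));
    }
}
```

Scripts are harder because a script is usually several disjoint ranges, but the same idea applies: the script object would own its list of ranges and the regex node would consult it directly.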
---
There is one place you need to change:
key word => keyword
---
InMongolian => {@code InMongolian}
---
I notice that the current Unicode block support in the JDK has not been
updated to the latest standard.
E.g. Samaritan is missing.
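A quick way to check a given JDK for a block name, as a sketch. It relies only on Character.UnicodeBlock.forName, which throws IllegalArgumentException for a name it does not know; the class BlockCheckSketch and the helper hasBlock are hypothetical names.

```java
// Hypothetical sketch: probes the running JDK for a Unicode block name.
class BlockCheckSketch {

    // True if this JDK's Character.UnicodeBlock recognizes the name.
    static boolean hasBlock(String name) {
        try {
            Character.UnicodeBlock.forName(name);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The result depends on the Unicode version the JDK was built against.
        System.out.println("Samaritan: " + hasBlock("Samaritan"));
        System.out.println("Mongolian: " + hasBlock("Mongolian"));
    }
}
```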
Martin
On Thu, Apr 22, 2010 at 01:01, Xueming Shen <xueming.shen at oracle.com> wrote:
> Hi,
>
> Here is the webrev of the proposal to add Unicode script support in regex
> and j.l.Character.
>
> http://cr.openjdk.java.net/~sherman/script/webrev
>
> and the corresponding blenderrev
>
> http://cr.openjdk.java.net/~sherman/script/blenderrev.html
>
> Please comment on the APIs before I submit the CCC, especially
>
> (1) to use enum for the j.l.Character.UnicodeScript (compared to the
> traditional j.l.c.Subset)
> (2) the piggyback method j.l.c.getName() :-)
> (3) the syntax for script constructs. In addition to the "normal"
> \p{InScriptName} and \P{InScriptName} for the script support
> I'm also adding
> \p{script=ScriptName} \P{script=ScriptName} for the new script support
> \p{block=BlockName} \P{block=BlockName} for the "existing" block support
> \p{general_category=CategoryName} \P{general_category=CategoryName} for
> the "existing" gc
> Perl recently also started to accept this \p{propName=propValue} Unicode
> style.
> It opens the door for future "expanding", for example \p{name=XYZ} :-)
> (4) and of course, the wording.
>
> Thanks,
> Sherman
>
>
>