Unicode script support in Regex and Character class
Xueming Shen
xueming.shen at oracle.com
Sun Apr 25 06:56:44 UTC 2010
Martin Buchholz wrote:
> Providing script support is obvious and non-controversial,
> because other regex programming environments provide it.
> Check that the behavior and syntax of the extension is
> consistent with e.g. ICU, python, and especially perl
> (5.12 just released!)
>
> http://perldoc.perl.org/perlunicode.html
>
\p{propName=propValue} is the Unicode "compound form", which is supported in
perl 5.12. It also has a variant, \p{propName:propValue}. That variant was in
my proposal, but I removed it at the last minute. Two forms (\p{In/IsProp} and
\p{propName=propValue}) should be good enough for now; three is a little too
much. We can always add it later, if desirable.
\p{IsScript}, \p{Isgc}, \p{InBlock} are perl compatible as well.
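
For illustration, here is a minimal sketch of the two forms discussed above. It assumes the syntax as it later shipped in java.util.regex (Java 7 and newer), which may differ in detail from the webrev under review. The class and helper names are made up for this example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only java.util.regex.Pattern is a real JDK API here.
final class ScriptSyntaxDemo {

    // True if the whole input matches the given regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String alpha = "\u03B1"; // GREEK SMALL LETTER ALPHA

        // Script: prefix form and compound form (long and short keyword).
        System.out.println(matches("\\p{IsGreek}", alpha));
        System.out.println(matches("\\p{script=Greek}", alpha));
        System.out.println(matches("\\p{sc=Greek}", alpha));

        // Block: prefix form and compound form.
        System.out.println(matches("\\p{InGreek}", alpha));
        System.out.println(matches("\\p{block=Greek}", alpha));

        // General category: prefix form and compound form.
        System.out.println(matches("\\p{IsLu}", "A"));
        System.out.println(matches("\\p{general_category=Lu}", "A"));

        // Upper-case \P negates the property.
        System.out.println(matches("\\P{script=Greek}", "a"));
    }
}
```

The \p{propName:propValue} variant is deliberately left out of the sketch, since it was dropped from the proposal.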
> I would add some documentation to the three special script values;
> their meaning is not obvious.
>
>
I think it might be better to just leave the detailed explanation to
TR#24. The "script" here in j.l.Character serves only as an id; the API
here should not be the place to explain "what they really are".
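
Even so, a short sketch may help readers see what the three special values look like in practice. It assumes the enum API as it later shipped in Java 7 (Character.UnicodeScript.of), which may not match the webrev exactly. The sample code points were picked for this example; the class and helper names are hypothetical.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical demo class showing the special script values next to a normal one.
final class SpecialScriptsDemo {

    // Thin wrapper so the lookups below read uniformly.
    static UnicodeScript scriptOf(int codePoint) {
        return UnicodeScript.of(codePoint);
    }

    public static void main(String[] args) {
        // An ordinary letter belongs to a concrete script.
        System.out.println(scriptOf('A'));     // LATIN

        // Characters shared by many scripts (space, digits, punctuation).
        System.out.println(scriptOf(' '));     // COMMON

        // Combining marks take the script of their base character.
        System.out.println(scriptOf(0x0301));  // INHERITED (combining acute accent)

        // Private-use, unassigned, surrogate and noncharacter code points.
        System.out.println(scriptOf(0xE000));  // UNKNOWN (private-use code point)
    }
}
```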
> For implementation, the character matching problem is in general
> equivalent to the problem of compiling a switch statement, which is
> known to be non-trivial. Guava contains a CharMatcher class that
> tries to solve related problems.
>
> http://guava-libraries.googlecode.com/svn/trunk/javadoc/com/google/common/base/CharMatcher.html
>
> I'm thinking scripts and blocks should know about which ranges they contain.
> In particular, \p{BlockName} should not need binary search at
> regex compile time or runtime.
>
It definitely is desirable if we can avoid the binary-search lookup, at
least at runtime. The cost will be keeping a separate/redundant
block/script->ranges table in regex itself.
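
To make the trade-off concrete, here is a rough sketch of such a redundant table. It is not what the webrev or java.util.regex does. The class, the table and the compile() helper are all hypothetical, and only three blocks are listed, with ranges copied by hand from the Unicode Blocks.txt data.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.IntPredicate;

// Hypothetical block -> range table kept alongside the regex engine.
final class BlockRangeTable {

    // Block name (upper case) -> {first code point, last code point}.
    private static final Map<String, int[]> RANGES = new HashMap<>();
    static {
        RANGES.put("GREEK", new int[] {0x0370, 0x03FF});     // Greek and Coptic
        RANGES.put("SAMARITAN", new int[] {0x0800, 0x083F});
        RANGES.put("MONGOLIAN", new int[] {0x1800, 0x18AF});
    }

    // Resolve the name once, with a single hash lookup at pattern-compile
    // time, and return a matcher that costs two comparisons per character.
    static IntPredicate compile(String blockName) {
        int[] range = RANGES.get(blockName.toUpperCase(Locale.ROOT));
        if (range == null) {
            throw new IllegalArgumentException("Unknown block: " + blockName);
        }
        final int first = range[0];
        final int last = range[1];
        return cp -> cp >= first && cp <= last;
    }

    public static void main(String[] args) {
        IntPredicate inMongolian = compile("Mongolian");
        System.out.println(inMongolian.test(0x1820)); // MONGOLIAN LETTER A
        System.out.println(inMongolian.test('A'));
    }
}
```

A block is a single contiguous range, so this is the easy case. A script usually covers several disjoint ranges, so the same idea would need a list of ranges per script, and the table would duplicate data that j.l.Character already holds.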
> ---
> There is one place you need to change
> key word => keyword
> ---
> InMongolian => {@code InMongolian}
> ---
>
Good catch, thanks!
> I notice current Unicode block support in JDK is not updated to the
> latest standard.
> E.g. Samaritan is missing.
>
>
The Character class has not been updated to the latest Unicode 5.2.0 yet.
Yuka has a CCC pending for that. My script data is from 5.2.0.
> Martin
>
> On Thu, Apr 22, 2010 at 01:01, Xueming Shen <xueming.shen at oracle.com> wrote:
>
>> Hi,
>>
>> Here is the webrev of the proposal to add Unicode script support in regex
>> and j.l.Character.
>>
>> http://cr.openjdk.java.net/~sherman/script/webrev
>>
>> and the corresponding blenderrev
>>
>> http://cr.openjdk.java.net/~sherman/script/blenderrev.html
>>
>> Please comment on the APIs before I submit the CCC, especially
>>
>> (1) to use enum for the j.l.Character.UnicodeScript (compared to the
>> traditional j.l.c.Subset)
>> (2) the piggyback method j.l.c.getName() :-)
>> (3) the syntax for script constructs. In addition to the "normal"
>> \p{InScriptName} and \P{InScriptName} for the script support
>> I'm also adding
>> \p{script=ScriptName} \P{script=ScriptName} for the new script support
>> \p{block=BlockName} \P{block=BlockName} for the "existing" block support
>> \p{general_category=CategoryName} \P{general_category=CategoryName} for
>> the "existing" gc
>> Perl recently also started to accept this \p{propName=propValue} Unicode
>> style.
>> It opens the door for future "expanding", for example \p{name=XYZ} :-)
>> (4) and of course, the wording.
>>
>> Thanks,
>> Sherman
>>
>>
>>
>>