Unicode script support in Regex and Character class

Xueming Shen xueming.shen at oracle.com
Sun Apr 25 06:56:44 UTC 2010


Martin Buchholz wrote:
> Providing script support is obvious and non-controversial,
> because other regex programming environments provide it.
> Check that the behavior and syntax of the extension is
> consistent with e.g. ICU, python, and especially perl
> (5.12 just released!)
>
> http://perldoc.perl.org/perlunicode.html
>   

\p{propName=propValue} is the Unicode "compound form", which is supported
in perl 5.12. It also has a variant, \p{propName:propValue}. The variant
was in my proposal, but I removed it at the last minute. Two forms
(\p{In/IsProp} and \p{propName=propValue}) should be good enough for now;
three is a little too much. We can always add it later, if desirable.
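
To make the two forms concrete, here is a small sketch of how they would be
used side by side. It assumes the syntax as proposed in the webrev (these are
also the forms java.util.regex.Pattern documents from JDK 7 on). The class
name ScriptSyntaxDemo and the helper matches() are made up for illustration.

```java
import java.util.regex.Pattern;

public class ScriptSyntaxDemo {
    // Hypothetical helper: true if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String greek = "\u03b1\u03b2\u03b3"; // Greek small alpha, beta, gamma

        // Script: compound form and Is-prefixed form.
        System.out.println(matches("\\p{script=Greek}+", greek));
        System.out.println(matches("\\p{IsGreek}+", greek));

        // Block: compound form and In-prefixed form.
        System.out.println(matches("\\p{block=Greek}+", greek));
        System.out.println(matches("\\p{InGreek}+", greek));

        // General category: compound form and Is-prefixed form.
        System.out.println(matches("\\p{general_category=Ll}+", greek));
        System.out.println(matches("\\p{IsLl}+", greek));

        // Upper-case P negates the property.
        System.out.println(matches("\\P{script=Greek}+", "abc"));
    }
}
```

The colon variant \p{propName:propValue} is deliberately not shown, since it
is not part of the proposal.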

\p{IsScript}, \p{Isgc}, \p{InBlock} are perl compatible as well.

> I would add some documentation to the three special script values;
> their meaning is not obvious.
>
>   
I think it might be better to just leave the detailed explanation to
TR#24. The "script" here in j.l.Character serves only as an id; the API
should not be the place to explain "what they really are".
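
For reference, this is roughly how the three special values show up when the
enum is used purely as an id. It is a sketch assuming the API as proposed
(Character.UnicodeScript.of(int), which is how it appears in JDK 7 and
later); the class name ScriptIdDemo and the helper scriptOf() are made up,
and the sample code points are my own choice.

```java
public class ScriptIdDemo {
    // Hypothetical helper: look up the script id of a code point.
    static Character.UnicodeScript scriptOf(int codePoint) {
        return Character.UnicodeScript.of(codePoint);
    }

    public static void main(String[] args) {
        // An ordinary letter belongs to a concrete script.
        System.out.println(scriptOf('A'));
        // A digit is shared across scripts (COMMON).
        System.out.println(scriptOf('1'));
        // A combining mark takes the script of its base character (INHERITED).
        System.out.println(scriptOf(0x0301));
        // A noncharacter has no script assignment (UNKNOWN).
        System.out.println(scriptOf(0xFFFF));
    }
}
```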

> For implementation, the character matching problem is in general
> equivalent to the problem of compiling a switch statement, which is
> known to be non-trivial.  Guava contains a CharMatcher class that
> tries to solve related problems.
>
> http://guava-libraries.googlecode.com/svn/trunk/javadoc/com/google/common/base/CharMatcher.html
>
> I'm thinking scripts and blocks should know about which ranges they contain.
> In particular, \p{BlockName} should not need binary search at
> regex compile time or runtime.
>   
It is definitely desirable to avoid the binary-search lookup, at least at
runtime. The cost is keeping a separate, redundant block/script->ranges
table in regex itself.
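
A minimal sketch of what such a table could look like, just to show the
trade-off; this is not what the webrev does. A block is one contiguous range,
so its check is two comparisons. A script is a short list of disjoint ranges,
so its check scans only that script's own ranges instead of searching the
global table. All names here (BlockRangeSketch, inRange, inAnyRange,
LATIN_SUBSET) are hypothetical, and LATIN_SUBSET is an illustrative subset,
not the full Latin script.

```java
public class BlockRangeSketch {
    // The Greek and Coptic block, U+0370..U+03FF, stored as a precomputed range.
    static final int GREEK_START = 0x0370;
    static final int GREEK_END = 0x03FF;

    // Illustrative subset of Latin-script ranges as {start, end} pairs:
    // A-Z and a-z only. A real table would carry every range of the script.
    static final int[] LATIN_SUBSET = { 0x0041, 0x005A, 0x0061, 0x007A };

    // Block membership: two comparisons, no search.
    static boolean inRange(int cp, int start, int end) {
        return cp >= start && cp <= end;
    }

    // Script membership: linear scan over this script's own range pairs.
    static boolean inAnyRange(int cp, int[] ranges) {
        for (int i = 0; i < ranges.length; i += 2) {
            if (cp >= ranges[i] && cp <= ranges[i + 1]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(inRange(0x03B1, GREEK_START, GREEK_END)); // alpha
        System.out.println(inRange('a', GREEK_START, GREEK_END));
        System.out.println(inAnyRange('q', LATIN_SUBSET));
        System.out.println(inAnyRange('[', LATIN_SUBSET));
    }
}
```

The redundancy is the obvious downside: the ranges duplicate data that
j.l.Character already holds and have to be regenerated with each Unicode
update.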

> ---
> There is one place you need to change
> key word => keyword
> ---
> InMongolian => {@code InMongolian}
> ---
>   

Good catch, thanks!

> I notice current Unicode block support in JDK is not updated to the
> latest standard.
> E.g. Samaritan is missing.
>
>   
The Character class has not been updated to the latest Unicode 5.2.0 yet.
Yuka has a CCC pending for that. My script data is from 5.2.0.


> Martin
>
> On Thu, Apr 22, 2010 at 01:01, Xueming Shen <xueming.shen at oracle.com> wrote:
>   
>> Hi,
>>
>> Here is the webrev of the proposal to add Unicode script support in regex
>> and j.l.Character.
>>
>> http://cr.openjdk.java.net/~sherman/script/webrev
>>
>> and the corresponding blenderrev
>>
>> http://cr.openjdk.java.net/~sherman/script/blenderrev.html
>>
>> Please comment on the APIs before I submit the CCC, especially
>>
>> (1) to use enum for the j.l.Character.UnicodeScript (compared to the
>> traditional j.l.c.Subset)
>> (2) the piggyback method j.l.c.getName() :-)
>> (3) the syntax for script constructs. In addition to the "normal"
>>    \p{InScriptName} and \P{InScriptName} for the script support
>>    I'm also adding
>>   \p{script=ScriptName} \P{script=ScriptName}  for the new script support
>>   \p{block=BlockName} \P{block=BlockName}  for the "existing" block support
>>   \p{general_category=CategoryName} \P{general_category=CategoryName} for
>> the "existing" gc
>>   Perl recently also started to accept this  \p{propName=propValue} Unicode
>> style.
>>   It opens the door for future "expanding", for example \p{name=XYZ} :-)
>> (4)and of course, the wording.
>>
>> Thanks,
>> Sherman
>>
>>
>>
>>     



