Unicode script support in Regex and Character class

Xueming Shen xueming.shen at oracle.com
Mon Apr 26 05:28:56 UTC 2010


Can I assume we are all OK with at least the API part of the latest
webrev/blenderrev of the script support in j.l.Character and
j.u.r.Pattern, including j.l.Character.getName()?

http://cr.openjdk.java.net/~sherman/script/blenderrev.html
http://cr.openjdk.java.net/~sherman/script/webrev

Okutsu-san, Yuka, can one of you help review the corresponding CCC at
http://ccc.sfbay.sun.com/6945564?

This is for the j.l.Character part only. I'm still trying to figure out
how to take over ownership of 4860714 in the CCC system; we have had a
placeholder for it in CCC since 2003.

Thanks,
-Sherman




Xueming Shen wrote:
> Martin Buchholz wrote:
>> Providing script support is obvious and non-controversial,
>> because other regex programming environments provide it.
>> Check that the behavior and syntax of the extension is
>> consistent with e.g. ICU, python, and especially perl
>> (5.12 just released!)
>>
>> http://perldoc.perl.org/perlunicode.html
>>   
>
> \p{propName=propValue} is the Unicode "compound form", which is
> supported in perl 5.12. It also has a variant, \p{propName:propValue}.
> That variant was in my proposal, but I removed it at the last minute.
> Two forms (\p{In/IsProp} and \p{propName=propValue}) should be good
> enough for now. Three is a little too much. We can always add it
> later, if desirable.
>
> \p{IsScript}, \p{Isgc}, \p{InBlock} are perl compatible as well.
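
As an illustration of the forms named above, here is a minimal sketch. It assumes the syntax documented for java.util.regex.Pattern in Java 7 and later, which may differ in detail from the webrev under review; the class name ScriptSyntaxDemo and the helper matches() are made up for the example.

```java
import java.util.regex.Pattern;

public class ScriptSyntaxDemo {
    // Hypothetical helper: true when the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String greek = "\u03b1\u03b2\u03b3"; // alpha, beta, gamma

        // Script: Is-prefixed form and propName=propValue form.
        System.out.println(matches("\\p{IsGreek}+", greek));
        System.out.println(matches("\\p{script=Greek}+", greek));

        // Block: In-prefixed form and propName=propValue form.
        System.out.println(matches("\\p{InGreek}+", greek));
        System.out.println(matches("\\p{block=Greek}+", greek));

        // General category: Ll is "lowercase letter".
        System.out.println(matches("\\p{general_category=Ll}+", greek));

        // Upper-case \P negates the property.
        System.out.println(matches("\\P{script=Greek}+", "abc"));
    }
}
```

Every call in main() is expected to report a match; the \p{propName:propValue} variant is deliberately left out, since it was dropped from the proposal.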
>
>> I would add some documentation to the three special script values;
>> their meaning is not obvious.
>>
>>   
> I think it might be better to just leave the detailed explanation to
> TR#24. The "script" here in j.l.Character serves only as an id; the
> API here should not be the place to explain "what they really are".
>
>> For implementation, the character matching problem is in general
>> equivalent to the problem of compiling a switch statement, which is
>> known to be non-trivial.  Guava contains a CharMatcher class that
>> tries to solve related problems.
>>
>> http://guava-libraries.googlecode.com/svn/trunk/javadoc/com/google/common/base/CharMatcher.html 
>>
>>
>> I'm thinking scripts and blocks should know about which ranges they 
>> contain.
>> In particular, \p{BlockName} should not need binary search at
>> regex compile time or runtime.
>>   
> It is definitely desirable to avoid the binary-search lookup, at
> least at runtime. The cost will be keeping a separate/redundant
> block/script->ranges table in regex itself.
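
The trade-off can be sketched roughly as follows. This is not the regex implementation: RangeMatcher and the hard-coded Mongolian bounds (U+1800..U+18AF) are hypothetical stand-ins for the "separate/redundant" table mentioned above, and Character.UnicodeBlock.of() stands in for the general lookup path.

```java
public class BlockRangeDemo {
    // General path: ask the JDK which block the code point belongs to.
    // The lookup cost is paid on every call.
    static boolean inBlockByLookup(int cp, Character.UnicodeBlock block) {
        return Character.UnicodeBlock.of(cp) == block;
    }

    // Hypothetical range-aware matcher: the bounds are captured once,
    // so each match is just two integer comparisons.
    static final class RangeMatcher {
        final int lo;
        final int hi;

        RangeMatcher(int lo, int hi) {
            this.lo = lo;
            this.hi = hi;
        }

        boolean matches(int cp) {
            return cp >= lo && cp <= hi;
        }
    }

    // Assumed bounds of the Mongolian block, duplicated here on purpose:
    // this duplication is the cost of skipping the lookup.
    static final RangeMatcher MONGOLIAN = new RangeMatcher(0x1800, 0x18AF);

    public static void main(String[] args) {
        // Both paths should agree on every code point around the block.
        boolean agree = true;
        for (int cp = 0x1700; cp <= 0x1900; cp++) {
            agree &= MONGOLIAN.matches(cp)
                    == inBlockByLookup(cp, Character.UnicodeBlock.MONGOLIAN);
        }
        System.out.println("paths agree: " + agree);
    }
}
```

A single block is one contiguous range, so the redundant data is small; a script generally spans many ranges, which is where the extra table would grow.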
>
>> ---
>> There is one place you need to change
>> key word => keyword
>> ---
>> InMongolian => {@code InMongolian}
>> ---
>>   
>
> Good catch, thanks!
>
>> I notice current Unicode block support in JDK is not updated to the
>> latest standard.
>> E.g. Samaritan is missing.
>>
>>   
> The Character class has not been updated to the latest 5.2.0 yet. Yuka
> has a CCC pending for that. My script data is from 5.2.0.
>
>
>> Martin
>>
>> On Thu, Apr 22, 2010 at 01:01, Xueming Shen <xueming.shen at oracle.com> wrote:
>>  
>>> Hi,
>>>
>>> Here is the webrev of the proposal to add Unicode script support in
>>> regex and j.l.Character.
>>>
>>> http://cr.openjdk.java.net/~sherman/script/webrev
>>>
>>> and the corresponding blenderrev
>>>
>>> http://cr.openjdk.java.net/~sherman/script/blenderrev.html
>>>
>>> Please comment on the APIs before I submit the CCC, especially
>>>
>>> (1) to use enum for the j.l.Character.UnicodeScript (compared to the
>>>     traditional j.l.c.Subset)
>>> (2) the piggyback method j.l.c.getName() :-)
>>> (3) the syntax for script constructs. In addition to the "normal"
>>>     \p{InScriptName} and \P{InScriptName} for the script support,
>>>     I'm also adding
>>>       \p{script=ScriptName} \P{script=ScriptName}
>>>         for the new script support
>>>       \p{block=BlockName} \P{block=BlockName}
>>>         for the "existing" block support
>>>       \p{general_category=CategoryName} \P{general_category=CategoryName}
>>>         for the "existing" gc
>>>     Perl recently also started to accept this \p{propName=propValue}
>>>     Unicode style. It opens the door for future "expanding", for
>>>     example \p{name=XYZ} :-)
>>> (4) and of course, the wording.
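
For items (1) and (2), a minimal sketch of how the two APIs might be called. It assumes the names that appear in Java 7 and later (Character.UnicodeScript.of and Character.getName), which may not match the webrev exactly; the class ScriptLookupDemo and its helpers are hypothetical.

```java
public class ScriptLookupDemo {
    // Hypothetical helper: the script a code point is assigned to.
    static Character.UnicodeScript scriptOf(int codePoint) {
        return Character.UnicodeScript.of(codePoint);
    }

    // Hypothetical helper: the Unicode name of a code point.
    static String nameOf(int codePoint) {
        return Character.getName(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(scriptOf('A'));     // LATIN
        System.out.println(scriptOf(0x03B1));  // GREEK
        System.out.println(scriptOf('1'));     // COMMON (shared across scripts)
        System.out.println(scriptOf(0x0301));  // INHERITED (combining acute accent)
        System.out.println(nameOf('A'));       // LATIN CAPITAL LETTER A
    }
}
```

Because the script is an enum constant, callers can compare with == or switch on it, which is the main practical difference from a Subset-style class.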
>>>
>>> Thanks,
>>> Sherman
>>>
>>>
>>>
>>>     
>
>



