Unicode script support in Regex and Character class

Xueming Shen xueming.shen at oracle.com
Fri Apr 30 19:48:13 UTC 2010


Hi,

#4860714 has been closed as a duplicate of my newly created #6948903
(to work around an internal process problem). #6948903 now covers the
regex script support.

So here are the CCC drafts for

6945564: Unicode script support in Character class
6948903: Make Unicode scripts available for use in regular expressions

http://cr.openjdk.java.net/~sherman/script/6948903.htm
http://cr.openjdk.java.net/~sherman/script/6945564.htm

The blenderrevs are

http://cr.openjdk.java.net/~sherman/script/blenderrev_pattern.html
http://cr.openjdk.java.net/~sherman/script/blenderrev_ch.html

Thanks,
Sherman


Xueming Shen wrote:
>
> Can I assume we are all OK with at least the API part of the latest 
> webrev/blenderrev of
> the script support in j.l.Character and j.u.r.Pattern, including 
> j.l.Character.getName()?
>
> http://cr.openjdk.java.net/~sherman/script/blenderrev.html
> http://cr.openjdk.java.net/~sherman/script/webrev
>
> Okutsu-san, Yuka, can one of you help review the corresponding CCC at
> http://ccc.sfbay.sun.com/6945564?
>
> This is for the j.l.Character part only. I'm still trying to figure 
> out how to take over
> ownership of 4860714 in the CCC system; we have had a placeholder for 
> it in
> CCC since 2003.
>
> Thanks,
> -Sherman
>
>
>
>
> Xueming Shen wrote:
>> Martin Buchholz wrote:
>>> Providing script support is obvious and non-controversial,
>>> because other regex programming environments provide it.
>>> Check that the behavior and syntax of the extension is
>>> consistent with e.g. ICU, python, and especially perl
>>> (5.12 just released!)
>>>
>>> http://perldoc.perl.org/perlunicode.html
>>>   
>>
>> \p{propName=propValue} is the Unicode "compound form", which is 
>> supported in
>> perl 5.12. It also has a variant, \p{propName:propValue}. That variant 
>> was in my proposal,
>> but I removed it at the last minute. Two forms (\p{In/IsProp} and 
>> \p{propName=propValue})
>> should be good enough for now; three is a little too much. We can 
>> always add it
>> later, if desirable.
>>
>> \p{IsScript}, \p{Isgc}, \p{InBlock} are perl compatible as well.
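For readers who want to try the two forms discussed above, here is a minimal sketch. It assumes a JDK in which this proposal shipped (Java 7 or later); the class and helper names are made up for the example and are not taken from the webrev.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only java.util.regex.Pattern is real API here.
class ScriptSyntaxDemo {
    // Returns true if the whole input matches the given regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String alpha = "\u03B1"; // GREEK SMALL LETTER ALPHA

        // Script: prefix form and compound form.
        System.out.println(matches("\\p{IsGreek}", alpha));
        System.out.println(matches("\\p{script=Greek}", alpha));

        // Block: prefix form and compound form.
        System.out.println(matches("\\p{InGreek}", alpha));
        System.out.println(matches("\\p{block=Greek}", alpha));

        // General category (Ll = lowercase letter): prefix and compound form.
        System.out.println(matches("\\p{IsLl}", alpha));
        System.out.println(matches("\\p{general_category=Ll}", alpha));

        // Capital P negates: a Latin letter is not in the Greek script.
        System.out.println(matches("\\P{IsGreek}", "a"));
    }
}
```

Every line should print true on such a JDK.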
>>
>>> I would add some documentation to the three special script values;
>>> their meaning is not obvious.
>>>
>>>   
>> I think it might be better to just leave the detailed explanation to 
>> TR#24. The "script"
>> here in j.l.Character serves only as an id; the API 
>> should not be the place
>> to explain "what they really are".
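As a quick probe rather than a definition (TR#24 remains the authority), the sketch below shows which code points land in the three special values. It assumes the enum as it eventually shipped in Java 7, where the values are named COMMON, INHERITED and UNKNOWN; the class and helper names are invented for the example.

```java
// Hypothetical demo class; Character.UnicodeScript is the real API.
class SpecialScriptsDemo {
    // Thin wrapper so the lookups below read uniformly.
    static Character.UnicodeScript scriptOf(int codePoint) {
        return Character.UnicodeScript.of(codePoint);
    }

    public static void main(String[] args) {
        // A letter that belongs to one specific script.
        System.out.println(scriptOf('A'));
        // A digit shared by many scripts.
        System.out.println(scriptOf('1'));
        // U+0301 COMBINING ACUTE ACCENT takes the script of its base character.
        System.out.println(scriptOf(0x0301));
        // U+E000 is a private-use code point with no script assignment.
        System.out.println(scriptOf(0xE000));
    }
}
```

The four lookups are chosen to hit an ordinary script, COMMON, INHERITED and UNKNOWN in that order.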
>>
>>> For implementation, the character matching problem is in general
>>> equivalent to the problem of compiling a switch statement, which is
>>> known to be non-trivial.  Guava contains a CharMatcher class that
>>> tries to solve related problems.
>>>
>>> http://guava-libraries.googlecode.com/svn/trunk/javadoc/com/google/common/base/CharMatcher.html 
>>>
>>>
>>> I'm thinking scripts and blocks should know about which ranges they 
>>> contain.
>>> In particular, \p{BlockName} should not need binary search at
>>> regex compile time or runtime.
>>>   
>> It is definitely desirable to avoid the binary-search lookup, at 
>> least at runtime. The
>> cost would be keeping a separate, redundant block/script->ranges table 
>> in regex itself.
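One way to picture the trade-off: a Unicode block is a single contiguous range, so a matcher that already knows the endpoints needs only two comparisons per character. The sketch below is illustrative only. RangeMatcher and the hard-coded Mongolian range (U+1800..U+18AF) are assumptions made for the example, not code from the webrev, and the sketch checks itself against the existing Character.UnicodeBlock.of lookup.

```java
// Hypothetical sketch of a per-block range matcher.
class BlockRangeSketch {
    // A block is one contiguous, inclusive range of code points.
    static final class RangeMatcher {
        private final int start;
        private final int end;

        RangeMatcher(int start, int end) {
            this.start = start;
            this.end = end;
        }

        // Two comparisons, no table search.
        boolean matches(int codePoint) {
            return codePoint >= start && codePoint <= end;
        }
    }

    // Assumed range for the Mongolian block, hard-coded for illustration.
    static final RangeMatcher MONGOLIAN = new RangeMatcher(0x1800, 0x18AF);

    // Reference answer through the general-purpose lookup in j.l.Character.
    static boolean viaLookup(int codePoint) {
        return Character.UnicodeBlock.of(codePoint) == Character.UnicodeBlock.MONGOLIAN;
    }

    // Counts code points on which the two approaches disagree.
    static int countMismatches() {
        int mismatches = 0;
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (MONGOLIAN.matches(cp) != viaLookup(cp)) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        System.out.println("mismatches: " + countMismatches());
    }
}
```

A script, unlike a block, usually covers several disjoint ranges, so the same idea would need a list of ranges per script. That list is the redundant table mentioned above.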
>>
>>> ---
>>> There is one place you need to change
>>> key word => keyword
>>> ---
>>> InMongolian => {@code InMongolian}
>>> ---
>>>   
>>
>> Good catch, thanks!
>>
>>> I notice current Unicode block support in JDK is not updated to the
>>> latest standard.
>>> E.g. Samaritan is missing.
>>>
>>>   
>> The Character class has not been updated to the latest Unicode 5.2.0 
>> yet. Yuka has a CCC pending for
>> that. My script data is from 5.2.0.
>>
>>
>>> Martin
>>>
>>> On Thu, Apr 22, 2010 at 01:01, Xueming Shen 
>>> <xueming.shen at oracle.com> wrote:
>>>  
>>>> Hi,
>>>>
>>>> Here is the webrev of the proposal to add Unicode script support in 
>>>> regex
>>>> and j.l.Character.
>>>>
>>>> http://cr.openjdk.java.net/~sherman/script/webrev
>>>>
>>>> and the corresponding blenderrev
>>>>
>>>> http://cr.openjdk.java.net/~sherman/script/blenderrev.html
>>>>
>>>> Please comment on the APIs before I submit the CCC, especially
>>>>
>>>> (1) to use enum for the j.l.Character.UnicodeScript (compared to the
>>>> traditional j.l.c.Subset)
>>>> (2) the piggyback method j.l.c.getName() :-)
>>>> (3) the syntax for script constructs. In addition to the "normal"
>>>>    \p{InScriptName} and \P{InScriptName} for the script support
>>>>    I'm also adding
>>>>   \p{script=ScriptName} \P{script=ScriptName}  for the new script 
>>>> support
>>>>   \p{block=BlockName} \P{block=BlockName}  for the "existing" block 
>>>> support
>>>>   \p{general_category=CategoryName} 
>>>> \P{general_category=CategoryName} for
>>>> the "existing" gc
>>>>   Perl recently also started to accept this  \p{propName=propValue} 
>>>> Unicode
>>>> style.
>>>>   It opens the door for future "expanding", for example 
>>>> \p{name=XYZ} :-)
>>>> (4) and of course, the wording.
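To make items (1) and (2) concrete, here is a small sketch written against the API as it later shipped in Java 7 (Character.UnicodeScript and Character.getName(int)); the draft under review at the time may have differed in detail. The class and helper names are invented for the example.

```java
// Hypothetical demo class; the Character methods used are real Java 7+ API.
class CharacterApiDemo {
    // Combines the script and the Unicode name of a code point.
    static String describe(int codePoint) {
        Character.UnicodeScript script = Character.UnicodeScript.of(codePoint);
        return script + ": " + Character.getName(codePoint);
    }

    // Because UnicodeScript is an enum, it can be used directly in a switch.
    static boolean isGreekOrLatin(int codePoint) {
        switch (Character.UnicodeScript.of(codePoint)) {
            case GREEK:
            case LATIN:
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        int alpha = 0x03B1;
        System.out.println(describe(alpha));
        System.out.println(isGreekOrLatin(alpha));
        // forName ignores case, so "Greek" resolves to the GREEK constant.
        System.out.println(Character.UnicodeScript.forName("Greek"));
    }
}
```

The switch is the main practical difference from a j.l.c.Subset-style constant: the compiler knows the full set of enum values.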
>>>>
>>>> Thanks,
>>>> Sherman
>>>>
>>>>
>>>>
>>>>     
>>
>>
>
>