RFR: JDK-8143282: \p{Cn} unassigned code points should be included in \p{C}

Xueming Shen xueming.shen at oracle.com
Fri May 20 18:10:28 UTC 2016


On 5/20/16 10:13 AM, Martin Buchholz wrote:
> On Fri, May 20, 2016 at 9:55 AM, Xueming Shen <xueming.shen at oracle.com> wrote:
>>> I expected to see general category Other "C" in Character.java
>>
>> can open a rfe for that if needed.
> Well, don't we want complete correspondence between Unicode standard,
> Character, and regex?
> Anything missing seems like a bug, not rfe!
>
>>> I'd like to see tests that p{C} is the same as p{Other} is the same as
>>> p{isOther} and similar with other categories.
>>
>> Did you mean you want to add the "long name" support for unicode category?
> I expect \p{C} and \p{Other} and \p{isOther} all to work (haven't tried it).
> Is that not a reasonable expectation?
>

I'm a big fan of regex unicode support :-)

LC/L/M/N/P/S/Z/C are special gc values: they are "groupings of related
gc values". While j.l.Character does support the general category via
getType() == gc_xyz, it does not explicitly support such grouping
values, for an obvious reason (1:2). We have various isXXXXX() methods,
but they are not specified as equivalent to those gc groupings. So yes,
it would be an rfe if you want them supported in the j.l.Character
class (such as Character.isType(int cp, int type)).
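
To illustrate what such a grouping check could look like on top of
getType(), here is a minimal sketch. GcGroups and isOther() are
hypothetical names used only for this example; they are not part of
j.l.Character, and this is not a proposal for the actual API shape.

```java
// Hypothetical helper, for illustration only; not a JDK class.
final class GcGroups {
    private GcGroups() {}

    // Returns true if cp's general category belongs to the grouping
    // "C" (Other), i.e. one of Cc, Cf, Cs, Co, Cn.
    static boolean isOther(int cp) {
        switch (Character.getType(cp)) {
            case Character.CONTROL:      // Cc
            case Character.FORMAT:       // Cf
            case Character.SURROGATE:    // Cs
            case Character.PRIVATE_USE:  // Co
            case Character.UNASSIGNED:   // Cn
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        int[] samples = { 0x0000, 0x00AD, 0xD800, 0xE000, 0xFFFF, 'A' };
        for (int cp : samples) {
            System.out.printf("U+%04X isOther=%b%n", cp, isOther(cp));
        }
    }
}
```

The point is only that one grouping value maps to several getType()
constants, so a single equality test against getType() cannot express
it.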

Though support for lots of properties has been added to j.u.regex, it's
still not complete. Names such as "Other" and "Number" are not
supported yet. "Letter" is ok as it falls back to the same posix name.
The list of supported "property names" is limited, as listed with
CharPredicates.defUProp. It should be fine to add "Other" and the other
"long names" for the unicode gc, but it appears we might have a name
space conflict for "letter" (posix or unicode). It should be safe to do
\p{gc=xyz}. Again, it is more like an rfe now :-)
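
A quick way to see which spellings a given JDK accepts is to try
compiling them. The sketch below assumes nothing beyond
Pattern.compile() and PatternSyntaxException; PropertyNameProbe and
compiles() are made-up names for this example. Results for the long
names will depend on the JDK build, so no output is claimed for them
here.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical probe class, for illustration only.
final class PropertyNameProbe {
    private PropertyNameProbe() {}

    // Returns true if the regex compiles, false if Pattern rejects it
    // (for example because of an unknown property name).
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] candidates = {
            "\\p{C}",        // short gc grouping name
            "\\p{gc=C}",     // explicit general_category form
            "\\p{Other}",    // long name
            "\\p{IsOther}",  // long name with Is prefix
            "\\p{Number}",   // long name
            "\\p{IsLetter}"  // may resolve to a non-gc property
        };
        for (String regex : candidates) {
            System.out.println(regex + " -> compiles=" + compiles(regex));
        }
    }
}
```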

-Sherman


