<i18n dev> Java Regexes vs Unicode Regexes

Xueming Shen xueming.shen at oracle.com
Thu Jan 20 13:53:01 PST 2011


On 01/20/2011 12:55 PM, Tom Christiansen wrote:
> Sherman wrote:
>
>> At the end, Java RegEx is NOT a Unicode RegEx, while it
>> supports Unicode RegEx at certain level, sometime via different
>> syntax, I don't feel this is a big problem for most Java
>> developers and should not be a stopper for most program.
> I do not understand what you mean when you say that Java regexes
> aren't Unicode regexes.  Are you referring to the various
> syntactic features of  UTS 18, Unicode Regular Expressions?
> If so, it's my understanding that many of those are examples
> only, especially when it comes to how something actually looks.
>
> I fully agree with you that Java indeed offers some of the
> functionality described there in other ways than given by those
> particular examples, and that quite often this doesn't make
> enough practical difference as to be a show-stopper.  I discuss
> this further later on down in this message.
>
> Another possible interpretation of:
>
>> Java RegEx is NOT a Unicode RegEx, while it
>> supports Unicode RegEx at certain level,
> is that you are saying that the standard Java regex
> class does not provide the baseline Level 1 Unicode
> support spelled out in UTS#18, then I'm afraid you
> are again correct.
>
> However, I would very much like to see this fixed.  That's
> because Level 1 support is the absolute minimum level required
> for useful Unicode support. To quote from UTS#18:

Hi Tom,

That is NOT what I'm saying.

Java RegEx is supposed to be "in conformance with level 1 of UTS#18
plus RL2.1 Canonical Equivalents", so anything defined in UTS#18 Level 1
should be supported by Java RegEx, though not necessarily with the exact
syntax defined or recommended by UTS#18, and not necessarily out of the
box. For example, to get Unicode case-insensitive matching you have to
specify a particular flag to turn it on, mainly for performance reasons.
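
As a rough illustration of the "flag" point (a sketch only, not code
from this thread; the class name UnicodeCaseDemo and the helper
matches() are made up for the example), the flags in question are
presumably Pattern.UNICODE_CASE for case-insensitive matching and
Pattern.CANON_EQ for canonical equivalence, both documented on
java.util.regex.Pattern:

```java
import java.util.regex.Pattern;

public class UnicodeCaseDemo {
    // Hypothetical helper: true if the whole input matches regex under the given flags.
    static boolean matches(String regex, String input, int flags) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String lowerSigma = "\u03c3"; // GREEK SMALL LETTER SIGMA
        String upperSigma = "\u03a3"; // GREEK CAPITAL LETTER SIGMA

        // CASE_INSENSITIVE alone only folds US-ASCII characters.
        System.out.println(matches(lowerSigma, upperSigma,
                Pattern.CASE_INSENSITIVE));                         // false

        // Adding UNICODE_CASE turns on Unicode-aware case folding.
        System.out.println(matches(lowerSigma, upperSigma,
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE));  // true

        // The same two flags written as embedded flag expressions.
        System.out.println(matches("(?iu)" + lowerSigma, upperSigma, 0)); // true

        // Canonical equivalence is opt-in as well: "a" + COMBINING RING ABOVE
        // matches the precomposed U+00E5 only when CANON_EQ is set.
        System.out.println(matches("a\u030A", "\u00E5", 0));                // false
        System.out.println(matches("a\u030A", "\u00E5", Pattern.CANON_EQ)); // true
    }
}
```

Keeping both behaviors behind flags means patterns that only deal with
ASCII do not pay for the extra folding or normalization work.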

I would really appreciate it if you could provide the details of what is
missing from our Level 1 support. Since that would be a violation of the
specification, I can definitely put it on my high-priority list to work
on. Script support is one Level 1 requirement that is missing from our
latest release, but I have added it in the upcoming JDK 7. I'm sure
there are bugs and corner cases here and there, even though we have lots
of tests that are supposed to cover everything :-)
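
For reference, a small sketch of what the script support looks like in
JDK 7 and later. The message itself only says that the feature was
added; the \p{IsGreek}, \p{script=Greek} and \p{sc=Greek} spellings are
taken from the java.util.regex.Pattern documentation, and the class name
ScriptPropertyDemo and helper allInScript() are invented for this
example:

```java
import java.util.regex.Pattern;

public class ScriptPropertyDemo {
    // Hypothetical helper: true if input is non-empty and every character
    // belongs to the named Unicode script (requires JDK 7 or later).
    static boolean allInScript(String script, String input) {
        return Pattern.compile("\\p{Is" + script + "}+").matcher(input).matches();
    }

    public static void main(String[] args) {
        String greek = "\u03b1\u03b2\u03b3"; // alpha, beta, gamma

        System.out.println(allInScript("Greek", greek)); // true
        System.out.println(allInScript("Greek", "abc")); // false
        System.out.println(allInScript("Latin", "abc")); // true

        // The same property written with the long and short keyword forms.
        System.out.println(Pattern.matches("\\p{script=Greek}+", greek)); // true
        System.out.println(Pattern.matches("\\p{sc=Greek}+", greek));     // true
    }
}
```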

I have been dedicated to Java I18n work for years, so I fully understand
how important Unicode is, especially for Java as a platform. It is our
goal to have Java provide the most useful Unicode support, and the last
thing I would say is "go pick another language/platform". No, I don't
feel any offense at all. In fact, we really appreciate these useful
comments, suggestions, and expertise, which will definitely help evolve
the platform.

-Sherman



