<i18n dev> Java Regexes vs Unicode Regexes

Thu Jan 20 12:55:25 PST 2011

Sherman wrote:

> At the end, Java RegEx is NOT a Unicode RegEx, while it
> supports Unicode RegEx at certain level, sometime via different
> syntax, I don't feel this is a big problem for most Java
> developers and should not be a stopper for most program.

I do not understand what you mean when you say that Java regexes
aren't Unicode regexes.  Are you referring to the various
syntactic features of UTS#18, Unicode Regular Expressions?
If so, it's my understanding that many of those are examples
only, especially when it comes to how something actually looks.

I fully agree with you that Java indeed offers some of the
functionality described there in other ways than given by those
particular examples, and that quite often this doesn't make
enough practical difference as to be a show-stopper.  I discuss
this further later on down in this message.
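
As a small illustration of functionality reached "in other ways", here
is a minimal sketch, assuming the default behaviour of java.util.regex
with no flags set: the \w shorthand matches only ASCII word characters,
while the Unicode property \p{L} matches letters from any script.  The
class and helper names are made up for this example.

```java
import java.util.regex.Pattern;

class WordCharDemo {
    // Hypothetical helper: true if the entire input matches the regex.
    static boolean fullyMatches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // "naive" spelled with U+00EF (i with diaeresis) as its third letter
        String word = "na\u00EFve";

        // Default \w is ASCII-only, so the non-ASCII letter breaks the match.
        System.out.println("\\w+     : " + fullyMatches("\\w+", word));

        // \p{L} is the Unicode general category Letter, so this one matches.
        System.out.println("\\p{L}+  : " + fullyMatches("\\p{L}+", word));
    }
}
```

So the capability is there, but only if you know to avoid the
shorthand and spell out the property by hand.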

Another possible interpretation of:

> Java RegEx is NOT a Unicode RegEx, while it
> supports Unicode RegEx at certain level, 

is that you are saying that the standard Java regex
class does not provide the baseline Level 1 Unicode
support spelled out in UTS#18.  If so, then I'm afraid
you are again correct.

However, I would very much like to see this fixed.  That's
because Level 1 support is the absolute minimum level required
for useful Unicode support.  To quote from UTS#18:

    Level 1 is the minimally useful level of support for Unicode.
    All regex implementations dealing with Unicode should be at
    least at Level 1.

I believe it is *extremely important* that Java provide useful Unicode
support.  In my text-mining group at the university here, we process
megabytes and sometimes gigabytes of UTF-8 text with Java.  And we use
regexes.  For us it is a *very* big problem that Java does not provide
even the minimally required Level 1 support, because there is only so
much you can do to work around this; that's why they call Level 1
"minimally useful".

Because Java's native character set is and always has been
Unicode, I feel it is reasonable to hope that Java should
provide the minimally useful level of support for Unicode.  The
exponential(*) growth in the proportion of Unicode text data over
the last decade means that Java is suddenly not well-suited to
handle this data.  This is a real shame.  

    (*) I use here the term "exponential growth" purely in its
        mathematically strict sense, not in the more commonly
        heard popular sense of merely growing faster than expected.

There is no question that Java is the premier platform of choice
for millions of people doing real work.  Because of the shocking
growth rate of Unicode, it has "suddenly" come time that the whole
Java infrastructure fully support basic Unicode, just as much as 
it does ASCII.  That's what the future is, and the future is now.

It is not enough to say "Oh well, use another language then," because
that is not a viable option for programming shops that are fully committed
to Java as a programming platform.  That's why it needs to be there.

In later messages I'll discuss the particulars of precisely where Java
already does manage Level 1 Unicode support, where it is missing out, 
and what needs to be done to bring it into not merely compliance but
also usefulness--which I actually hold to be of greater importance.

Sherman, please understand that I am not asking you to do all this
work!! That would be not just impolite but also impractical and possibly
even impossible.  I do recognize that it is too much for just one
person.  I do not ask that.  I just want to detail where the holes are.
I very much hope you will take no offence at this; I assure you that
absolutely none is intended!

--tom