<i18n dev> Java and Unicode

Sat Dec 11 23:01:29 PST 2010

  Hi Tom,

Thanks for looking into the Unicode support issues in Java RegEx.

Since you have been working on Unicode for the past decade, I'm sure you
understand that most of the issues you are pointing out here belong to
"Extended Unicode Support: Level 2" as documented in UTS#18 Unicode
Regular Expressions [2]. Unfortunately the current Java RegEx
implementation only supports "Basic Unicode Support: Level 1", as
specified in the java.util.regex.Pattern API document [1].

I'm aware of and impressed by the Unicode support added "recently" in
Perl 6, and was planning to close the gap in JDK7 (basically Java RegEx
is an implementation that "matches" Perl 5). Due to resource issues I
only managed to add script and name support in RegEx and the Character
class. I hope I can add more in the next couple of months; otherwise the
rest will be deferred to JDK8 (\X is probably the most important one next
on my list).
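
To make the script and name support concrete, here is a minimal sketch,
assuming a JDK 7 or later runtime. The class ScriptAndNameDemo and its
helper allInScript are made up for this illustration; the \p{script=...}
syntax, Character.UnicodeScript and Character.getName are the pieces
that come from the platform.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for illustration; not part of the JDK.
class ScriptAndNameDemo {
    // True if every character of s belongs to the named Unicode script.
    static boolean allInScript(String script, String s) {
        return Pattern.matches("\\p{script=" + script + "}+", s);
    }

    public static void main(String[] args) {
        String alpha = "\u03B1"; // GREEK SMALL LETTER ALPHA
        int cp = alpha.codePointAt(0);

        System.out.println(allInScript("Greek", alpha)); // true
        System.out.println(allInScript("Latin", alpha)); // false

        // The same information is available outside RegEx.
        System.out.println(Character.UnicodeScript.of(cp)); // GREEK
        System.out.println(Character.getName(cp));
    }
}
```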

Regarding the POSIX properties: in Java RegEx the Unicode Alphabetic,
Lowercase, Uppercase and Whitespace properties are supported via
\p{javaLetter}, \p{javaLowerCase}, \p{javaUpperCase} and
\p{javaWhitespace}. The \p{Lower/Upper/ASCII/Alpha...} classes, as you
noticed, are clearly specified by the Java RegEx specification [1] to be
US-ASCII only (does Perl 5 work this way as well?). This is by design,
and I don't agree with the "this is a mess" conclusion. While there are
developers out there who might like these properties to evolve into the
Unicode properties, I am pretty sure there are just as many developers
who would prefer that they be kept as the "original" POSIX properties. In
the end, Java RegEx is NOT a Unicode RegEx; while it supports Unicode
RegEx at a certain level, sometimes via different syntax, I don't feel
this is a big problem for most Java developers, and it should not be a
blocker for most programs.
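
The difference between the two families of properties can be sketched as
follows. This assumes patterns compiled with the default flags; the
class PosixVsJavaProperties and its helper are hypothetical names used
only for this example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for illustration; not part of the JDK.
class PosixVsJavaProperties {
    // True if the whole string s matches the given character class.
    static boolean matches(String charClass, String s) {
        return Pattern.matches(charClass, s);
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9";  // LATIN SMALL LETTER E WITH ACUTE
        String emSpace = "\u2003"; // EM SPACE

        // POSIX classes are US-ASCII only by specification.
        System.out.println(matches("\\p{Lower}", "e"));     // true
        System.out.println(matches("\\p{Lower}", eAcute));  // false
        System.out.println(matches("\\p{Alpha}", eAcute));  // false
        System.out.println(matches("\\p{Space}", emSpace)); // false

        // The java* classes delegate to java.lang.Character methods.
        System.out.println(matches("\\p{javaLowerCase}", eAcute));   // true
        System.out.println(matches("\\p{javaLetter}", eAcute));      // true
        System.out.println(matches("\\p{javaWhitespace}", emSpace)); // true
    }
}
```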

I would also like to point out that Java is NOT a RegEx-based
language/platform. RegEx is not part of the Java language (I mean the
language specification); it is one of the utility packages in the Java
platform's core libraries. So even if certain Unicode properties are not
yet supported by Java RegEx, it does not mean they are not supported by
the platform; you should be able to access those Unicode properties via
the java.lang.Character class [4]. So I would strongly disagree with the
comment that "Java’s Unicode property support is *strictly
antemillennial*, by which I mean it supports no Unicode property that has
come out in the last decade." :-) Even Java RegEx is NOT that bad: the
script, block and category property support is pretty "up to date".
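
As a rough sketch of what is reachable through java.lang.Character
without going through RegEx at all (the class CharacterProperties and
its describe method are made-up names for this example, and the
selection of properties shown is only a sample):

```java
// Hypothetical demo class for illustration; not part of the JDK.
class CharacterProperties {
    // Summarizes a few Unicode properties of one code point.
    static String describe(int codePoint) {
        return String.format("U+%04X letter=%b lower=%b whitespace=%b block=%s",
                codePoint,
                Character.isLetter(codePoint),
                Character.isLowerCase(codePoint),
                Character.isWhitespace(codePoint),
                Character.UnicodeBlock.of(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(describe(0x00E9)); // LATIN SMALL LETTER E WITH ACUTE
        System.out.println(describe(0x2003)); // EM SPACE

        // The general category is exposed as a numeric constant.
        boolean isLl = Character.getType(0x00E9) == Character.LOWERCASE_LETTER;
        System.out.println(isLl); // true
    }
}
```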

Anyway, as I said, we do have a "plan" to improve the Unicode Regex
support in Java RegEx and are adding more pieces to it, though it might
be a little slower than people would like to see (currently I can spend
less than 5% of my time on RegEx for JDK7; I hope I can allocate more
time in the next couple of months). The good news is that Java is now an
open source project/platform, and I'm sure your decade of experience in
Unicode and Perl would definitely help should you decide to contribute
[5]. Even without direct code contribution, it would still benefit the
Java community if you could spend some time listing all your concerns
about the Unicode support in Java RegEx. I promise I will go through them
one by one (I will look into [3] in more detail next week).

I believe most of the Java Unicode "experts" are on this mailing list, so
we can start from here.

Thanks,
Sherman

[1] http://download.java.net/jdk7/docs/api/java/util/regex/Pattern.html
[2] http://www.unicode.org/reports/tr18
[3] http://stackoverflow.com/questions/4304928/unicode-equivalents-for-w-and-b-in-java-regular-expressions/4307261#4307261
[4] http://download.java.net/jdk7/docs/api/java/lang/Character.html
[5] http://openjdk.java.net/contribute/

On 12/11/2010 09:38, Tom Christiansen wrote:
> Good morning,
>
> I'm Tom Christiansen; some of you may know me from my work in the Perl
> Community.  I'm here at the urging of Martijn Verburg, who thought that my
> recent discoveries should be heard by your group.
>
> I've been professionally programming for more than 25 years now, mostly in
> C and Perl.  I recently joined the biomedical text-mining group at the
> University of Colorado, where the bulk of our code base is in Java.
>
> I've been responsible for working with large text corpora entirely in
> Unicode.  For example, one corpus comprises almost 200,000 papers and 11
> gigabytes, while another is a single file of 6 gigabytes.  I'm not new to
> Unicode, having worked with it a great deal over the last decade.
>
> Although most of our code base is in Java, we also have a considerable
> portion of Perl code and some Python code, too.  This code often first
> tokenizes the input stream before moving on to more sophisticated semantic
> processing.  I was quite surprised to learn how differently Java treated
> Unicode text than how the same text is treated by Perl and Python, even
> using identical regular expressions.  This has proved to be a significant
> barrier to fully adopting Java for our Unicode work.
>
> This prompted me to make a comprehensive study of Unicode issues in Java,
> focusing on regular expressions but also exploring other areas.  I've
> identified about two dozen individual areas that I feel deserve to be
> looked at.  These range from mismatches between documentation and behavior,
> to unfortunate or inconvenient defaults (e.g. "documented not to work"), to
> genuine bugs and international standards violations.
>
> Taken as a whole, these problem areas make Java a very difficult choice for
> the sort of text processing my group needs to use it for.  Surely many
> others all around the world are in a similar position.
>
> I've searched the archives for this mailing list, and have found no mention
> of these troubles either there, or indeed anywhere at all on the web.  For
> example:
>
>      http://www.google.com/search?client=opera&rls=en&q=site:http://mail.openjdk.java.net/pipermail/i18n-dev+unicode&sourceid=opera&ie=utf-8&oe=utf-8
>
> I have working code that fixes what for us is the most egregious of these
> problems: that regexes were unusable on Unicode.  One fundamental bug is
> that Java has misunderstood the connection between \b and \w regexes, so
> that now a string like "élève" is not matched by the pattern "\b\w+\b" at
> any point in the string.
>
> Other very serious problems include Java's unjustifiable demotion of legal
> Unicode whitespace characters from the set of whitespace characters
> (breaking tokenization), using Unicode property names in ways contrary to
> what the spec says they do, and in general supporting no Unicode properties
> any later than 3.0: even the critical Unicode 3.1 properties are ignored by
> Java.  These are very serious problems.  Java almost cannot be said to
> support Unicode--at least any Unicode release from the last ten
> years--until these critical deficiencies are fixed.
>
> You can find a brief synopsis of these specific troubles as well as a link
> to the Java code that fixes them here:
>
>      http://stackoverflow.com/questions/4304928/unicode-equivalents-for-w-and-b-in-java-regular-expressions/4307261#4307261
>
> I don't by any means think this is the best way to go about this.  It's
> just a band-aid we needed quickly to allow us to move on with our work.
> I'd like to offer it as a starting point for discussion of the issues that
> prompted its creation.
>
> As I mentioned, I have a couple dozen different Java Unicode issues, and
> this addresses just one or two of them.  When I get time, I'll try to bring
> up the others here in separate threads.
>
> If you could advise me how best to contribute to helping out here, I would
> be grateful.
>
> Thank you,
>
> --tom
