<i18n dev> Java and Unicode

Sat Dec 11 09:38:55 PST 2010

Good morning,

I'm Tom Christiansen; some of you may know me from my work in the Perl
community.  I'm here at the urging of Martijn Verburg, who thought that my
recent discoveries should be heard by your group.

I've been professionally programming for more than 25 years now, mostly in
C and Perl.  I recently joined the biomedical text-mining group at the
University of Colorado, where the bulk of our code base is in Java.  

I've been responsible for working with large text corpora entirely in
Unicode.  For example, one corpus comprises almost 200,000 papers and 11
gigabytes, while another is a single file of 6 gigabytes.  I'm not new to
Unicode, having worked with it a great deal over the last decade.

Although most of our code base is in Java, we also have a considerable
portion of Perl code and some Python code, too.  This code often first
tokenizes the input stream before moving on to more sophisticated semantic
processing.  I was quite surprised to learn how differently Java treats
Unicode text from the way Perl and Python treat the same text, even when
using identical regular expressions.  This has proved to be a significant
barrier to fully adopting Java for our Unicode work.

This prompted me to make a comprehensive study of Unicode issues in Java,
focusing on regular expressions but also exploring other areas.  I've
identified about two dozen individual areas that I feel deserve to be
looked at.  These range from mismatches between documentation and behavior,
to unfortunate or inconvenient defaults (e.g. "documented not to work"), to
genuine bugs and international standards violations.

Taken as a whole, these problem areas make Java a very difficult choice for
the sort of text processing my group needs to use it for.  Surely many
others all around the world are in a similar position.

I've searched the archives for this mailing list, and have found no mention
of these troubles either there, or indeed anywhere at all on the web.  For
example:

    http://www.google.com/search?client=opera&rls=en&q=site:http://mail.openjdk.java.net/pipermail/i18n-dev+unicode&sourceid=opera&ie=utf-8&oe=utf-8

I have working code that fixes what for us is the most egregious of these
problems: that regexes are unusable on Unicode.  One fundamental bug is
that Java's \b and \w disagree about what a word character is: \w matches
only ASCII letters, digits, and underscore, while \b uses a broader,
Unicode-aware notion of a word character.  As a result, a string like
"élève" is not matched by the pattern "\b\w+\b" at any point in the string.
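
To make the \w side of this concrete, here is a minimal sketch.  It is
illustrative only: the class and method names are made up, and the
lookaround-based pattern is just one possible way to get Unicode-aware word
matching from general-category properties, not necessarily what the working
code mentioned above does.  It assumes that "word character" means letters,
marks, numbers, and underscore.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeWordDemo {
    // Assumed definition of a word character: any letter, mark, or number,
    // plus underscore. This is an illustrative choice, not a standard one.
    private static final String WORD = "[\\p{L}\\p{M}\\p{N}_]";

    // Emulates \b\w+\b with lookarounds so that the boundary test and the
    // word test use the same notion of "word character".
    private static final Pattern UNICODE_WORD =
        Pattern.compile("(?<!" + WORD + ")" + WORD + "+(?!" + WORD + ")");

    // Returns every maximal run of word characters in the text.
    public static List<String> findWords(String text) {
        List<String> words = new ArrayList<String>();
        Matcher m = UNICODE_WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    // True if the single-character string matches \w with no flags set.
    public static boolean defaultWordCharMatches(String s) {
        return Pattern.matches("\\w", s);
    }

    public static void main(String[] args) {
        // "eleve" with U+00E9 and U+00E8, written as escapes so the source
        // file encoding does not matter.
        String word = "\u00e9l\u00e8ve";
        System.out.println(defaultWordCharMatches("\u00e9"));
        System.out.println(findWords("un " + word + " studieux"));
    }
}
```

With no flags set, \w rejects U+00E9, while the property-based pattern
returns the whole word as a single match.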

Other very serious problems include Java's unjustifiable demotion of legal
Unicode whitespace characters from the set of whitespace characters
(breaking tokenization), its use of Unicode property names in ways contrary
to what the spec says they mean, and its general failure to support any
Unicode properties later than 3.0: even the critical Unicode 3.1 properties
are ignored by Java.  Java almost cannot be said to support Unicode--at
least any Unicode release from the last ten years--until these critical
deficiencies are fixed.
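
The whitespace problem is easy to reproduce.  The sketch below is only an
illustration: the class and method names are hypothetical, and the
character class used is an approximation of the Unicode White_Space
property built from \s, the separator categories, and U+0085, not an
official definition.

```java
import java.util.regex.Pattern;

public class UnicodeSpaceDemo {
    // \s with no flags set.
    private static final Pattern DEFAULT_SPACE = Pattern.compile("\\s+");

    // Approximation of Unicode White_Space (an assumption of this sketch):
    // the default \s set, every separator (Zs, Zl, Zp), and NEL (U+0085).
    private static final Pattern UNICODE_SPACE =
        Pattern.compile("[\\s\\p{Z}\\x85]+");

    // Splits on runs of whitespace as \s defines it by default.
    public static String[] tokenizeDefault(String text) {
        return DEFAULT_SPACE.split(text);
    }

    // Splits on runs of whitespace using the broader approximation.
    public static String[] tokenizeUnicode(String text) {
        return UNICODE_SPACE.split(text);
    }

    public static void main(String[] args) {
        // Three words separated by EM SPACE (U+2003) and NO-BREAK SPACE (U+00A0).
        String text = "alpha\u2003beta\u00a0gamma";
        System.out.println(tokenizeDefault(text).length);
        System.out.println(tokenizeUnicode(text).length);
        System.out.println(Character.isWhitespace('\u00a0'));
    }
}
```

Splitting on the default \s leaves the three words fused into one token,
while the broader class separates them.  Character.isWhitespace likewise
rejects the no-break space.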

You can find a brief synopsis of these specific troubles as well as a link
to the Java code that fixes them here:

    http://stackoverflow.com/questions/4304928/unicode-equivalents-for-w-and-b-in-java-regular-expressions/4307261#4307261

I don't by any means think this is the best way to go about this.  It's
just a band-aid we needed quickly to allow us to move on with our work.
I'd like to offer it as a starting point for discussion of the issues that
prompted its creation.

As I mentioned, I have a couple dozen different Java Unicode issues, and
this addresses just one or two of them.  When I get time, I'll try to bring
up the others here in separate threads.

If you could advise me on how best to help out here, I would be grateful.

Thank you,

--tom