<i18n dev> Java and Unicode

Martijn Verburg martijnverburg at gmail.com
Sat Dec 11 09:58:50 PST 2010


Hi all,

Just to add a little background here :).  I'm Martijn (I help run the London
JUG FWIW) and I ran across an answer from Tom on StackOverflow about Unicode
issues in Java - I quickly deleted my answer!

It was one of _those_ answers which really impressed all of us Java
developers on that thread, especially those who knew a little about Unicode
(I don't really count myself as one of them!).  So I asked Tom if he'd mind
volunteering some of his time here: I knew there was some Unicode 6.0 work
going on, and as he has a Perl and Unicode background I thought he would be
able to contribute to the discussions and work here (unlike someone like me,
whose eyes glaze over if I have to do anything more complicated than setting
a character encoding).

I met a few of the OpenJDK advocates at Devoxx and that's inspired me so I'm
happy to try and help out Tom on the Java side where I can (or more
importantly try to get enthusiastic volunteers from my JUG to help out ;p).

Cheers,
Martijn
twitter - @karianna & @java7developer

On Sat, Dec 11, 2010 at 5:38 PM, Tom Christiansen <tchrist at perl.com> wrote:

> Good morning,
>
> I'm Tom Christiansen; some of you may know me from my work in the Perl
> Community.  I'm here at the urging of Martijn Verburg, who thought that my
> recent discoveries should be heard by your group.
>
> I've been professionally programming for more than 25 years now, mostly in
> C and Perl.  I recently joined the biomedical text-mining group at the
> University of Colorado, where the bulk of our code base is in Java.
>
> I've been responsible for working with large text corpora entirely in
> Unicode.  For example, one corpus comprises almost 200,000 papers and 11
> gigabytes, while another is a single file of 6 gigabytes.  I'm not new to
> Unicode, having worked with it a great deal over the last decade.
>
> Although most of our code base is in Java, we also have a considerable
> portion of Perl code and some Python code, too.  This code often first
> tokenizes the input stream before moving on to more sophisticated semantic
> processing.  I was quite surprised to learn how differently Java treats
> Unicode text from the way Perl and Python treat the same text, even when
> using identical regular expressions.  This has proved to be a significant
> barrier to fully adopting Java for our Unicode work.
>
> This prompted me to make a comprehensive study of Unicode issues in Java,
> focusing on regular expressions but also exploring other areas.  I've
> identified about two dozen individual areas that I feel deserve to be
> looked at.  These range from mismatches between documentation and behavior,
> to unfortunate or inconvenient defaults (e.g. "documented not to work"), to
> genuine bugs and international standards violations.
>
> Taken as a whole, these problem areas make Java a very difficult choice for
> the sort of text processing my group needs to use it for.  Surely many
> others all around the world are in a similar position.
>
> I've searched the archives for this mailing list, and have found no mention
> of these troubles either there, or indeed anywhere at all on the web.  For
> example:
>
>
> http://www.google.com/search?client=opera&rls=en&q=site:http://mail.openjdk.java.net/pipermail/i18n-dev+unicode&sourceid=opera&ie=utf-8&oe=utf-8
>
> I have working code that fixes what for us is the most egregious of these
> problems: that regexes are unusable on Unicode text.  One fundamental bug
> is that Java has misunderstood the connection between \b and \w in regexes,
> so that a string like "élève" is not matched by the pattern "\b\w+\b" at
> any point in the string.
>
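For anyone who wants to try the "élève" case locally, here is a minimal sketch. It is not Tom's code from the StackOverflow answer. The class and method names are made up for illustration, and it assumes Java 7 or later, where the Pattern.UNICODE_CHARACTER_CLASS flag exists (that flag is newer than this thread). What the default-flags call returns depends on the Java release you run it on, so no output is claimed for that line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
public class WordBoundaryDemo {
    // "élève", written with escapes so the source file encoding does not matter.
    static final String WORD = "\u00e9l\u00e8ve";

    // Returns the first match of \b\w+\b in the input, or null if there is none.
    static String firstWord(String input, int flags) {
        Matcher m = Pattern.compile("\\b\\w+\\b", flags).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Default flags: \w is ASCII-only, so the accented letters are not
        // word characters here. The exact result varies by Java release.
        System.out.println("default: " + firstWord(WORD, 0));

        // Assumption: Java 7+. With this flag both \w and \b follow the
        // Unicode definitions, so the whole word is matched.
        System.out.println("unicode: "
                + firstWord(WORD, Pattern.UNICODE_CHARACTER_CLASS));
    }
}
```

The flag can also be switched on inside the pattern itself with the embedded form (?U), which is handy when the pattern string comes from a config file.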
> Other very serious problems include Java's unjustifiable demotion of legal
> Unicode whitespace characters from the set of whitespace characters
> (breaking tokenization), using Unicode property names in ways contrary to
> what the spec says they do, and in general supporting no Unicode properties
> any later than 3.0: even the critical Unicode 3.1 properties are ignored by
> Java.  These are very serious problems.  Java almost cannot be said to
> support Unicode--at least any Unicode release from the last ten
> years--until these critical deficiencies are fixed.
>
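The whitespace point can be seen with a similarly small sketch. Again this is only an illustration under assumptions, not the fix Tom describes: the class name is hypothetical, U+2003 EM SPACE is just one example of a character carrying the Unicode White_Space property, and the second call needs Java 7 or later for Pattern.UNICODE_CHARACTER_CLASS.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
public class WhitespaceDemo {
    // Two words separated by U+2003 EM SPACE, which has the White_Space property.
    static final String TEXT = "foo\u2003bar";

    // Splits the input on runs of \s and returns how many tokens come back.
    static int tokenCount(String input, int flags) {
        return Pattern.compile("\\s+", flags).split(input).length;
    }

    public static void main(String[] args) {
        // Default \s covers only space, \t, \n, \x0B, \f and \r,
        // so the EM SPACE does not separate the two words.
        System.out.println(tokenCount(TEXT, 0)); // prints 1

        // Assumption: Java 7+. With the flag, \s follows White_Space.
        System.out.println(tokenCount(TEXT, Pattern.UNICODE_CHARACTER_CLASS)); // prints 2
    }
}
```

A tokenizer built on the default \s would therefore glue "foo" and "bar" into one token, which is the kind of breakage described above.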
> You can find a brief synopsis of these specific troubles as well as a link
> to the Java code that fixes them here:
>
>
> http://stackoverflow.com/questions/4304928/unicode-equivalents-for-w-and-b-in-java-regular-expressions/4307261#4307261
>
> I don't by any means think this is the best way to go about this.  It's
> just a band-aid we needed quickly to allow us to move on with our work.
> I'd like to offer it as a starting point for discussion of the issues that
> prompted its creation.
>
> As I mentioned, I have a couple dozen different Java Unicode issues, and
> this addresses just one or two of them.  When I get time, I'll try to bring
> up the others here in separate threads.
>
> If you could advise me how best to contribute to helping out here, I would
> be grateful.
>
> Thank you,
>
> --tom
>