From tchrist at perl.com Sat Dec 11 09:38:55 2010
From: tchrist at perl.com (Tom Christiansen)
Date: Sat, 11 Dec 2010 10:38:55 -0700
Subject: Java and Unicode
Message-ID: <5502.1292089135@chthon>

Good morning,

I'm Tom Christiansen; some of you may know me from my work in the Perl Community. I'm here at the urging of Martijn Verburg, who thought that my recent discoveries should be heard by your group.

I've been professionally programming for more than 25 years now, mostly in C and Perl. I recently joined the biomedical text-mining group at the University of Colorado, where the bulk of our code base is in Java.

I've been responsible for working with large text corpora entirely in Unicode. For example, one corpus comprises almost 200,000 papers and 11 gigabytes, while another is a single file of 6 gigabytes. I'm not new to Unicode, having worked with it a great deal over the last decade.

Although most of our code base is in Java, we also have a considerable portion of Perl code and some Python code, too. This code often first tokenizes the input stream before moving on to more sophisticated semantic processing. I was quite surprised to learn how differently Java treated Unicode text than how the same text is treated by Perl and Python, even using identical regular expressions. This has proved to be a significant barrier to fully adopting Java for our Unicode work.

This prompted me to make a comprehensive study of Unicode issues in Java, focusing on regular expressions but also exploring other areas. I've identified about two dozen individual areas that I feel deserve to be looked at. These range from mismatches between documentation and behavior, to unfortunate or inconvenient defaults (e.g. "documented not to work"), to genuine bugs and international standards violations.

Taken as a whole, these problem areas make Java a very difficult choice for the sort of text processing my group needs to use it for.
Surely many others all around the world are in a similar position.

I've searched the archives for this mailing list, and have found no mention of these troubles either there, or indeed anywhere at all on the web. For example:

http://www.google.com/search?client=opera&rls=en&q=site:http://mail.openjdk.java.net/pipermail/i18n-dev+unicode&sourceid=opera&ie=utf-8&oe=utf-8

I have working code that fixes what for us is the most egregious of these problems: that regexes were unusable on Unicode. One fundamental bug is that Java has misunderstood the connection between \b and \w regexes, so that now a string like "élève" is not matched by the pattern "\b\w+\b" at any point in the string.

Other very serious problems include Java's unjustifiable demotion of legal Unicode whitespace characters from the set of whitespace characters (breaking tokenization), using Unicode property names in ways contrary to what the spec says they do, and in general supporting no Unicode properties any later than 3.0: even the critical Unicode 3.1 properties are ignored by Java. These are very serious problems. Java almost cannot be said to support Unicode--at least any Unicode release from the last ten years--until these critical deficiencies are fixed.

You can find a brief synopsis of these specific troubles as well as a link to the Java code that fixes them here:

http://stackoverflow.com/questions/4304928/unicode-equivalents-for-w-and-b-in-java-regular-expressions/4307261#4307261

I don't by any means think this is the best way to go about this. It's just a band-aid we needed quickly to allow us to move on with our work. I'd like to offer it as a starting point for discussion of the issues that prompted its creation.

As I mentioned, I have a couple dozen different Java Unicode issues, and this addresses just one or two of them. When I get time, I'll try to bring up the others here in separate threads.

If you could advise me how best to contribute to helping out here, I would be grateful.
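To make the \b and \w mismatch concrete, here is a minimal sketch; it is not the workaround linked above. It assumes Java 7 or later, where the flag Pattern.UNICODE_CHARACTER_CLASS is available (the flag is an assumption of this sketch and is not mentioned anywhere in this message). The class name UnicodeWordDemo and the helper firstMatch are hypothetical. What the unflagged pattern finds in the accented word depends on the JDK release, so the sketch prints that result instead of assuming one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names here are illustrative, not from the thread.
final class UnicodeWordDemo {
    // Default flags: \w covers only [a-zA-Z_0-9].
    static final Pattern DEFAULT_WORD = Pattern.compile("\\b\\w+\\b");

    // Assumption: Java 7+, where this flag switches \w and \b to Unicode definitions.
    static final Pattern UNICODE_WORD =
            Pattern.compile("\\b\\w+\\b", Pattern.UNICODE_CHARACTER_CLASS);

    // Returns the first match of the pattern in the text, or null if nothing matches.
    static String firstMatch(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The accented sample word, written with escapes (U+00E9 and U+00E8).
        String word = "\u00e9l\u00e8ve";
        // The unflagged result varies between JDK releases, so it is only printed.
        System.out.println("default flags: " + firstMatch(DEFAULT_WORD, word));
        System.out.println("unicode flag:  " + firstMatch(UNICODE_WORD, word));
    }
}
```

With the flag set, \w accepts the accented letters and \b agrees with it about where words begin and end, so the whole word is returned as one token.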
Thank you,

--tom

From martijnverburg at gmail.com Sat Dec 11 09:58:50 2010
From: martijnverburg at gmail.com (Martijn Verburg)
Date: Sat, 11 Dec 2010 17:58:50 +0000
Subject: Java and Unicode
In-Reply-To: <5502.1292089135@chthon>
References: <5502.1292089135@chthon>
Message-ID:

Hi all,

Just to add a little background here :). I'm Martijn (I help run the London JUG, FWIW) and I ran across an answer from Tom on StackOverflow about Unicode issues in Java - I quickly deleted my answer! It was one of _those_ answers which really impressed all of us Java developers on that thread, especially those who knew a little about Unicode (I don't really count myself as one of them!).

So I asked Tom if he'd mind volunteering some of his time here, as I knew there was some Unicode 6.0 work going on, and as he has a Perl and Unicode background I thought he would be able to contribute to the discussions and work here (unlike someone like me, whose eyes glaze over if I have to do anything more complicated than setting a character encoding).

I met a few of the OpenJDK advocates at Devoxx and that's inspired me, so I'm happy to try and help out Tom on the Java side where I can (or more importantly try to get enthusiastic volunteers from my JUG to help out ;p).

Cheers,
Martijn

twitter - @karianna & @java7developer

From xueming.shen at oracle.com Sat Dec 11 23:01:29 2010
From: xueming.shen at oracle.com (Xueming Shen)
Date: Sat, 11 Dec 2010 23:01:29 -0800
Subject: Java and Unicode
In-Reply-To: <5502.1292089135@chthon>
References: <5502.1292089135@chthon>
Message-ID: <4D047349.7090603@oracle.com>

Hi Tom,

Thanks for looking into the Unicode support issues in Java RegEx.
Since you have been working on Unicode for the past decade, I'm sure you understand that most of the issues you are pointing out here belong to the "Extended Unicode Support: Level 2" as documented in UTS#18 Unicode Regular Expressions [2]. Unfortunately, the current Java RegEx implementation only supports the "Basic Unicode Support: Level 1", as specified in the Java RegEx java.util.regex.Pattern API document [1]. I'm aware of and impressed by the Unicode support added "recently" in Perl 6, and was planning to close the gap (basically Java RegEx is the implementation that "matches" Perl 5) in JDK7. Due to resource issues I only managed to add the script and name support in RegEx and the Character class. I hope I can add more in the next couple of months; otherwise the rest will be deferred to JDK8 (\X is probably the most important one next on my list).

Regarding the POSIX properties: in Java RegEx, the Unicode Alphabetic, Lowercase, Uppercase and Whitespace properties are supported by using \p{javaLetter}, \p{javaLowerCase}, \p{javaUpperCase} or \p{javaWhitespace}. The \p{Lower/Upper/ASCII/Alpha...} properties, as noticed, are clearly specified by the Java RegEx specification [1] as being for US-ASCII only (does Perl 5 work this way as well?). This is by design, and I don't agree with the "this is a mess" conclusion. While there are developers out there who might like these properties to evolve into the Unicode properties, I am pretty sure there is the same number of developers who would prefer that these properties be kept as the "original" POSIX properties.

In the end, Java RegEx is NOT a Unicode RegEx. While it supports Unicode RegEx at a certain level, sometimes via different syntax, I don't feel this is a big problem for most Java developers, and it should not be a stopper for most programs.
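The difference between the POSIX-style classes and the java* classes described above can be checked directly. What follows is a small sketch and not part of the original reply: the class name PropertyClassDemo and its matches helper are made up for illustration, and the sample characters (U+00E9, U+2003, U+00A0) were chosen only as examples. All patterns are compiled with default flags.

```java
import java.util.regex.Pattern;

// Hypothetical demo class comparing POSIX-style and java* character classes.
final class PropertyClassDemo {
    // True when the whole input matches the regex under default flags.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9";   // LATIN SMALL LETTER E WITH ACUTE
        String emSpace = "\u2003";  // EM SPACE
        String noBreak = "\u00a0";  // NO-BREAK SPACE

        System.out.println(matches("\\p{Lower}", eAcute));            // false: ASCII a-z only
        System.out.println(matches("\\p{javaLowerCase}", eAcute));    // true: Character.isLowerCase
        System.out.println(matches("\\s", emSpace));                  // false: ASCII whitespace only
        System.out.println(matches("\\p{javaWhitespace}", emSpace));  // true: Character.isWhitespace
        System.out.println(matches("\\p{javaWhitespace}", noBreak));  // false: non-breaking spaces are excluded
    }
}
```

The last line shows why \p{javaWhitespace} is not a drop-in replacement for the Unicode White_Space property: Character.isWhitespace deliberately leaves out the non-breaking spaces.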
I would also like to point out that Java is NOT a RegEx-based language/platform. RegEx is not part of the Java language (I mean the language specification); it is one of the utility packages in the Java platform's core libraries. So even if certain Unicode properties are not yet supported by Java RegEx, that does not mean they are not supported by the platform: you should be able to access those Unicode properties via the java.lang.Character class [4]. So I would strongly disagree with the comment that "Java's Unicode property support is *strictly antemillennial*, by which I mean it supports no Unicode property that has come out in the last decade." :-) Even Java RegEx is NOT that bad; the script, block and category property support is pretty "up to date".

Anyway, as I said, we do have a "plan" to improve the Unicode Regex support in Java RegEx and are adding more pieces to it, though it might be a little slower than people would like to see (currently I can only spend less than 5% of my time on RegEx for JDK7; I hope I can allocate more time in the next couple of months). The good news is that Java is now an open source project/platform, and I'm sure your decade of experience in Unicode and Perl would definitely help should you decide to contribute [5]. Even without direct code contribution, it would still benefit the Java community if you could spend some time listing all your concerns about the Unicode support in Java RegEx. I promise I will go through them one by one (I will look into [3] in more detail next week). I believe most of the Java Unicode "experts" are on this mailing list, so we can start from here.
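As a rough sketch of reaching character properties through java.lang.Character rather than through the regex engine, the following prints a few of them for sample code points. The class name CharacterPropertyDemo, the describe helper and the choice of code points are all hypothetical; only long-standing Character methods are used, and nothing here covers the newer Unicode properties discussed in this thread.

```java
import java.util.Locale;

// Hypothetical demo class; shows property lookups that bypass java.util.regex.
final class CharacterPropertyDemo {
    // Builds a one-line summary of a few Character properties for a code point.
    static String describe(int codePoint) {
        return String.format(Locale.ROOT,
                "U+%04X letter=%b lower=%b whitespace=%b spaceChar=%b block=%s",
                codePoint,
                Character.isLetter(codePoint),
                Character.isLowerCase(codePoint),
                Character.isWhitespace(codePoint),
                Character.isSpaceChar(codePoint),
                Character.UnicodeBlock.of(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(describe(0x00E9)); // LATIN SMALL LETTER E WITH ACUTE
        System.out.println(describe(0x00A0)); // NO-BREAK SPACE
        System.out.println(describe(0x2003)); // EM SPACE
    }
}
```

Note that isWhitespace and isSpaceChar answer different questions: the no-break space is a Unicode space separator, yet isWhitespace reports false for it.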
Thanks,
Sherman

[1] http://download.java.net/jdk7/docs/api/java/util/regex/Pattern.html
[2] http://www.unicode.org/reports/tr18
[3] http://stackoverflow.com/questions/4304928/unicode-equivalents-for-w-and-b-in-java-regular-expressions/4307261#4307261
[4] http://download.java.net/jdk7/docs/api/java/lang/Character.html
[5] http://openjdk.java.net/contribute/