JDK 9 Build 111 seems to miss some locale data, Lucene tests fail with Farsi and Thai language
Uwe Schindler
uschindler at apache.org
Sat Mar 26 19:11:11 UTC 2016
Hi Alan, hi Robert, Hi Lucene developers,
I was able to reproduce the bug in isolation. The reason why Robert and you did not see it was quite simple:
- You need to enable a security manager
- You need to list all available locales before requesting the break iterator
When you print the class name of the returned break iterator, with Java 8 or Java 9 b110 it returns: "class sun.util.locale.provider.DictionaryBasedBreakIterator"
With build 111 and no security manager, it prints: "class sun.util.locale.provider.DictionaryBasedBreakIterator" (all fine).
With build 111 and security manager enabled, it prints: "class sun.util.locale.provider.RuleBasedBreakIterator" (which is the wrong one for Thai).
Here is my test code:
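import java.text.BreakIterator;
import java.util.*;

// Run with a security manager enabled (e.g. -Djava.security.manager) to see the wrong class.
public class Test {
  public static void main(String... args) throws Exception {
    // Listing all locales first is required to trigger the bug.
    String[] availableLanguageTags = Arrays.stream(Locale.getAvailableLocales())
        .map(Locale::toLanguageTag)
        .sorted()
        .distinct()
        .toArray(String[]::new);
    BreakIterator iterator = BreakIterator.getWordInstance(new Locale("th"));
    // Prints the implementation class of the Thai word break iterator.
    System.out.println(iterator.getClass());
  }
}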
The availableLanguageTags computation is what our test framework does before running a test. It is needed to trigger the bug.
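To check the behavior and not only the class name, the same steps can be combined with the boundary check that Lucene uses. This is only a sketch: the class name ThaiBreakProbe and the method probe() are made up for illustration, and the sample text is the Thai string from the Lucene check written with Unicode escapes. I make no claim about what it prints on any particular build.

```java
import java.text.BreakIterator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper, not part of Lucene or the JDK.
public class ThaiBreakProbe {
    // The Thai sample string from the Lucene check, escaped so the file
    // compiles regardless of the source encoding.
    static final String SAMPLE = "\u0E20\u0E32\u0E29\u0E32\u0E44\u0E17\u0E22";

    /** Lists all locales first (the step that triggers the bug), then probes Thai. */
    static String probe() {
        long tagCount = Arrays.stream(Locale.getAvailableLocales())
                .map(Locale::toLanguageTag)
                .distinct()
                .count();

        BreakIterator iterator = BreakIterator.getWordInstance(Locale.forLanguageTag("th"));
        iterator.setText(SAMPLE);
        // The Lucene check treats a boundary at offset 4 as proof of a
        // working dictionary-based iterator.
        boolean boundaryAt4 = iterator.isBoundary(4);

        return "locales=" + tagCount
                + " class=" + iterator.getClass().getName()
                + " boundaryAt4=" + boundaryAt4;
    }

    public static void main(String[] args) {
        System.out.println(probe());
    }
}
```

Running it once with and once without a security manager on the same build should make the difference visible in a single line of output.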
The other problem, with Farsi, behaves the same way: without a security manager everything passes; with a security manager it fails. The cause is the same too: the Collator returned is just a default Collator, not the one for Arabic/Farsi text.
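One way to probe for a default Collator is to compare its rules with those of the root locale. This is a rough sketch under two assumptions that I have not verified on every build: that the JDK returns a RuleBasedCollator here, and that locale-specific tailoring shows up in getRules(). The class name ArabicCollatorProbe is made up for illustration.

```java
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Locale;

// Hypothetical helper, not part of Lucene or the JDK.
public class ArabicCollatorProbe {
    /** Lists all locales first, then compares the Arabic collator with the root one. */
    static String probe() {
        Locale.getAvailableLocales(); // same precondition as for the Thai case

        Collator arabic = Collator.getInstance(Locale.forLanguageTag("ar"));
        Collator root = Collator.getInstance(Locale.ROOT);

        String verdict = "unknown (not a RuleBasedCollator)";
        if (arabic instanceof RuleBasedCollator && root instanceof RuleBasedCollator) {
            String arabicRules = ((RuleBasedCollator) arabic).getRules();
            String rootRules = ((RuleBasedCollator) root).getRules();
            // Assumption: identical rules suggest that no Arabic tailoring was loaded.
            verdict = arabicRules.equals(rootRules)
                    ? "same rules as root"
                    : "rules differ from root";
        }
        return "class=" + arabic.getClass().getName() + " verdict=" + verdict;
    }

    public static void main(String[] args) {
        System.out.println(probe());
    }
}
```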
So it looks like the initialization code for locales is missing a doPrivileged() call somewhere. Maybe it was lost during the merge.
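For readers not familiar with the pattern, this is roughly what such a call looks like. It is a generic illustration and not the actual JDK code; I do not know where the missing call is. The class name PrivilegedLookupSketch and the choice of reading the java.locale.providers system property are only for illustration. Note that AccessController is deprecated in recent JDK releases, so the sketch compiles there with a warning.

```java
import java.security.AccessController;
import java.security.PrivilegedAction;

// Hypothetical illustration of the doPrivileged pattern, not JDK source.
public class PrivilegedLookupSketch {
    /**
     * Reads a system property inside doPrivileged, so the read is checked
     * against this code's own permissions and not those of its callers.
     */
    static String readLocaleProvidersSetting() {
        // The cast selects the PrivilegedAction overload for the lambda.
        return AccessController.doPrivileged(
                (PrivilegedAction<String>) () ->
                        System.getProperty("java.locale.providers", "<unset>"));
    }

    public static void main(String[] args) {
        System.out.println(readLocaleProvidersSetting());
    }
}
```

Without the doPrivileged wrapper, the same read made on behalf of less-privileged calling code can be denied by the security manager, which would fit the symptom of falling back to default locale data.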
Uwe
-----
Uwe Schindler
uschindler at apache.org
ASF Member, Apache Lucene PMC / Committer
Bremen, Germany
http://lucene.apache.org/
> -----Original Message-----
> From: Alan Bateman [mailto:Alan.Bateman at oracle.com]
> Sent: Saturday, March 26, 2016 3:10 PM
> To: Uwe Schindler <uschindler at apache.org>
> Cc: 'Rory O'Donnell' <rory.odonnell at oracle.com>; 'Core-Libs-Dev' <core-libs-
> dev at openjdk.java.net>; 'Robert Muir' <rcmuir at gmail.com>
> Subject: Re: JDK 9 Build 111 seems to miss some locale data, Lucene tests fail
> with Farsi and Thai language
>
> On 26/03/2016 11:56, Uwe Schindler wrote:
> > Hi,
> >
> > after also testing the separate "Jigsaw" build on jdk9.java.net I see the same problems. So both builds 111 are wrong.
> >
> > To me it looks like the Unicode data files are missing some information - which could again be a packaging bug. As said before, build 110 does not have this problem, so it seems to be a side-effect of Jigsaw merging.
> >
> > The following stuff does not work:
> >
> > (1) The Thai locale does not have a working dictionary-based BreakIterator available. The following "check" in Lucene for this fails, because it cannot detect a boundary correctly:
> >
> > /**
> >  * True if the JRE supports a working dictionary-based breakiterator for Thai.
> >  * If this is false, this tokenizer will not work at all!
> >  */
> > public static final boolean DBBI_AVAILABLE;
> > private static final BreakIterator proto = BreakIterator.getWordInstance(new Locale("th"));
> > static {
> >   // check that we have a working dictionary-based break iterator for thai
> >   proto.setText("ภาษาไทย");
> >   DBBI_AVAILABLE = proto.isBoundary(4);
> > }
> >
> > After this static initializer, DBBI_AVAILABLE is false. This causes some tests to be ignored, but 2 fail because of it (which might be an oversight on our side). Nevertheless, this is a bug in build 111.
> I just tried to duplicate this on OSX and Linux without success. The log
> you linked to suggests this is Linux, is that right? Is this the JDK
> bundle? I haven't checked the JRE bundle but would be surprised if anything
> is missing. The JDK has several tests for Thai so if it was completely
> broken then I would have expected it would have been seen. I've no doubt
> that it is not working in your environment, we just need to figure out
> what is different.
>
> >
> > (2) The collator for the Arabic (Farsi) language fails to work correctly. This also looks like missing data.
> >
> > Collator collator = Collator.getInstance(new Locale("ar"));
> >
> Are there any exceptions or anything here? Or maybe it tests the
> collator with compare?
>
> -Alan