JDK 9 Build 111 seems to miss some locale data, Lucene tests fail with Farsi and Thai language

Sat Mar 26 19:11:11 UTC 2016

Hi Alan, hi Robert, Hi Lucene developers,

I was able to reproduce the bug in isolation. The reason why neither Robert nor you saw it is quite simple. Two conditions must be met:
- A security manager must be enabled.
- All available locales must be listed first, before the BreakIterator is requested.

When you print the class name of the returned break iterator, with Java 8 or Java 9 b110 it returns: "class sun.util.locale.provider.DictionaryBasedBreakIterator"
With build 111 and no security manager, it prints: "class sun.util.locale.provider.DictionaryBasedBreakIterator" (all fine).
With build 111 and security manager enabled, it prints: "class sun.util.locale.provider.RuleBasedBreakIterator" (which is the wrong one for Thai).

Here is my test code:

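import java.text.BreakIterator;
import java.util.*;

// Run once without and once with a security manager enabled, then compare the output.
public class Test {
  public static void main(String... args) throws Exception {
    // Listing all available locales first is required to trigger the bug;
    // the resulting array itself is not used afterwards.
    String[] availableLanguageTags = Arrays.stream(Locale.getAvailableLocales())
      .map(Locale::toLanguageTag)
      .sorted()
      .distinct()
      .toArray(String[]::new);
    // On build 111 with a security manager this prints RuleBasedBreakIterator
    // instead of DictionaryBasedBreakIterator.
    BreakIterator iterator = BreakIterator.getWordInstance(new Locale("th"));
    System.out.println(iterator.getClass());
  }
}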

The availableLanguageTags computation mirrors what our test framework does before running a test. It is required to trigger the bug.
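
To narrow down which of the two conditions matters, a variant of the test with an optional warm-up may help. This is only a sketch: the class name ThaiProbe and the "warmup" argument are made up for illustration, and it assumes that the Locale.getAvailableLocales() call itself is the relevant part of the warm-up, which I have not verified. The string is the same Thai text used by the Lucene check (written with Unicode escapes), and the boundary test at offset 4 is the one Lucene performs.

```java
import java.text.BreakIterator;
import java.util.Locale;

public class ThaiProbe {
  // The Thai sample text from the Lucene check, written with Unicode escapes
  // so that the source file encoding does not matter.
  static final String THAI_TEXT = "\u0E20\u0E32\u0E29\u0E32\u0E44\u0E17\u0E22";

  // Returns the break iterator class name plus the result of the boundary check.
  static String describe(boolean warmUp) {
    if (warmUp) {
      // Assumption: listing the locales is what triggers the problem.
      Locale.getAvailableLocales();
    }
    BreakIterator iterator = BreakIterator.getWordInstance(Locale.forLanguageTag("th"));
    iterator.setText(THAI_TEXT);
    return iterator.getClass().getName() + " boundaryAt4=" + iterator.isBoundary(4);
  }

  public static void main(String... args) {
    // Hypothetical switch: pass "warmup" to list the locales first.
    boolean warmUp = args.length > 0 && args[0].equals("warmup");
    System.out.println(describe(warmUp));
  }
}
```

Running it four ways (with and without "warmup", with and without a security manager, for example via -Djava.security.manager on JDKs that still support it) should show which combination returns the wrong iterator.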

The other problem, with Farsi, has the same cause: without a security manager everything passes; with a security manager it fails. The Collator returned is just a default Collator, not the one for Arabic/Farsi text.
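
A similar isolated check for the Collator could look like the following. It is a rough sketch, not what our tests do: the class name CollatorProbe and its helper methods are invented for this mail, and it assumes that the JDK's Arabic collator carries tailoring rules that differ from those of the root locale, so identical rules would hint at a default Collator being returned.

```java
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Locale;
import java.util.Objects;

public class CollatorProbe {
  // Returns the collation rules for the locale, or null if the collator is not rule-based.
  static String rulesOf(Locale locale) {
    Collator collator = Collator.getInstance(locale);
    return (collator instanceof RuleBasedCollator)
        ? ((RuleBasedCollator) collator).getRules()
        : null;
  }

  // True if the locale's rules are not identical to those of the root locale.
  static boolean differsFromRoot(Locale locale) {
    return !Objects.equals(rulesOf(locale), rulesOf(Locale.ROOT));
  }

  public static void main(String... args) {
    // Same warm-up as in the Thai test: list the available locales first.
    Locale.getAvailableLocales();
    Locale arabic = Locale.forLanguageTag("ar");
    System.out.println(Collator.getInstance(arabic).getClass());
    System.out.println("rules differ from root: " + differsFromRoot(arabic));
  }
}
```

As with the Thai test, the interesting part is comparing the output of a run without a security manager against a run with one.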

So it looks like the initialization code for locales is missing a doPrivileged() call somewhere. Maybe it was lost during the merge.
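
For readers not familiar with the pattern, this is the general shape of a doPrivileged() wrapper. It is purely illustrative: PrivilegedLookup and readProperty are hypothetical names, the property read is only an example, and I do not know which part of the JDK's locale initialization actually lacks such a call.

```java
import java.security.AccessController;
import java.security.PrivilegedAction;

public class PrivilegedLookup {
  // Runs the lookup with the privileges of this class rather than those of
  // the (possibly less privileged) caller further up the stack.
  static String readProperty(String key) {
    return AccessController.doPrivileged(
        (PrivilegedAction<String>) () -> System.getProperty(key));
  }

  public static void main(String... args) {
    // Example only: any permission-checked lookup could stand here.
    System.out.println(readProperty("java.version"));
  }
}
```

If library code performs a permission-checked lookup without such a wrapper, the check is made against the whole call stack, so it can succeed without a security manager and fail, or silently fall back, with one. That would match the behaviour described above.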

Uwe

-----
Uwe Schindler
uschindler at apache.org 
ASF Member, Apache Lucene PMC / Committer
Bremen, Germany
http://lucene.apache.org/

> -----Original Message-----
> From: Alan Bateman [mailto:Alan.Bateman at oracle.com]
> Sent: Saturday, March 26, 2016 3:10 PM
> To: Uwe Schindler <uschindler at apache.org>
> Cc: 'Rory O'Donnell' <rory.odonnell at oracle.com>; 'Core-Libs-Dev' <core-libs-
> dev at openjdk.java.net>; 'Robert Muir' <rcmuir at gmail.com>
> Subject: Re: JDK 9 Build 111 seems to miss some locale data, Lucene tests fail
> with Farsi and Thai language
> 
> On 26/03/2016 11:56, Uwe Schindler wrote:
> > Hi,
> >
> > after also testing the separate "Jigsaw" build on jdk9.java.net I see the
> > same problems. So both builds 111 are wrong.
> >
> > To me it looks like the Unicode data files are missing some information -
> > which could again be a packaging bug. As said before, build 110 does not have
> > this problem, so it seems to be a side-effect of Jigsaw merging.
> >
> > The following stuff does not work:
> >
> > (1) Thai's locale does not have working dictionary-based BreakIterator
> > available. The following "check" in Lucene for this fails, because it cannot
> > detect a boundary correctly:
> >
> >    /**
> >     * True if the JRE supports a working dictionary-based breakiterator for Thai.
> >     * If this is false, this tokenizer will not work at all!
> >     */
> >    public static final boolean DBBI_AVAILABLE;
> >    private static final BreakIterator proto = BreakIterator.getWordInstance(new Locale("th"));
> >    static {
> >      // check that we have a working dictionary-based break iterator for thai
> >      proto.setText("ภาษาไทย");
> >      DBBI_AVAILABLE = proto.isBoundary(4);
> >    }
> >
> > After this static initializer, DBBI_AVAILABLE is false. This causes some tests
> > to be ignored, but 2 fail because of it (which might be an oversight on our
> > side). Nevertheless, this is a bug in build 111.
> I just tried to duplicate this on OSX and Linux without success. The log
> you linked to suggests this is Linux, is that right? Is this the JDK
> bundle? I haven't checked the JRE bundle but would be surprised if anything
> is missing. The JDK has several tests for Thai so if it was completely
> broken then I would have expected it would have been seen. I've no doubt
> that it is not working in your environment, we just need to figure out
> what is different.
> 
> >
> > (2) The collator for Arabic (Farsi) language fails to work correctly. This also
> > looks like missing data.
> >
> > Collator collator = Collator.getInstance(new Locale("ar"));
> >
> Are there any exceptions or anything here? Or maybe it tests the
> collator with compare?
> 
> -Alan