Non-ASCII characters failing on Lion, multiple possible other locale based bugs

Mon Aug 1 09:08:51 PDT 2011

I've logged a bug (http://java.net/jira/browse/MACOSX_PORT-204), but this is a pretty serious problem with OpenJDK on Lion and I wanted to get some opinions on it.  Non-ASCII filenames (e.g. Japanese or Chinese) are completely failing: when you try to create them they are converted to "?????" (some number of question marks), and when you try to read them you get garbage Unicode characters.

I've been tracking through the code and I think I've identified what's going on.

JNU_GetStringPlatformChars in jdk/src/share/native/common/jni_util.c is using "UTF-8" as the platform encoding on OS X 10.6 and "US-ASCII" on OS X 10.7.  This is the encoding that is then used to convert Java strings to/from the file system encoding.
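
To illustrate why a "US-ASCII" platform encoding would produce the question marks, here is a minimal sketch of a lossy ASCII encoder.  This is not the JDK's conversion code; encode_us_ascii_lossy is a hypothetical helper, and it assumes the encoder substitutes '?' for every character it cannot represent.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper for illustration only; not taken from the JDK.
 * Encodes Unicode code points as US-ASCII, writing '?' for anything
 * outside the 7-bit range. Returns the number of bytes written, not
 * counting the terminating NUL.
 */
static size_t encode_us_ascii_lossy(const uint32_t *code_points, size_t count,
                                    char *out, size_t out_size)
{
    size_t i;
    size_t written = 0;

    if (out_size == 0) {
        return 0;
    }
    /* Leave room for the terminating NUL. */
    for (i = 0; i < count && written + 1 < out_size; i++) {
        out[written++] = (code_points[i] < 0x80) ? (char)code_points[i] : '?';
    }
    out[written] = '\0';
    return written;
}
```

Fed the code points U+65E5 U+672C U+8A9E followed by ".txt", this yields "???.txt", which matches the shape of the names I'm seeing on 10.7.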

Tracking this back a bit more, that encoding comes from the property "sun.jnu.encoding", which in turn comes from GetJavaProperties in jdk/src/share/native/java/lang/System.c.  Going back further, the function ParseLocale in jdk/src/solaris/native/java/lang/java_props_md.c calls setlocale.  The way it calls setlocale basically boils down to:

	setlocale(LC_ALL, "");
	char * lc = setlocale(LC_CTYPE, NULL);

On OS X 10.6, lc is set to "en_US.UTF-8".  On OS X 10.7, it is set to "C".  This then falls through to setting the locale to en_US instead of en_US.UTF-8.

Later in ParseLocale, nl_langinfo(CODESET) is called.  On OS X 10.6, it returns UTF-8 but on 10.7 it returns US-ASCII.  ParseLocale relies on the return from nl_langinfo to ultimately set the sun.jnu.encoding property.
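
For anyone who wants to check this outside the JDK, the two calls can be isolated in a small helper.  This is only a sketch: probe_codeset and its parameters are names I made up, it is not code from ParseLocale, and the values it reports depend on the OS and on the LANG/LC_* environment variables.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

/*
 * Hypothetical helper, not JDK code. Selects the locale named by
 * locale_name ("" means "take it from the environment"), then copies
 * out the LC_CTYPE locale name and the codeset from nl_langinfo.
 * Returns 0 on success, -1 if setlocale rejects the locale name.
 */
static int probe_codeset(const char *locale_name,
                         char *ctype, size_t ctype_len,
                         char *codeset, size_t codeset_len)
{
    const char *lc;
    const char *cs;

    if (setlocale(LC_ALL, locale_name) == NULL) {
        return -1;
    }
    lc = setlocale(LC_CTYPE, NULL);   /* query only, does not change it */
    cs = nl_langinfo(CODESET);

    snprintf(ctype, ctype_len, "%s", lc != NULL ? lc : "");
    snprintf(codeset, codeset_len, "%s", cs != NULL ? cs : "");
    return 0;
}
```

Calling probe_codeset("", ...) from a main() and printing both buffers should show the 10.6 versus 10.7 difference described above.  Running the same binary with LANG or LC_ALL set explicitly (for example to en_US.UTF-8) would help tell whether 10.7 changed the locale functions themselves or only the environment that processes are launched with.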

I'm not sure if the way the JDK is calling setlocale and nl_langinfo is wrong or if OS X 10.7 has a bug.  It doesn't seem reasonable to change the return values so drastically.

I have a patch that fixes this that I can contribute, but I think a little discussion of the right way to fix this is in order.  There appear to be some other calls to setlocale(), and there is a call to nl_langinfo(CODESET) in utfInitialize in jdk/src/solaris/npt/utf_md.c that concerns me, since that looks to be deep in the guts of the Unicode processing.

Is there a set of tests for Unicode handling that can be run?  I'd be happy to take this on as an issue.  Also, is there some magic involved in getting a build that can be debugged?  I tried the "debug_build" target but the output seems to be only about half debuggable.

Thanks,
Dave Smith
iGeek, Inc.