Non-ASCII characters failing on Lion, multiple possible other locale based bugs
David Smith-Uchida
dave at igeekinc.com
Tue Aug 2 07:35:36 PDT 2011
Well, I tracked down the setlocale/nl_langinfo discrepancy. I'd forgotten to double-check it in a local window: I was logged in to the Lion machine remotely over SSH, and you get different answers depending on whether you are logged in remotely or connected to the window server. So it's not Lion-specific, and you will hit the issue whenever you run via ssh or headless/in the background. Since you get the same answers from setlocale/nl_langinfo on Snow Leopard, whatever code is in JDK 1.6 should be a solution for this.
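For anyone who wants to compare the two cases themselves, here is a rough sketch of the kind of check I mean. It is not the JDK code, just the same two calls wrapped in a helper; the names locale_report and probe_locale are made up for this example. Passing "" as the locale name asks setlocale to take the locale from the environment, which is what the JDK's call boils down to.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper type (not from the JDK): what the two calls report. */
struct locale_report {
    char ctype[64];    /* result of setlocale(LC_CTYPE, NULL) */
    char codeset[64];  /* result of nl_langinfo(CODESET) */
};

/*
 * Select a locale and record what the C library reports for it.
 * name == "" means "take the locale from the environment".
 * Returns 0 on success, -1 if setlocale() rejects the request.
 */
int probe_locale(const char *name, struct locale_report *out)
{
    assert(name != NULL && out != NULL);

    if (setlocale(LC_ALL, name) == NULL)
        return -1;

    const char *lc = setlocale(LC_CTYPE, NULL); /* query only, no change */
    const char *cs = nl_langinfo(CODESET);

    snprintf(out->ctype, sizeof out->ctype, "%s", lc ? lc : "");
    snprintf(out->codeset, sizeof out->codeset, "%s", cs ? cs : "");
    return 0;
}

/* Print one line that can be compared between a local and an ssh session. */
void print_locale_report(const struct locale_report *r)
{
    printf("LC_CTYPE=%s CODESET=%s\n", r->ctype, r->codeset);
}
```

Calling probe_locale("", &r) followed by print_locale_report(&r) from a small main(), once in a local terminal window and once over ssh, should show whether the two sessions disagree.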
I'm not sure I understand why there is a locale lookup for the file system APIs, though. As far as I know, all Unix systems will only use UTF-8 through the section 2 APIs, so there's really no way to get different character sets back. OS 9 had different character sets in the APIs depending on the locale setting, and I'm sure older Windows did as well.
http://java.net/jira/browse/MACOSX_PORT-165 is related somehow but I'm not sure yet how. I'm going to try to track it all down and get the info into the bugs.
On Aug 2, 2011, at 4:25 AM, Mike Swingler wrote:
> On Aug 1, 2011, at 9:08 AM, David Smith-Uchida wrote:
>
>> I've logged a bug (http://java.net/jira/browse/MACOSX_PORT-204), but there's a pretty serious problem with OpenJDK on Lion and I wanted to get some opinions on this. Non-ASCII filenames (e.g. Japanese or Chinese) are completely failing: they are converted to "?????" (some number of question marks) when you try to create them, and when you try to read them you get garbage Unicode characters.
>
> We have yet to integrate the native locale handling that Java SE 6 on Mac OS X does, but that work is being tracked in: <http://java.net/jira/browse/MACOSX_PORT-38>.
>
>> I've been tracking through the code and I think I've identified what's going on.
>>
>> JNU_GetStringPlatformChars in jdk/src/share/native/common/jni_util.c is returning "UTF-8" on OS X 10.6 and "US-ASCII" on OS X 10.7. This is the encoding that is then used to convert Java strings to/from the file system encoding.
>>
>> Tracking this back a bit more, I tracked the source of that encoding as coming from the property "sun.jnu.encoding" which is coming from GetJavaProperties in jdk/src/share/native/java/lang/System.c. Tracking that back a bit more shows that the function ParseLocale in jdk/src/solaris/native/java/lang/java_props_md.c is calling setlocale. The way it calls setlocale basically boils down to:
>>
>> setlocale(LC_ALL, "");
>> char * lc = setlocale(LC_CTYPE, NULL);
>>
>> On OS X 10.6, lc is set to "en_US.UTF-8". On OS X 10.7, it is set to "C". This then falls through to setting the locale to en_US instead of en_US.UTF-8.
>
> This is odd. We'll have to look at this when we address this issue in greater detail. If you have a small native .c test case that shows two different behaviors between 10.6 and 10.7, I'd suggest filing a bug at <http://bugreporter.apple.com>, and let me know what the bug ID is so I can track it. Being a brand new OS, Lion certainly may have some issues that we may have to address in a software update.
>
>> Later in ParseLocale, nl_langinfo(CODESET) is called. On OS X 10.6, it returns UTF-8 but on 10.7 it returns US-ASCII. ParseLocale relies on the return from nl_langinfo to ultimately set the sun.jnu.encoding property.
>>
>> I'm not sure if the way the JDK is calling setlocale and nl_langinfo is wrong or if OS X 10.7 has a bug. It doesn't seem reasonable to change the return values so drastically.
>>
>> I have a patch that fixes this that I can contribute, but I think a little discussion of the right way to fix this is in order. There appear to be some other calls to setlocale(), and there is a call to nl_langinfo(CODESET) in utfInitialize in jdk/src/solaris/npt/utf_md.c that concerns me, since that looks to be deep in the guts of the Unicode processing.
>
> While the patch in MACOSX_PORT-38 gets us part of the way towards fixing the issue, we also need to integrate some code that we use that negotiates between what locales Java is aware of, and the locales the CFBundle is aware of, so we can pick the best one they can both agree on. Otherwise, higher levels of the UI will be very confused if Java uses one and the native NS/CFBundle resource loader is using another.
>
>> Is there a set of tests for Unicode handling that can be run? I'd be happy to take this on as an issue. Also, is there some magic involved in getting a build that can be debugged? I tried the "debug_build" target but the output seems to be only about half debuggable.
>
> We will be adding more tests in the area to the JTreg test suite as we improve the state of the actual locale code itself.
>
> Regards,
> Mike Swingler
> Java Engineering
> Apple Inc.