RFR 8124977 cmdline encoding challenges on Windows
Xueming Shen
xueming.shen at oracle.com
Wed Feb 24 06:43:04 UTC 2016
On 2/23/16 9:52 PM, Vladimir Shcherbakov wrote:
> Hi Sherman,
>
> 1) If you can point out the regression test cases that are compromised by the fix - it would be very helpful;
I don't have a specific regression test for now. I guess running an
"old" AWT app with Japanese/Chinese characters as menu items might help
show the issue, if I read those AWT lines correctly.
> 2) From my understanding you can change the default encoding by starting java with -Dsun.jnu.encoding=UTF-8 - this is a well-known feature that never caused problems (javac doesn't have such a switch);
I have been telling people for a decade that -Dfile.encoding is not a
"supported" usage/feature :-) as it does cause "inconsistent" behavior
on different platforms in different usage scenarios. And then there is
sun.jnu.encoding, which was definitely never intended to be specified
via -D. That's a pure contract between the java runtime and the
underlying OS on how strings/text should be encoded when using those
platform APIs. I think I forwarded the internal CCC doc for
sun.jnu.encoding a while back; no -Dsun.jnu.encoding=XYZ please :-)
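To see what the runtime actually picked, it is enough to read the
properties rather than set them. A minimal sketch follows; the class
and method names are made up for illustration, and since
sun.jnu.encoding is an internal property it may simply print as null on
runtimes that do not define it.

```java
import java.nio.charset.Charset;

public class EncodingProps {
    // Builds a short report of the encodings the runtime selected at startup.
    // Nothing is set here; the properties are only read.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding") + "\n"
             + "sun.jnu.encoding=" + System.getProperty("sun.jnu.encoding") + "\n"
             + "defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running this on machines with different system locales is a cheap way
to see how the values differ from one setup to another.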
> 3) If you state that java is non-Unicode on Windows by nature - the issue JDK-8124977 is a feature not a bug :)
Ideally we should run the java runtime as a Unicode app. The launcher
is not a big issue. The concern is the interface with the JVM for those
"char*".
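A small illustration of why the char* contract matters: bytes written
under one encoding and read under another do not round-trip. This is
only a sketch in Java, not the native code itself; ISO-8859-1 stands in
for an "ANSI" code page, and the class and method names are
hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharStarMismatch {
    // Encodes with one charset and decodes with another, the way the writer
    // and the reader of a native char* would if they disagreed on the encoding.
    static String roundTrip(String s, Charset writer, Charset reader) {
        byte[] nativeBytes = s.getBytes(writer);
        return new String(nativeBytes, reader);
    }

    public static void main(String[] args) {
        String name = "i18nH\u00e9lloWorld.java";

        // Both sides agree: the name survives.
        String same = roundTrip(name, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        // Writer uses UTF-8, reader assumes a single-byte code page: the name is damaged.
        String mixed = roundTrip(name, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        System.out.println("agreeing sides preserve the name: " + same.equals(name));
        System.out.println("mismatched sides preserve the name: " + mixed.equals(name));
    }
}
```

A file lookup done with the damaged name would fail in the same way a
"file not found" for a non-ASCII file name does.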
Sherman
>
> Thanks,
> Vladimir.
>
> -----Original Message-----
> From: Xueming Shen [mailto:xueming.shen at oracle.com]
> Sent: Tuesday, February 23, 2016 8:54 PM
> To: Vladimir Shcherbakov <vlashch at microsoft.com>
> Cc: Naoto Sato <naoto.sato at oracle.com>; Kumar Srinivasan <kumar.x.srinivasan at oracle.com>; Martin Sawicki <marcins at microsoft.com>; core-libs-dev Libs <core-libs-dev at openjdk.java.net>
> Subject: Re: RFR 8124977 cmdline encoding challenges on Windows
>
> Vladimir,
>
> sun.jnu.encoding is used by JNU_NewStringPlatform/JNU_GetStringPlatformChars. The JNU_ pair is "widely" used by the various native library code to convert between jstring and native char*, on the assumption that the "platform encoding" for the native char* is the "default" encoding used by the underlying platform/OS APIs that take char* parameters or return char* values; in the case of Windows, it's the code page determined by the system locale. We have migrated certain areas completely to the "W" version/WChar APIs, such as java.io and the system properties initialization, but I think lots of areas still work on the "A" APIs. In particular, I think the "char*" interface between the JVM and the libraries still uses the "ANSI" code page, not UTF-8. Those that work on UTF-8 have their names explicitly marked as "xyzUTF" or similar.
>
> For example, the "java_home_dir" path used in libjava/TimeZone.c/getSystemTimeZoneID/
> TimeZone_md.c/findJavaTZ_md is encoded from the jstring java_home to char* via JNU_GetStringPlatformChars.
> Simply changing/hardcoding sun.jnu.encoding to UTF-8 will probably cause the timezone code to stop working if the java_home_dir path has non-ASCII characters in it (the JDK/JRE is installed in a Japanese/Chinese directory, for example).
>
> A quick "grep" indicates the java.desktop/windows/native/libawt/windows
> package makes heavy use of the JNU_ pair as well. I'm not sure if this AWT implementation is still being used though :-)
>
> Before we clear all these internal "StringPlatform" use cases (I'm not sure if they are also used externally), I don't think we can simply set sun.jnu.encoding to UTF-8, though it's very attractive.
>
> Thanks,
> -Sherman
>
> On 2/23/16 4:34 PM, Naoto Sato wrote:
>> Hi Vladimir,
>>
>> I think it would work fine with the Java launcher, but what about
>> other areas, which may rely on the native encodings? The Java runtime
>> is in itself a "non-Unicode" application, so there may still be areas
>> affected by hardcoding "UTF-8" as the native encoding. Have you
>> checked such cases? Sherman, will you comment on this too?
>>
>> Naoto
>>
>> On 2/23/16 2:12 PM, Vladimir Shcherbakov wrote:
>>> Hi Naoto,
>>>
>>> 1) The system locale determines which code page is used on the system
>>> by default on operating systems that use Unicode as their native
>>> encoding (all OSes from Windows 2000 to Windows 10) to convert text
>>> data from Unicode to code page whenever dealing with legacy
>>> non-Unicode applications. Only applications that do not use Unicode
>>> as their default character-encoding mechanism are affected by this
>>> setting; therefore, applications that are already Unicode-encoded can
>>> safely ignore the value and functionality of this setting.
>>>
>>> 2) The fundamental representation of text in Windows NT-based
>>> operating systems is UTF-16, and the WCHAR data type is a UTF-16 code
>>> unit. The Java launcher, on the other hand, uses CHAR as a code unit,
>>> so to use the Unicode charset with the Java launcher we had to encode
>>> the entire command line as UTF-8 (convert from UTF-16 to UTF-8). After
>>> that step we can state that the Java launcher is Unicode-encoded and
>>> can safely ignore the value and functionality of the system locale. To
>>> let the JVM know that we use UTF-8 as the default Unicode encoding for
>>> platform strings, we assign the value to the sprops.sun_jnu_encoding
>>> property (Mac OS X does the same) instead of reading the system locale
>>> code page.
>>>
>>> The main idea of the fix was to change the way java and javac work
>>> with the so-called platform string on Windows. Before the fix the
>>> platform string was read as ANSI-encoded, which is why the system
>>> locale code page was so important. The sun.jnu.encoding property is
>>> responsible for storing the platform string encoding. On Windows the
>>> property could be set either from the system locale (which by design
>>> doesn't support UTF-8) or with the -Dsun.jnu.encoding switch; however,
>>> the switch only works with java, not with javac, and it was useless
>>> for an ANSI-encoded platform string.
>>>
>>> Thanks,
>>> Vladimir.
>>>
>>> -----Original Message-----
>>> From: Naoto Sato [mailto:naoto.sato at oracle.com]
>>> Sent: Tuesday, February 23, 2016 10:47 AM
>>> To: Kumar Srinivasan <kumar.x.srinivasan at oracle.com>; Vladimir
>>> Shcherbakov <vlashch at microsoft.com>; SHEN,XUEMING
>>> <xueming.shen at oracle.com>
>>> Cc: Martin Sawicki <marcins at microsoft.com>; core-libs-dev Libs
>>> <core-libs-dev at openjdk.java.net>
>>> Subject: Re: RFR 8124977 cmdline encoding challenges on Windows
>>>
>>> Hello,
>>>
>>> Sorry if this has already been discussed, but this is my first time
>>> looking at the fix. In java_props_md.c, sprops.sun_jnu_encoding is
>>> now always "UTF-8". Is it always the case? What if the system admin
>>> switches the locale for "non-Unicode" applications in the Windows
>>> control panel?
>>>
>>> Naoto
>>>
>>> On 2/22/16 8:00 AM, Kumar Srinivasan wrote:
>>>> Hi Naoto, Sherman, can you please take a look.
>>>> I tested with the jprt build and test; all tests pass.
>>>>
>>>> Hi Vladimir, et. al.,
>>>>
>>>> It appears that there have been more simplifications since the
>>>> previous webrev.04. :-)
>>>>
>>>> It would've helped if you had highlighted the changes you made from
>>>> the previous revision; unfortunately this is one of the deficiencies
>>>> of webrev.
>>>>
>>>> There are some inconsistencies in the coding conventions:
>>>>
>>>> parse_manifest.c
>>>> + if (q == 0) return -1;
>>>>
>>>> we expect the return to be on the next line.
>>>>
>>>> similarly main.c
>>>>
>>>> if (0 == q)
>>>> {
>>>>
>>>> I can fix those up. If I were to push this change, who should I
>>>> attribute the changes to ? ie. in the Contributed-by: line of the
>>>> commit info ?
>>>> Please note these have to be email addresses of the contributors.
>>>>
>>>> Thanks
>>>> Kumar
>>>>
>>>>> Hi Kumar,
>>>>>
>>>>> We posted another web review here:
>>>>> http://cr.openjdk.java.net/~kshoop/8124977/webrev.05/
>>>>>
>>>>> The patch was successfully tested.
>>>>>
>>>>> Test details:
>>>>> * Regression tests folder: jdk/test/tools/launcher/
>>>>> * Builds used: windows-x86_64-normal-server-fastdebug,
>>>>> windows-x86_64-normal-server-release,
>>>>> windows-x86-normal-server-release;
>>>>> * Platforms used: Windows 7 (64 bit), Windows 8.1, Windows Server
>>>>> 2012 R2 DC, Windows 10;
>>>>> * System locales used: English (United States), Persian,
>>>>> Japanese (Japan), Chinese (Traditional, Taiwan), Russian (Russia);
>>>>>
>>>>> Thanks,
>>>>> Vladimir.
>>>>>
>>>>> -----Original Message-----
>>>>> From: Martin Sawicki
>>>>> Sent: Thursday, January 14, 2016 11:34 AM
>>>>> To: Kumar Srinivasan <kumar.x.srinivasan at oracle.com>; Vladimir
>>>>> Shcherbakov <vlashch at microsoft.com>
>>>>> Cc: core-libs-dev Libs <core-libs-dev at openjdk.java.net>; Naoto Sato
>>>>> <naoto.sato at oracle.com>
>>>>> Subject: RE: RFR 8124977 cmdline encoding challenges on Windows
>>>>>
>>>>> Thanks for the feedback.
>>>>> Investigating the regression failure.
>>>>> We'll get back as soon as we figure this out. (and yes, we'll run
>>>>> this through some localized Windows VMs)
>>>>>
>>>>> Cheers
>>>>>
>>>>> -----Original Message-----
>>>>> From: Kumar Srinivasan [mailto:kumar.x.srinivasan at oracle.com]
>>>>> Sent: Tuesday, January 12, 2016 2:35 PM
>>>>> To: Martin Sawicki <marcins at microsoft.com>; Vladimir Shcherbakov
>>>>> <vlashch at microsoft.com>
>>>>> Cc: core-libs-dev Libs <core-libs-dev at openjdk.java.net>; Naoto Sato
>>>>> <naoto.sato at oracle.com>
>>>>> Subject: Re: RFR 8124977 cmdline encoding challenges on Windows
>>>>>
>>>>> Hi Martin, Vladimir,
>>>>>
>>>>> It was suggested that this patch be tested on localized Windows
>>>>> machines and/or with the various Windows native encodings. I'd
>>>>> appreciate it if you could verify this as well.
>>>>>
>>>>> Thanks
>>>>> Kumar
>>>>>
>>>>> On 1/11/2016 1:10 PM, Kumar Srinivasan wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I was on vacation; I have started to prepare the patch from
>>>>>> webrev.04 for integration. Please note: I made some adjustments to
>>>>>> your patch to pass jcheck, i.e. usage of tabs and spaces at line
>>>>>> endings, and modifications to Copyright dates.
>>>>>>
>>>>>> Also fixed a minor bug on unix: replaced JLI_TRUE with JNI_TRUE.
>>>>>> I have attached a patch for your reference.
>>>>>>
>>>>>> However, there is a regression test failure on Windows,
>>>>>> jdk/test/tools/launcher/I18NTest.java
>>>>>>
>>>>>> ---Test info----
>>>>>> Executed command: C:\mmm\jdk\bin\javac.exe i18nH▒lloWorld.java
>>>>>>
>>>>>> ++++Test Output++++
>>>>>> javac: file not found: i18nHélloWorld.java
>>>>>> ----End test info-----
>>>>>>
>>>>>> Have you run all the launcher regression tests with this changeset ?
>>>>>>
>>>>>> Thanks
>>>>>> Kumar
>>>>>>
>>>>>>> Hi Kumar, just wondering if there are any updates on processing
>>>>>>> this submission.
>>>>>>> Thanks!
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Vladimir Shcherbakov
>>>>>>> Sent: Wednesday, November 25, 2015 2:38 PM
>>>>>>> To: Kumar Srinivasan <kumar.x.srinivasan at oracle.com>; Martin
>>>>>>> Sawicki <marcins at microsoft.com>
>>>>>>> Cc: Kirk Shoop <Kirk.Shoop at microsoft.com>; core-libs-dev Libs
>>>>>>> <core-libs-dev at openjdk.java.net>
>>>>>>> Subject: RE: RFR 8124977 cmdline encoding challenges on Windows
>>>>>>>
>>>>>>> Hi Kumar,
>>>>>>>
>>>>>>> Please find updated webreview here:
>>>>>>> http://cr.openjdk.java.net/~kshoop/8124977/webrev.04/
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Vladimir.
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Kumar Srinivasan [mailto:kumar.x.srinivasan at oracle.com]
>>>>>>> Sent: Sunday, November 22, 2015 8:14 AM
>>>>>>> To: Martin Sawicki <marcins at microsoft.com>
>>>>>>> Cc: Kirk Shoop <Kirk.Shoop at microsoft.com>; Vladimir Shcherbakov
>>>>>>> <vlashch at microsoft.com>; core-libs-dev Libs
>>>>>>> <core-libs-dev at openjdk.java.net>
>>>>>>> Subject: Re: RFR 8124977 cmdline encoding challenges on Windows
>>>>>>>
>>>>>>>
>>>>>>> Hi Martin, et. al.,
>>>>>>>
>>>>>>> Sorry for not getting back earlier, I am very busy right now with
>>>>>>> my other large commitments for JDK9.
>>>>>>>
>>>>>>> I will sponsor this "enhancement/bug fix" sometime in the new
>>>>>>> year, meanwhile, there is the changeset [1] which is likely to
>>>>>>> cause merge conflicts, and perhaps logic issues.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Kumar
>>>>>>>
>>>>>>> [1]
>>>>>>> http://hg.openjdk.java.net/jdk9/dev/jdk/rev/3b201a9ef918
>>>>>>>> Hi all
>>>>>>>> Here's an updated webrev attempting to take into account the
>>>>>>>> various pieces of feedback we have received:
>>>>>>>>
>>>>>>>> Issue:
>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8124977
>>>>>>>> Webrev:
>>>>>>>> http://cr.openjdk.java.net/~kshoop/8124977/webrev.03/
>>>>>>>>
>>>>>>>> (Vladimir Shcherbakov is now working on this from our side)
>>>>>>>>
>>>>>>>> Looking forward to any other feedback.
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: core-libs-dev
>>>>>>>> [mailto:core-libs-dev-bounces at openjdk.java.net]
>>>>>>>> On Behalf Of Kumar Srinivasan
>>>>>>>> Sent: Thursday, June 25, 2015 6:26 AM
>>>>>>>> To: Kirk Shoop (MS OPEN TECH) <Kirk.Shoop at microsoft.com>
>>>>>>>> Cc: Valery Kopylov (Akvelon) <v-valkop at microsoft.com>;
>>>>>>>> core-libs-dev Libs <core-libs-dev at openjdk.java.net>
>>>>>>>> Subject: Re: RFR 8124977 cmdline encoding challenges on Windows
>>>>>>>>
>>>>>>>> Hi Kirk,
>>>>>>>>
>>>>>>>> Thanks for proposing this change.
>>>>>>>>
>>>>>>>> If you notice, all the posix calls are wrapped in JLI_*; this
>>>>>>>> gives us the ability to use "W" functions. I almost got it
>>>>>>>> done several years ago, but we upgraded to VS2010 and my work
>>>>>>>> based on VS2003 keeled over; meanwhile my focus was "shifted"
>>>>>>>> to something else.
>>>>>>>>
>>>>>>>> main.c: is really envisioned to be a stub compiled by the tool
>>>>>>>> launchers, like java, javac, javah, jar etc. I prefer to see all
>>>>>>>> the heavy logic in this file moved to the platform specific file
>>>>>>>> windows/java_md.*
>>>>>>>>
>>>>>>>> For the reason specified above we need to move fprintf or any
>>>>>>>> naked posix calls to JLI_* indirections.
>>>>>>>>
>>>>>>>> I don't see any tests ? The tests must be written in java and
>>>>>>>> placed in jdk/test/tools/launcher, there is a helper framework
>>>>>>>> TestHelper.java.
>>>>>>>>
>>>>>>>> There are other changes in nio, charsets etc, this will be
>>>>>>>> reviewed by my colleague specializing in that area (Sherman) cc'ed.
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Kumar
>>>>>>>>
>>>>>>>> On 6/22/2015 2:01 PM, Kirk Shoop (MS OPEN TECH) wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Issue:
>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8124977
>>>>>>>>>
>>>>>>>>> Webrev:
>>>>>>>>> http://cr.openjdk.java.net/~kshoop/8124977/
>>>>>>>>>
>>>>>>>>> This webrev intends to address interaction between Windows
>>>>>>>>> console and java apps.
>>>>>>>>>
>>>>>>>>> Two switches were added that change the behavior of the launcher.
>>>>>>>>> The defaults do not change the launcher behavior.
>>>>>>>>>
>>>>>>>>> -Dwindows.UnicodeConsole=true - switches on Unicode
>>>>>>>>> support in the Windows console. This optional switch causes the
>>>>>>>>> launcher to call GetCommandLineW() and parse the arguments in
>>>>>>>>> unicode. It also modifies how the codepage for console output is selected.
>>>>>>>>>
>>>>>>>>> -Dfile.encoding.unicode="UTF-8" - identifies Unicode
>>>>>>>>> charset to use; If not specified, UTF-8 is used by default.
>>>>>>>>> Ignored when windows.UnicodeConsole is not set to true. When
>>>>>>>>> the first switch is used, this optional switch allows the
>>>>>>>>> codepage for console output to be controlled.
>>>>>>>>>
>>>>>>>>> I would like to get feedback on the approach here and any
>>>>>>>>> additional work that is required to solve these particular
>>>>>>>>> Unicode issues on Windows.
>>>>>>>>>
>>>>>>>>> Kirk