Codereview request for 4153167: separate between ANSI and OEM code pages on Windows

Xueming Shen

13 Feb 2012

5:36 p.m.

Hi,

This is a long-standing Windows codepage support issue on the Java platform (we probably have 20 bugs/RFEs filed for this particular issue and closed as duplicates of 4153167).

Windows supports two sets of codepages: the ANSI (Windows) codepage and the OEM (IBM) codepage. Windows uses the ANSI/Windows codepage almost "everywhere" except in its DOS/command prompt window, which uses the OEM codepage. For example, on a normal English Windows, the default Windows codepage is Cp1252 <http://msdn.microsoft.com/en-us/goglobal/cc305145> (West European Latin), whereas the OEM codepage used in its DOS/command prompt is Cp437 <http://msdn.microsoft.com/en-us/goglobal/cc305156> (you can use the chcp command to check/change the "active" codepage used in your DOS/command prompt). These two obviously have different mappings for certain code points, for example the umlaut characters.

The J2SE runtime chooses the ANSI/Windows codepage as its default charset for its I/O character reading/writing, graphic text display, etc., including System.out and System.err. This causes problems when the ANSI codepage and the OEM codepage are not "compatible" and you happen to need to write those "incompatible" characters to the DOS/command prompt, as shown in the following test case:

    String umlaut = "\u00f6\u00e4\u00fc\u00d6\u00c4\u00dc\u00df";
    PrintWriter ps = new PrintWriter(new OutputStreamWriter(System.out, "Cp437"), true);
    ps.println("ps.cp437: " + umlaut);
    System.out.println("sys.out : " + umlaut);
    System.err.println("sys.err : " + umlaut);

You will see the umlauts displayed correctly from the PrintWriter with the explicit Cp437 encoding setting, but garbled from System.out and System.err (because both System.out and System.err use the default charset Cp1252, which is also used for all necessary Unicode <-> Windows encoding conversion for that particular VM instance).

For years, we have been debating whether and how we should fix this issue, and whether we want to have two "default charsets" for I/O. In jdk6, we provided a java.io.Console class that specifically uses the OEM codepage when running on the Windows DOS/command prompt. However, the feedback is that people still want System.out/err to work correctly with the DOS/command prompt when the OEM codepage in use is not "compatible" with the default Windows codepage.

The proposed change is to use the OEM codepage for System.out/err when the VM is started without its stdout/stderr redirected to something else, such as a file (making sure to use the OEM codepage only for the DOS/command prompt). If the VM's stdout/stderr is redirected, System.out/err continues to use the default charset (file.encoding). I believe this approach solves the problem without breaking any existing assumption/use scenario.

The webrev is at

http://cr.openjdk.java.net/~sherman/4153167/webrev

Here is a simple "manual" test case.

    import java.io.Console;
    import java.io.OutputStreamWriter;
    import java.io.PrintWriter;

    public class HelloWorld {
        public static void main(String[] args) throws Exception {
            String umlaut = "\u00f6\u00e4\u00fc\u00d6\u00c4\u00dc\u00df";

            // Show which encodings the VM picked for each stream
            System.out.println("file.encoding =" + System.getProperty("file.encoding"));
            System.out.println("stdout.encoding=" + System.getProperty("sun.stdout.encoding"));
            System.out.println("stderr.encoding=" + System.getProperty("sun.stderr.encoding"));
            System.out.println("-----------------------");

            // Reference output with an explicit OEM encoding
            PrintWriter ps = new PrintWriter(new OutputStreamWriter(System.out, "Cp437"), true);
            ps.println("ps.cp437: " + umlaut);
            System.out.println("sys.out : " + umlaut);
            System.err.println("sys.err : " + umlaut);

            // System.console() is null when no console is attached
            Console con = System.console();
            if (con != null)
                con.printf("console : %s%n", umlaut);
        }
    }

-Sherman
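Editorial illustration of the mapping difference described above: a minimal sketch, not part of the original message or the webrev. It assumes the JDK in use ships the extended charsets, so that the names "windows-1252" and "IBM437" resolve; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;

public class UmlautBytes {
    // Hex-dump the bytes that a string encodes to under the named charset.
    static String hex(String s, String charsetName) {
        byte[] bytes = s.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String umlaut = "\u00f6\u00e4\u00fc"; // o-umlaut, a-umlaut, u-umlaut
        // The same three characters map to different byte values in each codepage.
        System.out.println("windows-1252: " + hex(umlaut, "windows-1252")); // F6 E4 FC
        System.out.println("IBM437      : " + hex(umlaut, "IBM437"));       // 94 84 81
    }
}
```

A console that interprets bytes as Cp437 therefore shows different glyphs when it receives the Cp1252 bytes, which is the garbling the test case makes visible.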

Ulf Zibis

13 Feb

6:15 p.m.

Interesting issue, especially for us Germans!

What about System.in, if one types some umlauts at the Windows console?

Why are there theoretically different code pages for stdout and stderr?

-Ulf

On 13.02.2012 18:36, Xueming Shen wrote:

...


Xueming Shen

6:35 p.m.

On 2/13/2012 10:15 AM, Ulf Zibis wrote:

...

Interesting issue, especially for us germans!

What is about System.in, if one types some umlaute at Windows console?

System.in is an "InputStream"; no charset is involved there. You build your own "reader" on top of it yourself.

...

Why are there theoretically different code pages for stdout and stderr?

You can redirect stderr to a log file but keep stdout on the console, or redirect stdout but keep stderr on the console. In these scenarios, stderr and stdout will use different code pages. Basically, the approach is that if the output stream gets redirected, it keeps using the default charset (on the assumption that the rest of the world is using the Windows codepage); if not, it uses the OEM codepage from the console on Windows, to make sure System.out/err outputs bytes that the underlying console can understand.

-Sherman
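Editorial illustration of the per-stream decision described above: a minimal sketch, not the code in the webrev. The `isConsole` flag stands in for whatever the native layer reports about each stream, and all names here are hypothetical. It assumes the "windows-1252" and "IBM437" charsets are available in the JDK in use.

```java
import java.nio.charset.Charset;

public class StreamCharsetChoice {
    // Pick the charset for one standard stream: the console (OEM) charset when
    // the stream is attached to the console, otherwise the default charset.
    static Charset choose(boolean isConsole, Charset consoleCharset, Charset defaultCharset) {
        return isConsole ? consoleCharset : defaultCharset;
    }

    public static void main(String[] args) {
        Charset ansi = Charset.forName("windows-1252"); // stands in for file.encoding
        Charset oem = Charset.forName("IBM437");        // stands in for the console codepage

        // Example scenario: stdout stays on the console, stderr is redirected to a file.
        System.out.println("stdout -> " + choose(true, oem, ansi));
        System.out.println("stderr -> " + choose(false, oem, ansi));
    }
}
```

Because the decision is made separately for each stream, stdout and stderr can end up with different charsets in the same VM instance.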


Ulf Zibis

9:20 p.m.

Am 13.02.2012 19:35, schrieb Xueming Shen:

...

On 2/13/2012 10:15 AM, Ulf Zibis wrote:

...
Interesting issue, especially for us germans!

What is about System.in, if one types some umlaute at Windows console?

System.in is a "InputStream", no charset involved there, you build your own "reader" on top of that yourself.

Well, in the normal case, one would use an InputStreamReader with the default charset. At the Windows console, characters would then likely be decoded wrongly. So IMO there should be a mechanism by which e.g. InputStreamReader chooses the correct OEM charset, if no charset is explicitly specified and the underlying input stream System.in is reading directly from the Windows console.

...

...
Why are there theoretically different code pages for stdout and stderr?

you can re-direct std err to a log file but keep the std out to the console, or re-direct the std out but keep the std.err to the console, in these scenario, the stderr and stdout will use different code page. Basically the approach is that if the output stream gets re-directed, it keeps using the default charset (with the assumption that the rest of the world is using the Windows codepage), if not, use the oem codepage from the console on Windows, to make sure the System.out/err outputs the bits that the underlying console can understand.

Oops, I'm not sure whether you misunderstood me. I mean, why are there 2 different properties, "sun.stdout.encoding" and "sun.stderr.encoding"? Wouldn't something like "console.encoding" be enough, as a counterpart to "file.encoding"?

-Ulf
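Editorial illustration of the System.in concern raised above: a minimal sketch that simulates console input with a byte array instead of a real console. It assumes the "IBM437" and "windows-1252" charsets are available; the class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

public class ConsoleInputDecoding {
    // Read one line from the stream, decoding its bytes with the named charset.
    static String readLine(InputStream in, String charsetName) {
        try {
            BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, Charset.forName(charsetName)));
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Bytes a Cp437 console would deliver for o-umlaut, a-umlaut, u-umlaut, newline.
        byte[] typed = { (byte) 0x94, (byte) 0x84, (byte) 0x81, '\n' };
        String expected = "\u00f6\u00e4\u00fc";

        String asOem = readLine(new ByteArrayInputStream(typed), "IBM437");
        String asAnsi = readLine(new ByteArrayInputStream(typed), "windows-1252");

        System.out.println("decoded as IBM437 matches: " + asOem.equals(expected));        // true
        System.out.println("decoded as windows-1252 matches: " + asAnsi.equals(expected)); // false
    }
}
```

Wrapping System.in in an InputStreamReader with an explicitly chosen charset is the "build your own reader" approach mentioned in the thread; the sketch shows why the default (ANSI) charset is the wrong choice for bytes typed at an OEM console.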

Xueming Shen

11:02 p.m.

Having separate sun.stdout.encoding and sun.stderr.encoding properties is mainly a matter of implementation convenience. I need three things from the native code: (1) is stdout a tty, (2) is stderr a tty, and (3) the console encoding if (1) or (2) is true. I tried to avoid going down to native code multiple times, and passing back two encoding names appeared to be the easiest approach. The original plan was to remove them after use, maybe via sun.misc.VM.saveAndRemoveProperties() (or simply remove them directly), but then I thought the info might be useful...

Auto-detecting the encoding for an InputStreamReader when it is attached to the console would be nice to have, but I would try to avoid doing that until I have to. Until then I would still advise using the java.io.Console class :-)

-Sherman
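Editorial illustration of how calling code might read these properties defensively: a minimal sketch, assuming only what the thread states, namely that the proposed patch may set sun.stdout.encoding and sun.stderr.encoding. The properties are implementation details and may be absent, so the sketch falls back to the default charset. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;

public class StdStreamEncoding {
    // Resolve a charset from a system property, falling back to the default
    // charset when the property is unset or names an unknown charset.
    static Charset fromProperty(String propertyName) {
        String name = System.getProperty(propertyName);
        if (name != null) {
            try {
                return Charset.forName(name);
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: fall through to the default.
            }
        }
        return Charset.defaultCharset();
    }

    public static void main(String[] args) {
        System.out.println("stdout charset: " + fromProperty("sun.stdout.encoding"));
        System.out.println("stderr charset: " + fromProperty("sun.stderr.encoding"));
    }
}
```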


Ulf Zibis

11:15 p.m.

Sherman, thanks for your additional explanation.

One more nit... Why do you use the "sun." prefix? I think "stdout.encoding" and "stderr.encoding" would be enough, and nicer. In some years, nobody will have any association with 'sun'. On the other hand, it would be more accurate to use "windows.stdout.encoding" and "windows.stderr.encoding".

-Ulf

On 14.02.2012 00:02, Xueming Shen wrote:

...


Bill Shannon

7:07 p.m.

Thanks for fixing this!

...

The webrev is at

http://cr.openjdk.java.net/~sherman/4153167/webrev

You probably don't need to malloc 64 bytes for a string that's going to be less than 16 bytes. And shouldn't you use snprintf in any event?

Unlike Unix, I assume Windows has no way to have multiple "console" devices, with stdout and stderr pointing to different devices?

Is the console the only device that's FILE_TYPE_CHAR? Are there no serial port devices or other devices that are also of that type?

Can you detect the case of creating an InputStreamReader using the default encoding, wrapped around the InputStream from System.in that refers to the console? If so, it might be good to handle that case as well, although at this point I would consider that to be "extra credit"! :-)

Xueming Shen

11:16 p.m.

On 2/13/2012 11:07 AM, Bill Shannon wrote:

...

Thanks for fixing this!

...
The webrev is at

http://cr.openjdk.java.net/~sherman/4153167/webrev

You probably don't need to malloc 64 bytes for a string that's going to be less than 16 bytes. And shouldn't you use snprintf in any event?

Unlike Unix, I assume Windows has no way to have multiple "console" devices, with stdout and stderr pointing to different devices?

Is the console the only device that's FILE_TYPE_CHAR? Are there no serial port devices or other devices that are also of that type?

The Windows API doc says "LPT device or a console". The wiki's LPT section suggests it should also use an IBM-style extended ASCII character set, so I would assume it's fine to use the OEM code page if there is any LPT usage.

    LPT (Line Print Terminal or Local Print Terminal) is the original, and still common, name of the parallel port <http://en.wikipedia.org/wiki/Parallel_port> interface on IBM PC-compatible <http://en.wikipedia.org/wiki/PC_compatible> computers <http://en.wikipedia.org/wiki/Computer>. It was designed to operate a text printer <http://en.wikipedia.org/wiki/Computer_printer> that used IBM <http://en.wikipedia.org/wiki/IBM>'s 8-bit extended ASCII <http://en.wikipedia.org/wiki/Extended_ASCII> character set <http://en.wikipedia.org/wiki/Character_set>.

-Sherman

...

Can you detect the case of creating an InputStreamReader using the default encoding, wrapped around the InputStream from System.in that refers to the console? If so, it might be good to handle that case as well, although at this point I would consider that to be "extra credit"! :-)

Ulf Zibis

11:41 p.m.

...

On 2/13/2012 11:07 AM, Bill Shannon wrote:

...
Can you detect the case of creating an InputStreamReader using the default encoding, wrapped around the InputStream from System.in that refers to the console? If so, it might be good to handle that case as well, although at this point I would consider that to be "extra credit"! :-)

From a nitpicker's point of view it should, as the evaluation of bug 4153167 states:

    Will investigate the possibility of use cmd/console encoding (OEM by default on Windows) for System.in/err when the System.in/out is attached to a real "terminal" when jvm is started, in jdk8 timeframe

-Ulf

Alan Bateman

15 Feb

3 p.m.

On 13/02/2012 17:36, Xueming Shen wrote:

...

:

The webrev is at

http://cr.openjdk.java.net/~sherman/4153167/webrev

The changes look reasonable to me, and it looks like you have all the combinations of redirection covered.

I'm not sure about the sun.std*.encoding properties, as folks will find them. Probably okay for now.

Minor comments: in System.java it might be better to name the method newPrintStream. It would also be good to add a comment block to that method. In java_props_md.c I agree with Bill's comment that you don't need 64 bytes. A minor nit is that you don't need spaces on both sides of the *.

-Alan.

Xueming Shen

16 Feb

8:18 p.m.

Thanks Alan, webrev has been updated accordingly.

http://cr.openjdk.java.net/~sherman/4153167/webrev <http://cr.openjdk.java.net/%7Esherman/4153167/webrev/>

-Sherman

On 02/15/2012 07:00 AM, Alan Bateman wrote:

...


Alan Bateman

8:47 p.m.

On 16/02/2012 20:18, Xueming Shen wrote:

...

Thanks Alan, webrev has been updated accordingly.

http://cr.openjdk.java.net/~sherman/4153167/webrev <http://cr.openjdk.java.net/%7Esherman/4153167/webrev/>

-Sherman

This looks reasonable to me; it will be interesting to see if anyone notices.

-Alan.

Tom Hawtin

17 Feb

3:35 p.m.

On 16/02/2012 20:18, Xueming Shen wrote:

...

Thanks Alan, webrev has been updated accordingly.

http://cr.openjdk.java.net/~sherman/4153167/webrev <http://cr.openjdk.java.net/%7Esherman/4153167/webrev/>

Although it's in some sense safe in this case, you might get a grumble about introducing a new sprintf.

+static char* getConsoleEncoding()
+{
+    char* buf = malloc(16);
+    int cp = GetConsoleCP();
+    if (cp >= 874 && cp <= 950)
+        sprintf(buf, "ms%d", cp);
+    else
+        sprintf(buf, "cp%d", cp);
+    return buf;
+}

Tom

Alan Bateman

3:48 p.m.

On 17/02/2012 15:35, Tom Hawtin wrote:

...

Although it's in some sense safe in this case, you might get a grumble about introducing a new sprintf.

+static char* getConsoleEncoding()
+{
+    char* buf = malloc(16);
+    int cp = GetConsoleCP();
+    if (cp >= 874 && cp <= 950)
+        sprintf(buf, "ms%d", cp);
+    else
+        sprintf(buf, "cp%d", cp);
+    return buf;
+}

Tom

You're right, we should avoid sprintf. It's not an issue here, but it will be flagged by tools that do static analysis on the source.

-Alan.
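Editorial illustration of the naming rule in the quoted hunk: a minimal sketch that restates the code-page-to-name mapping in Java, purely to make the rule easy to read and test. The patch under review is C code in java_props_md.c, and this is not a replacement for it; the class and method names are hypothetical.

```java
public class ConsoleEncodingName {
    // Mirror of the mapping in the quoted getConsoleEncoding() hunk:
    // code pages 874 through 950 get an "ms" prefix, everything else "cp".
    static String name(int codePage) {
        String prefix = (codePage >= 874 && codePage <= 950) ? "ms" : "cp";
        return prefix + codePage;
    }

    public static void main(String[] args) {
        System.out.println(name(437));  // cp437
        System.out.println(name(932));  // ms932
        System.out.println(name(1252)); // cp1252
    }
}
```

The longest name this rule produces for a 32-bit code page value fits comfortably within 16 bytes, which is consistent with the reviewers' remark that the sprintf is safe here even though a bounded call is preferred.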
