<i18n dev> Codereview request for 4153167: separate between ANSI and OEM code pages on Windows

Mon Feb 13 10:35:27 PST 2012

On 2/13/2012 10:15 AM, Ulf Zibis wrote:
> Interesting issue, especially for us germans!
>
> What is about System.in, if one types some umlaute at Windows console?

System.in is an InputStream, so no charset is involved there; you build
your own Reader on top of it yourself.
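
As a minimal sketch of "building your own reader": the class and helper
names below are made up for illustration, and IBM437 is only an assumed
console code page (check yours with chcp). The charset passed to the
InputStreamReader decides how the raw bytes from System.in are decoded.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

public class StdinReaderSketch {

    // Hypothetical helper: decode the first line of a raw byte stream
    // with an explicitly chosen charset.
    static String readLine(InputStream in, Charset cs) {
        try {
            return new BufferedReader(new InputStreamReader(in, cs)).readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumption: the console's active code page is 437.
        Charset oem = Charset.forName("IBM437");
        System.out.println("read: " + readLine(System.in, oem));
    }
}
```

If the charset does not match what the console actually sends, typed
umlauts will decode to the wrong characters.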

>
> Why are there theoretically different code pages for stdout and stderr?

You can redirect stderr to a log file but keep stdout on the console,
or redirect stdout but keep stderr on the console. In those scenarios
stderr and stdout will use different code pages. Basically the approach
is: if the output stream gets redirected, it keeps using the default
charset (on the assumption that the rest of the world is using the
Windows codepage); if not, it uses the OEM codepage of the console on
Windows, to make sure System.out/err outputs bytes that the underlying
console can understand.
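
The rule above can be sketched in plain Java. This is not the code in
the webrev; the class and method names are made up, IBM437 is an assumed
OEM code page, and System.console() != null is only a rough stand-in
for "not redirected" (it is not a per-stream check, and its behaviour
has varied across JDK releases).

```java
import java.nio.charset.Charset;

public class OutputCharsetSketch {

    // Hypothetical helper: a stream attached to the console gets the OEM
    // charset, a redirected stream keeps the default charset.
    static Charset choose(boolean attachedToConsole, Charset oem, Charset dflt) {
        return attachedToConsole ? oem : dflt;
    }

    public static void main(String[] args) {
        Charset dflt = Charset.defaultCharset();
        // Assumption: the console uses code page 437.
        Charset oem = Charset.forName("IBM437");
        // Rough approximation of "std out/err is not redirected".
        boolean onConsole = System.console() != null;
        System.out.println("chosen charset: " + choose(onConsole, oem, dflt));
    }
}
```

Because the decision is made per stream, stdout and stderr can end up
with different charsets when only one of them is redirected.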

-Sherman


>
> -Ulf
>
>
> Am 13.02.2012 18:36, schrieb Xueming Shen:
>> Hi
>>
>> This is a long-standing Windows codepage support issue on the Java
>> platform (we probably have 20 bugs/RFEs filed for this particular
>> issue and closed as duplicates of 4153167). Windows supports two sets
>> of codepages, the ANSI (Windows) codepage and the OEM (IBM) codepage.
>> Windows uses the ANSI/Windows codepage almost "everywhere" except in
>> its dos/command prompt window, which uses the OEM codepage. For
>> example, on a normal English Windows, the default Windows codepage is
>> Cp1252 <http://msdn.microsoft.com/en-us/goglobal/cc305145> (West
>> European Latin), while the OEM codepage used in its dos/command prompt
>> is Cp437 <http://msdn.microsoft.com/en-us/goglobal/cc305156> (you can
>> use the chcp command to check/change the "active" codepage used in
>> your dos/command prompt). These two obviously have different mappings
>> for certain code points, for example those umlaut characters.
>>
>> The J2SE runtime chooses the ANSI/Windows codepage as its default
>> charset for its i/o character reading/writing, graphic text display,
>> etc., including System.out & err. This causes problems when the ANSI
>> codepage and the OEM codepage are not "compatible" and you happen to
>> need to write those "incompatible" characters to the dos/command
>> prompt, as shown in the following test case:
>>
>>         String umlaut = "\u00f6\u00e4\u00fc\u00d6\u00c4\u00dc\u00df";
>>         PrintWriter ps = new PrintWriter(
>>             new OutputStreamWriter(System.out, "Cp437"), true);
>>         ps.println("ps.cp437: " + umlaut);
>>         System.out.println("sys.out : " + umlaut);
>>         System.err.println("sys.err : " + umlaut);
>>
>> You will see the umlauts displayed correctly from the PrintWriter
>> with the explicit Cp437 encoding setting, but garbled from System.out
>> and err (because both System.out & err use the default charset
>> Cp1252, which is also used for all necessary Unicode <-> Windows
>> encoding conversion for that particular vm instance).
>>
>> For years, we have been debating whether and how we should fix this
>> issue, and whether we want to have two "default charsets" for i/o. In
>> jdk6, we provided a java.io.Console class that specifically uses the
>> OEM codepage when running on Windows' dos/command prompt. However,
>> the feedback is that people still want System.out/err to work
>> correctly with the dos/command prompt when the OEM codepage used is
>> not "compatible" with the default Windows codepage.
>>
>> The proposed change here is to use the OEM codepage for
>> System.out/err when the vm is started without its std out/err being
>> redirected to something else, such as a file (to make sure OEM is only
>> used for the dos/command prompt). If the vm's std out/err is
>> redirected, then we continue to use the default charset
>> (file.encoding) for System.out/err. I believe this approach solves
>> the problem without breaking any existing assumption/use scenario.
>>
>> The webrev is at
>>
>> http://cr.openjdk.java.net/~sherman/4153167/webrev
>>
>> Here is a simple "manual" test case.
>>
>> import java.io.Console;
>> import java.io.OutputStreamWriter;
>> import java.io.PrintWriter;
>>
>> public class HelloWorld {
>>
>>     public static void main(String[] args) throws Exception {
>>
>>         String umlaut = "\u00f6\u00e4\u00fc\u00d6\u00c4\u00dc\u00df";
>>
>>         System.out.println("file.encoding ="
>>             + System.getProperty("file.encoding"));
>>         System.out.println("stdout.encoding="
>>             + System.getProperty("sun.stdout.encoding"));
>>         System.out.println("stderr.encoding="
>>             + System.getProperty("sun.stderr.encoding"));
>>         System.out.println("-----------------------");
>>
>>         PrintWriter ps = new PrintWriter(
>>             new OutputStreamWriter(System.out, "Cp437"), true);
>>         ps.println("ps.cp437: " + umlaut);
>>         System.out.println("sys.out : " + umlaut);
>>         System.err.println("sys.err : " + umlaut);
>>         Console con = System.console();
>>         if (con != null)
>>             con.printf("console : %s%n", umlaut);
>>     }
>> }
>>
>> -Sherman
>>
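
To make the Cp1252 vs. Cp437 mapping difference from the quoted message
concrete, here is a small sketch that prints the encoded bytes for a few
umlauts. The class and helper names are made up for illustration; it
assumes both charsets are available in the running JDK.

```java
import java.nio.charset.Charset;

public class CodePageBytesSketch {

    // Encode a string and render the bytes as space-separated hex.
    static String hex(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Charset ansi = Charset.forName("windows-1252");
        Charset oem = Charset.forName("IBM437");
        String umlaut = "\u00f6\u00e4\u00fc";
        System.out.println("windows-1252: " + hex(umlaut, ansi));
        System.out.println("IBM437      : " + hex(umlaut, oem));
        // Decode the Cp1252 bytes as Cp437, which is roughly what a
        // Cp437 console does with bytes written via the default charset.
        System.out.println("misdecoded  : " + new String(umlaut.getBytes(ansi), oem));
    }
}
```

The same character gets a different byte value in each code page, which
is why bytes written with Cp1252 show up garbled on a Cp437 console.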