<i18n dev> Codereview request for 6995537: different behavior in iso-2022-jp encoding between jdk131/140/141 and jdk142/5/6/7

Thu Feb 9 00:26:21 PST 2012

First of all, is this really a Java SE bug? The usage of 
OutputStreamWriter in JavaMail seems to be wrong to me. The writeTo 
method in the bug report doesn't seem to be able to deal with any 
stateful encodings.

Masayoshi

On 2/9/2012 3:26 PM, Xueming Shen wrote:
> Hi
>
> This is a long-standing "regression" from 1.3.1 in how
> OutputStreamWriter.flush()/flushBuffer()
> handle escape or shift sequences in some charsets/encodings, for
> example ISO-2022-JP.
>
> ISO-2022-JP is an encoding that starts in ASCII mode and then switches
> between ASCII and Japanese
> characters through escape sequences. For example, the escape
> sequence ESC $ B (0x1b 0x24 0x42)
> indicates that the following bytes are Japanese (a switch from
> ASCII mode to Japanese mode), and
> ESC ( B (0x1b 0x28 0x42) is used to switch back to ASCII.
>
> In Java, the sun.io.CharToByteConverter (old-generation charset converter)
> and the java.nio.charset.CharsetEncoder
> usually switch back and forth between ASCII and Japanese modes based on
> the input character sequence. For example, if you are in ASCII mode and
> your next input character is Japanese, you add
> ESC $ B to the output first, followed by the converted input
> character; if you are in Japanese
> mode and your next input is ASCII, you output ESC ( B first to switch
> the mode, and then the ASCII character. The encoder also
> switches back to ASCII mode (if the last mode is not ASCII) when either
> the encoding is ending or the
> flush() method gets invoked.
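To make the mode switching concrete, here is a small sketch (not part of the proposed change) that encodes a mixed ASCII/Japanese string with ISO-2022-JP and prints the resulting bytes in hex. The class name Iso2022JpModeSwitchDemo and the helper encodeToHex are made up for illustration, and the sketch assumes the JDK in use ships the ISO-2022-JP charset.

```java
import java.nio.charset.Charset;

class Iso2022JpModeSwitchDemo {
    // Hypothetical helper: encode the text as ISO-2022-JP and return
    // the bytes as space-separated two-digit hex values.
    static String encodeToHex(String text) {
        // Assumption: the ISO-2022-JP charset is available in this JDK.
        byte[] bytes = text.getBytes(Charset.forName("ISO-2022-JP"));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // ASCII 'A', then Japanese \u6700, then ASCII 'B'.
        System.out.println(encodeToHex("A\u6700B"));
    }
}
```

For "A\u6700B" the output should be 41, then ESC $ B (1b 24 42), the two bytes 3a 47 for \u6700, then ESC ( B (1b 28 42), and finally 42.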
>
> In JDK 1.3.1, OutputStreamWriter.flushBuffer() explicitly invokes
> sun.io.c2b's flushAny() to switch
> back to ASCII mode every time flush() or flushBuffer() (from
> PrintStream) gets invoked, as
> shown at the end of this email. For example, the
> code below uses OutputStreamWriter
> to "write" a Japanese character \u6700 to the underlying stream with
> iso-2022-jp:
>
>     ByteArrayOutputStream bos = new ByteArrayOutputStream();
>     String str = "\u6700";
>     OutputStreamWriter osw = new OutputStreamWriter(bos, "iso-2022-jp");
>     osw.write(str, 0, str.length());
>
> Since iso-2022-jp starts in ASCII mode and we now have a Japanese
> character, we need to
> switch into Japanese mode first (the first 3 bytes) and then output the
> encoded Japanese
> character (the following 2 bytes):
>
> 0x1b 0x24 0x42 0x3a 0x47
>
> and then the code invokes
>
>         osw.flush();
>
> Since we are now in Japanese mode, the writer continues to write out
>
>  0x1b 0x28 0x42
>
> to switch back to ASCII mode. The total output is 8 bytes after
> write() and flush().
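The 5 + 3 byte split described above can be reproduced with a short sketch that drives a CharsetEncoder directly. It does not go through OutputStreamWriter, since the writer's flush() behavior is exactly what differs between releases. The class name Iso2022JpFlushDemo and its helper methods are illustrative only, the CoderResult values are ignored for brevity, and the sketch assumes the ISO-2022-JP charset is available in the JDK in use.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

class Iso2022JpFlushDemo {
    // Hypothetical helper: format bytes [from, to) of the buffer as hex.
    static String hex(ByteBuffer bb, int from, int to) {
        StringBuilder sb = new StringBuilder();
        for (int i = from; i < to; i++) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02x", bb.get(i) & 0xff));
        }
        return sb.toString();
    }

    // Returns { bytes written by encode(), bytes added by flush() }.
    static String[] encodeThenFlush(String text) {
        // Assumption: the ISO-2022-JP charset is available in this JDK.
        CharsetEncoder enc = Charset.forName("ISO-2022-JP").newEncoder();
        ByteBuffer out = ByteBuffer.allocate(64);

        // Step 1: encode the characters; the encoder stays in whatever
        // mode the last character required.
        enc.encode(CharBuffer.wrap(text), out, true);
        int afterEncode = out.position();

        // Step 2: flush the encoder so it can emit any pending escape
        // sequence that returns the stream to ASCII mode.
        enc.flush(out);
        int afterFlush = out.position();

        return new String[] {
            hex(out, 0, afterEncode),
            hex(out, afterEncode, afterFlush)
        };
    }

    public static void main(String[] args) {
        String[] parts = encodeThenFlush("\u6700");
        System.out.println("encode: " + parts[0]);
        System.out.println("flush:  " + parts[1]);
    }
}
```

The "encode" line should match the five bytes quoted above, and the "flush" line the three-byte ESC ( B sequence.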
>
> However, when all encoding/charset-related code was migrated from
> 1.3.1's sun.io-based implementation to
> 1.4's java.nio.charset-based implementation (over 1.4, 1.4.1 and 1.4.2, we
> gradually migrated from
> sun.io to java.nio.charset), the "c2b.flushAny()" invocation
> was obviously dropped in
> sun.nio.cs.StreamEncoder. As a result, the "switch back to ASCII
> mode" sequence is no longer
> output when OutputStreamWriter.flush() or PrintStream.write(String) is
> invoked.
>
> This does not cause a problem in most usage scenarios, as long as the "stream"
> eventually gets closed
> (in which case the StreamEncoder does invoke the encoder's flush() to output
> the escape sequence
> that switches back to ASCII) or PrintStream.println(String) is used (which
> outputs a \n character;
> since \n is in the ASCII range, it "accidentally" switches the mode
> back to ASCII).
>
> But it obviously causes a problem when you cannot close the
> OutputStreamWriter after
> you're done with your iso-2022-jp writing (for example, you need to continue
> using the underlying
> OutputStream for other writing, but not "this" osw). On 1.3.1, these
> apps invoke osw.flush()
> to force the output to switch back to ASCII; this no longer works since we
> switched to java.nio.charset
> in JDK 1.4.2 (we migrated iso-2022-jp to nio.charset in 1.4.2). This
> is what happened in JavaMail,
> as described in the bug report.
>
> The solution is to restore the "flush the encoder" mechanism in
> StreamEncoder's flushBuffer().
>
> I have been hesitant to make this change for a while, mostly because
> this regressed behavior
> has been there for 3 releases, and the change triggers yet another
> "behavior change". But given that
> there is no obvious workaround and that it only changes the behavior of the
> charsets with this
> shift in/out mechanism, mainly the iso-2022 family and those IBM
> EBCDIC_DBCS charsets, I
> decided to give it a try.
>
> Here is the webrev:
>
> http://cr.openjdk.java.net/~sherman/6995537/webrev
>
> Sherman
>
>
> ------------------------ 1.3.1 OutputStreamWriter ------------------------
>     /**
>      * Flush the output buffer to the underlying byte stream, without
>      * flushing the byte stream itself.  This method is non-private only
>      * so that it may be invoked by PrintStream.
>      */
>     void flushBuffer() throws IOException {
>         synchronized (lock) {
>             ensureOpen();
>
>             for (;;) {
>                 try {
>                     nextByte += ctb.flushAny(bb, nextByte, nBytes);
>                 }
>                 catch (ConversionBufferFullException x) {
>                     nextByte = ctb.nextByteIndex();
>                 }
>                 if (nextByte == 0)
>                     break;
>                 if (nextByte > 0) {
>                     out.write(bb, 0, nextByte);
>                     nextByte = 0;
>                 }
>             }
>         }
>     }
>
>     /**
>      * Flush the stream.
>      *
>      * @exception  IOException  If an I/O error occurs
>      */
>     public void flush() throws IOException {
>         synchronized (lock) {
>             flushBuffer();
>             out.flush();
>         }
>     }
>
>
>
>