<i18n dev> Codereview request for 6995537: different behavior in iso-2022-jp encoding between jdk131/140/141 and jdk142/5/6/7

Xueming Shen xueming.shen at oracle.com
Wed Feb 8 22:26:54 PST 2012


Hi

This is a long-standing "regression" from 1.3.1 in how
OutputStreamWriter.flush()/flushBuffer() handles the escape or shift
sequences of some charsets/encodings, for example ISO-2022-JP.

ISO-2022-JP is an encoding that starts in ASCII mode and then switches
between ASCII and Japanese characters through escape sequences. For
example, the escape sequence ESC $ B (0x1b 0x24 0x42) indicates that the
following bytes are Japanese (a switch from ASCII mode to Japanese mode),
and ESC ( B (0x1b 0x28 0x42) is used to switch back to ASCII.
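
To make the two escape sequences visible, here is a minimal sketch that
encodes a mixed ASCII/Japanese string in one shot. The class and method
names (Iso2022JpEscapes, hex, encode) are made up for this illustration,
and it assumes the JDK in use ships the ISO-2022-JP charset.

```java
import java.nio.charset.Charset;

public class Iso2022JpEscapes {
    /** Formats bytes as space-separated lowercase hex, e.g. "1b 24 42". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Encodes the whole string with ISO-2022-JP and returns the hex dump. */
    static String encode(String s) {
        // Assumption: the ISO-2022-JP charset is available in this runtime.
        return hex(s.getBytes(Charset.forName("ISO-2022-JP")));
    }

    public static void main(String[] args) {
        // ASCII 'A', then the Japanese character U+6700, then ASCII 'B'.
        System.out.println(encode("A\u6700B"));
    }
}
```

In the dump, ESC $ B (1b 24 42) should appear right before the two bytes
of the Japanese character, and ESC ( B (1b 28 42) right before the
trailing 'B'.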

In Java, both sun.io.CharToByteConverter (the old-generation charset
converter) and java.nio.charset.CharsetEncoder switch back and forth
between ASCII and Japanese modes based on the input character sequence.
For example, if you are in ASCII mode and the next input character is
Japanese, ESC $ B is added to the output first, followed by the converted
input character; if you are in Japanese mode and the next input is ASCII,
ESC ( B is output first to switch the mode, followed by the ASCII
character. They also switch back to ASCII mode (if the current mode is
not ASCII) when either the encoding ends or the flush() method is
invoked.
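
The role of the encoder's flush() can be observed directly with
java.nio.charset.CharsetEncoder. The sketch below is only an
illustration (EncoderFlushDemo and encodeThenFlush are hypothetical
names, and the 64-byte buffer is an arbitrary size that is large enough
for short inputs); it records how many bytes exist after encode() and
how many after flush().

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class EncoderFlushDemo {
    /** Returns { bytes produced by encode(), bytes produced after flush() }. */
    static int[] encodeThenFlush(String s) {
        // Assumption: the ISO-2022-JP charset is available in this runtime.
        CharsetEncoder enc = Charset.forName("ISO-2022-JP").newEncoder();
        ByteBuffer out = ByteBuffer.allocate(64);

        // Encode all input; 'true' tells the encoder no more input follows.
        enc.encode(CharBuffer.wrap(s), out, true);
        int afterEncode = out.position();

        // flush() lets the encoder emit any pending shift/escape sequence.
        enc.flush(out);
        int afterFlush = out.position();

        return new int[] { afterEncode, afterFlush };
    }

    public static void main(String[] args) {
        int[] japanese = encodeThenFlush("\u6700");
        System.out.println("Japanese: " + japanese[0] + " -> " + japanese[1]);
        int[] ascii = encodeThenFlush("abc");
        System.out.println("ASCII:    " + ascii[0] + " -> " + ascii[1]);
    }
}
```

For a single Japanese character the count should grow by the 3 bytes of
ESC ( B at flush(); for pure ASCII input flush() has nothing to add.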

In JDK 1.3.1, OutputStreamWriter.flushBuffer() explicitly invokes the
sun.io c2b converter's flushAny() to switch back to ASCII mode every time
flush() or flushBuffer() (from PrintStream) is invoked, as shown at the
end of this email. For example, the code below uses OutputStreamWriter
to "write" the Japanese character \u6700 to the underlying stream in
iso-2022-jp:

         ByteArrayOutputStream bos = new ByteArrayOutputStream();
         String str = "\u6700";
         OutputStreamWriter osw = new OutputStreamWriter(bos, "iso-2022-jp");
         osw.write(str, 0, str.length());

Since iso-2022-jp starts in ASCII mode and we now have a Japanese
character, we first need to switch into Japanese mode (the first 3 bytes)
and then output the encoded Japanese character (the following 2 bytes):

0x1b 0x24 0x42 0x3a 0x47

and then the code invokes

         osw.flush();

Since we are now in Japanese mode, the writer continues to write out

0x1b 0x28 0x42

to switch back to ASCII mode. The total output is 8 bytes after write()
and flush().
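
For anyone who wants to check which behavior a given JDK has, here is a
small sketch that wraps the snippet above and records the size of the
underlying stream at each step. WriterFlushDemo and observe are
hypothetical names. The count after flush() is the value that differs
between the releases discussed in this mail, so the sketch prints it
rather than assuming it.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;

public class WriterFlushDemo {
    /** Returns the underlying stream size after write(), flush() and close(). */
    static int[] observe(String s) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            OutputStreamWriter osw = new OutputStreamWriter(bos, "iso-2022-jp");

            osw.write(s, 0, s.length());
            int afterWrite = bos.size();  // may still be 0: the writer buffers

            osw.flush();
            int afterFlush = bos.size();  // release dependent, see above

            osw.close();                  // close() flushes the encoder itself
            int afterClose = bos.size();

            return new int[] { afterWrite, afterFlush, afterClose };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] sizes = observe("\u6700");
        System.out.println("after write(): " + sizes[0]);
        System.out.println("after flush(): " + sizes[1]);
        System.out.println("after close(): " + sizes[2]);
    }
}
```

A count of 8 after flush() means the escape back to ASCII was written at
flush time (the 1.3.1 behavior); a count of 5 means it only shows up
once the writer is closed.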

However, when all the encoding/charset-related code was migrated from the
sun.io-based implementation of 1.3.1 to the java.nio.charset-based
implementation of 1.4 (the migration from sun.io to java.nio.charset
happened gradually across 1.4, 1.4.1 and 1.4.2), the "c2b.flushAny()"
invocation was apparently dropped in sun.nio.cs.StreamEncoder. As a
result, the "switch back to ASCII mode" sequence is no longer output when
OutputStreamWriter.flush() or PrintStream.write(String) is invoked.

This does not cause a problem in most usage scenarios, as long as the
"stream" finally gets closed (in which case the StreamEncoder does invoke
the encoder's flush() to output the escape sequence that switches back to
ASCII) or PrintStream.println(String) is used (it outputs a \n character
and, since \n is in the ASCII range, this "accidentally" switches the
mode back to ASCII).
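
The println() case can be sketched as follows. PrintlnModeDemo and its
methods are hypothetical names used only for this illustration; the
sketch dumps the bytes that println(String) produces on an iso-2022-jp
PrintStream so that the position of ESC ( B relative to the line
separator can be inspected.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class PrintlnModeDemo {
    /** Formats bytes as space-separated lowercase hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Returns a hex dump of what println(s) writes in iso-2022-jp. */
    static String printlnBytes(String s) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            PrintStream ps = new PrintStream(bos, true, "iso-2022-jp");
            // The line separator is ASCII, so the encoder has to leave
            // Japanese mode before it can write it.
            ps.println(s);
            return hex(bos.toByteArray());
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(printlnBytes("\u6700"));
    }
}
```

The dump should contain 1b 28 42 right after the two bytes of the
Japanese character and before the platform line separator.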

But it obviously causes a problem when you cannot close the
OutputStreamWriter after you are done with your iso-2022-jp writing (for
example, when you need to continue using the underlying OutputStream for
other writing, but not through "this" osw). On 1.3.1, such applications
invoke osw.flush() to force the output to switch back to ASCII; this no
longer works after the switch to java.nio.charset in JDK 1.4.2
(iso-2022-jp was migrated to nio.charset in 1.4.2). This is what happened
in JavaMail, as described in the bug report.

The solution is to restore the "flush the encoder" mechanism in
StreamEncoder's flushBuffer().

I have been hesitant to make this change for a while, mostly because this
regressed behavior has been there for 3 releases, and the change triggers
yet another "behavior change". But given that there is no obvious
workaround and that it only changes the behavior of charsets with this
shift in/out mechanism, mainly the iso-2022 family and the IBM
EBCDIC_DBCS charsets, I decided to give it a try.

Here is the webrev:

http://cr.openjdk.java.net/~sherman/6995537/webrev

Sherman


--------------------- 1.3.1 OutputStreamWriter ---------------------
     /**
      * Flush the output buffer to the underlying byte stream, without flushing
      * the byte stream itself.  This method is non-private only so that it may
      * be invoked by PrintStream.
      */
     void flushBuffer() throws IOException {
         synchronized (lock) {
             ensureOpen();

             for (;;) {
                 try {
                     nextByte += ctb.flushAny(bb, nextByte, nBytes);
                 }
                 catch (ConversionBufferFullException x) {
                     nextByte = ctb.nextByteIndex();
                 }
                 if (nextByte == 0)
                     break;
                 if (nextByte > 0) {
                     out.write(bb, 0, nextByte);
                     nextByte = 0;
                 }
             }
         }
     }

     /**
      * Flush the stream.
      *
      * @exception  IOException  If an I/O error occurs
      */
     public void flush() throws IOException {
         synchronized (lock) {
             flushBuffer();
             out.flush();
         }
     }
