<i18n dev> Codereview request for 6995537: different behavior in iso-2022-jp encoding between jdk131/140/141 and jdk142/5/6/7

Xueming Shen xueming.shen at oracle.com
Thu Feb 9 11:12:06 PST 2012


CCed Bill Shannon.

On 02/09/2012 11:10 AM, Xueming Shen wrote:
>
> CharsetEncoder has the "flush()" method as the last step (of a series
> of "encoding" steps) to
> flush out any internal state to the output buffer. The issue here is
> that the upper-level wrapper
> class, OutputStreamWriter in our case, doesn't provide an "explicit"
> mechanism to let the
> user request a "flush" on the underlying encoder. The only
> "guaranteed" mechanism is the
> "close()" method, which is not appropriate to invoke in
> some use scenarios, such
> as the JavaMail.writeTo() case.
>
> It appears we lack a "close this stream, but not the
> underlying stream" mechanism
> in our layered/chained streams. I have had similar requests for this kind
> of mechanism in other areas,
> such as the zip/gzip streams: an app wraps an "outputstream" with zip/gzip
> and wants to release the
> zip/gzip layer after use (to release the native resource, for example)
> but keep the
> underlying stream unclosed. The only workaround now is to wrap the
> underlying stream with
> a subclass that overrides the "close()" method, which is really not
> desirable.
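A minimal sketch of the workaround described above, assuming a hypothetical wrapper class named NonClosingOutputStream (not a JDK class). Its close() only flushes, so an OutputStreamWriter layered on top can be closed, which flushes the encoder, while the real stream stays usable. The demo class and the tracking stream are illustrative only.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

/** Hypothetical helper (not part of the JDK): close() flushes but leaves the wrapped stream open. */
class NonClosingOutputStream extends FilterOutputStream {
    NonClosingOutputStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        // Pass the whole array through instead of FilterOutputStream's byte-by-byte default.
        out.write(b, off, len);
    }

    @Override
    public void close() throws IOException {
        flush(); // deliberately do NOT close the wrapped stream
    }
}

class NonClosingDemo {
    /** Writes one Japanese character through a closed writer, then keeps using the stream. */
    static byte[] writeThenContinue() {
        final boolean[] closed = { false };
        // Tracking stream so the demo can tell whether close() reached the underlying stream.
        ByteArrayOutputStream bos = new ByteArrayOutputStream() {
            @Override
            public void close() {
                closed[0] = true;
            }
        };
        try {
            Writer osw = new OutputStreamWriter(new NonClosingOutputStream(bos), "ISO-2022-JP");
            osw.write("\u6700");
            osw.close(); // closing the writer flushes the encoder, emitting ESC ( B
            if (closed[0]) {
                throw new IllegalStateException("underlying stream was closed");
            }
            bos.write('!'); // the underlying stream is still open for other writing
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(writeThenContinue().length + " bytes written");
    }
}
```

The cost is the one noted above: every caller has to remember to insert the extra wrapper layer.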
>
> The OutputStreamWriter.flush() API doc does not explicitly specify
> whether it should actually
> flush the underlying charset encoder (so I would not argue strongly
> that this IS an SE bug), but
> given that it flushes its buffer (internal state) and then the
> underlying "out" stream, it's
> reasonable to consider that the "internal state" of its encoder also
> needs to be flushed,
> especially since this was the behavior in releases earlier than 1.4.2.
>
> As I said, I have been hesitant to "fix" this problem for a
> while (one, it has been here
> for 3 major releases; two, the API does not explicitly say so), but as
> long as we don't have a
> reasonable "close-ME-only" mechanism for those layered streams, it
> appears to be a
> reasonable approach to solve the problem, without any obvious
> negative impact.
>
> -Sherman
>
> PS: There is another implementation "detail": the original
> iso-2022-jp c2b converter
> actually restores the state back to ASCII mode at the end of its
> "convert" method. This makes
> the analysis a little complicated, but should not change the issue we
> are discussing.
>
>
> On 02/09/2012 12:26 AM, Masayoshi Okutsu wrote:
>> First of all, is this really a Java SE bug? The usage of
>> OutputStreamWriter in JavaMail seems to be wrong to me. The writeTo
>> method in the bug report doesn't seem to be able to deal with any
>> stateful encodings.
>>
>> Masayoshi
>>
>> On 2/9/2012 3:26 PM, Xueming Shen wrote:
>>> Hi
>>>
>>> This is a long-standing "regression" from 1.3.1 in how
>>> OutputStreamWriter.flush()/flushBuffer()
>>> handles escape or shift sequences in some charsets/encodings,
>>> for example ISO-2022-JP.
>>>
>>> ISO-2022-JP is an encoding that starts in ASCII mode and then
>>> switches between ASCII and Japanese
>>> characters through escape sequences. For example, the escape
>>> sequence ESC $ B (0x1b 0x24 0x42)
>>> indicates that the following bytes are Japanese (switch from
>>> ASCII mode to Japanese mode), and
>>> ESC ( B (0x1b 0x28 0x42) is used to switch back to ASCII.
>>>
>>> Java's sun.io.CharToByteConverter (old-generation charset
>>> converter) and java.nio.charset.CharsetEncoder
>>> usually switch back and forth between ASCII and Japanese modes based
>>> on the input character sequence
>>> (for example, if you are in ASCII mode and your next input
>>> character is Japanese, you add
>>> ESC $ B to the output first, followed by the converted input
>>> character; or if you are in Japanese
>>> mode and your next input is ASCII, you output ESC ( B first to
>>> switch the mode and then the ASCII), and
>>> switch back to ASCII mode (if the last mode is not ASCII) if
>>> either the encoding is ending or the
>>> flush() method gets invoked.
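The mode switching described above can be observed with a small sketch, assuming the ISO-2022-JP charset is available in the running JDK. The class and method names here are illustrative, not from the webrev. Charset.encode() performs one complete encoding operation, including the final encoder flush.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

class Iso2022JpModes {
    /** Encodes text as ISO-2022-JP and returns the bytes as space-separated hex. */
    static String encodeHex(String text) {
        // Charset.encode runs a complete encoding operation, ending with the encoder's flush.
        ByteBuffer bytes = Charset.forName("ISO-2022-JP").encode(text);
        StringBuilder hex = new StringBuilder();
        while (bytes.hasRemaining()) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02x", bytes.get() & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // ASCII, then ESC $ B + Japanese, then ESC ( B + ASCII
        System.out.println(encodeHex("A\u6700B")); // 41 1b 24 42 3a 47 1b 28 42 42
        // Japanese only: the trailing ESC ( B comes from the final flush
        System.out.println(encodeHex("\u6700"));   // 1b 24 42 3a 47 1b 28 42
    }
}
```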
>>>
>>> In JDK 1.3.1, OutputStreamWriter.flushBuffer() explicitly invokes
>>> the sun.io c2b converter's flushAny() to switch
>>> back to ASCII mode every time flush() or flushBuffer() (from
>>> PrintStream) gets invoked, as
>>> shown at the end of this email. For example, the code below
>>> uses OutputStreamWriter
>>> to "write" a Japanese character \u6700 to the underlying stream with
>>> iso-2022-jp:
>>>
>>>     ByteArrayOutputStream bos = new ByteArrayOutputStream();
>>>     String str = "\u6700";
>>>     OutputStreamWriter osw = new OutputStreamWriter(bos, "iso-2022-jp");
>>>     osw.write(str, 0, str.length());
>>>
>>> Since iso-2022-jp starts in ASCII mode and we now have a
>>> Japanese character, we need to
>>> switch into Japanese mode first (the first 3 bytes) and then output the
>>> encoded Japanese
>>> character (the following 2 bytes):
>>>
>>> 0x1b 0x24 0x42 0x3a 0x47
>>>
>>> and then the code invokes
>>>
>>>     osw.flush();
>>>
>>> Since we are now in Japanese mode, the writer continues to write out
>>>
>>> 0x1b 0x28 0x42
>>>
>>> to switch back to ASCII mode. The total output is 8 bytes after 
>>> write() and flush().
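The 5 + 3 byte split above can be reproduced at the CharsetEncoder level, independent of what OutputStreamWriter.flush() does in a given release. This is a sketch under the assumption that ISO-2022-JP is available. The class name and helper are illustrative.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

class EncoderFlushDemo {
    /** Returns {bytes produced by encode(), total bytes after flush()} for the given text. */
    static int[] byteCounts(String text) {
        CharsetEncoder enc = Charset.forName("ISO-2022-JP").newEncoder();
        ByteBuffer out = ByteBuffer.allocate(64); // ample for the short inputs used here
        // Step 1: encode the characters; this emits ESC $ B plus the 2-byte character.
        enc.encode(CharBuffer.wrap(text), out, true);
        int afterEncode = out.position();
        // Step 2: flush the encoder; this emits ESC ( B if it is not in ASCII mode.
        enc.flush(out);
        return new int[] { afterEncode, out.position() };
    }

    public static void main(String[] args) {
        int[] n = byteCounts("\u6700");
        System.out.println(n[0] + " bytes after encode, " + n[1] + " after flush"); // 5 and 8
    }
}
```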
>>>
>>> However, when all encoding/charset-related code was migrated from
>>> 1.3.1's sun.io-based to
>>> 1.4's java.nio.charset-based implementation (over 1.4, 1.4.1 and 1.4.2,
>>> we gradually migrated from
>>> sun.io to java.nio.charset), the "c2b.flushAny()" invocation
>>> was obviously dropped in
>>> sun.nio.cs.StreamEncoder. As a result, the "switch back to
>>> ASCII mode" sequence is no longer
>>> output when OutputStreamWriter.flush() or PrintStream.write(String)
>>> is invoked.
>>>
>>> This does not cause problems in most use scenarios, where the "stream"
>>> finally gets closed
>>> (in which case the StreamEncoder does invoke the encoder's flush() to output
>>> the escape sequence
>>> to switch back to ASCII) or PrintStream.println(String) is used (in
>>> which case it outputs a \n character;
>>> since this \n is in the ASCII range, it "accidentally" switches the
>>> mode back to ASCII).
>>>
>>> But it obviously causes problems when you can't close the
>>> OutputStreamWriter after
>>> you're done with your iso-2022-jp writing (for example, you need to continue
>>> to use the underlying
>>> OutputStream for other writing, but not "this" osw). On 1.3.1,
>>> these apps invoke osw.flush()
>>> to force the output to switch back to ASCII; this no longer works since
>>> we switched to java.nio.charset
>>> in JDK 1.4.2 (we migrated iso-2022-jp to nio.charset in 1.4.2). This
>>> is what happened in JavaMail,
>>> as described in the bug report.
>>>
>>> The solution is to restore the "flush the encoder" mechanism in
>>> StreamEncoder's flushBuffer().
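To illustrate the idea only (this is not the webrev code, and StreamEncoder's internals differ), here is a sketch of a hypothetical Writer whose flush() also ends the current encoding operation and flushes the encoder, so the return-to-ASCII sequence is written without closing anything. The class names are assumptions. As a simplification, a surrogate pair split across two write() calls is not handled.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

/** Hypothetical writer (not JDK code): flush() also flushes the encoder's shift state. */
class EncoderFlushingWriter extends Writer {
    private final OutputStream out;
    private final CharsetEncoder encoder;

    EncoderFlushingWriter(OutputStream out, String charsetName) {
        this.out = out;
        this.encoder = Charset.forName(charsetName).newEncoder();
    }

    @Override
    public void write(char[] cbuf, int off, int len) throws IOException {
        CharBuffer in = CharBuffer.wrap(cbuf, off, len);
        int capacity = (int) Math.ceil(len * encoder.maxBytesPerChar()) + 16;
        ByteBuffer bb = ByteBuffer.allocate(capacity);
        CoderResult cr = encoder.encode(in, bb, false); // more input may follow
        if (cr.isError()) {
            cr.throwException();
        }
        if (in.hasRemaining()) {
            // Simplification: leftover input (e.g. a dangling high surrogate) is rejected.
            throw new IOException("unencoded input left over");
        }
        out.write(bb.array(), 0, bb.position());
    }

    @Override
    public void flush() throws IOException {
        ByteBuffer bb = ByteBuffer.allocate(16);
        // End the current encoding operation, then let the encoder emit any
        // bytes needed to return to its initial (ASCII) state.
        encoder.encode(CharBuffer.allocate(0), bb, true);
        encoder.flush(bb);
        out.write(bb.array(), 0, bb.position());
        encoder.reset(); // start a fresh encoding operation on the next write()
        out.flush();
    }

    @Override
    public void close() throws IOException {
        flush();
        out.close();
    }
}

class EncoderFlushingDemo {
    /** Returns the stream size after the first flush and after a later ASCII write. */
    static int[] sizes() {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            Writer w = new EncoderFlushingWriter(bos, "ISO-2022-JP");
            w.write("\u6700");
            w.flush();
            int afterFirstFlush = bos.size(); // ESC $ B, 2-byte char, ESC ( B
            w.write("A");
            w.flush();
            return new int[] { afterFirstFlush, bos.size() };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] s = sizes();
        System.out.println(s[0] + " bytes, then " + s[1] + " bytes");
    }
}
```

The trade-off is the one discussed in this thread: every flush() in a non-ASCII mode now emits an escape sequence, which is a visible behavior change for stateful charsets.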
>>>
>>> I have been hesitant to make this change for a while, mostly
>>> because this regressed behavior
>>> has been there for 3 releases, and the change triggers yet another
>>> "behavior change". But given
>>> there is no obvious workaround and it only changes the behavior of
>>> the charsets with this
>>> shift in/out mechanism, mainly the iso-2022 family and those IBM
>>> EBCDIC_DBCS charsets, I
>>> decided to give it a try.
>>>
>>> Here is the webrev
>>>
>>> http://cr.openjdk.java.net/~sherman/6995537/webrev
>>>
>>> Sherman
>>>
>>>
>>> ---------------------------------1.3.1 OutputStreamWriter-----------------------
>>>     /**
>>>      * Flush the output buffer to the underlying byte stream, without flushing
>>>      * the byte stream itself.  This method is non-private only so that it may
>>>      * be invoked by PrintStream.
>>>      */
>>>     void flushBuffer() throws IOException {
>>>         synchronized (lock) {
>>>             ensureOpen();
>>>
>>>             for (;;) {
>>>                 try {
>>>                     nextByte += ctb.flushAny(bb, nextByte, nBytes);
>>>                 }
>>>                 catch (ConversionBufferFullException x) {
>>>                     nextByte = ctb.nextByteIndex();
>>>                 }
>>>                 if (nextByte == 0)
>>>                     break;
>>>                 if (nextByte > 0) {
>>>                     out.write(bb, 0, nextByte);
>>>                     nextByte = 0;
>>>                 }
>>>             }
>>>         }
>>>     }
>>>
>>>     /**
>>>      * Flush the stream.
>>>      *
>>>      * @exception  IOException  If an I/O error occurs
>>>      */
>>>     public void flush() throws IOException {
>>>         synchronized (lock) {
>>>             flushBuffer();
>>>             out.flush();
>>>         }
>>>     }
>>>
>>>
>>>
>>>
>


