<i18n dev> Codereview request for 6995537: different behavior in iso-2022-jp encoding between jdk131/140/141 and jdk142/5/6/7

Fri Feb 10 10:55:19 PST 2012

If you flush the stream while in the middle of writing a "character",
I would expect the results to be undefined, just as if you closed the
stream at that point.  But at the end of a consistent set of data, I
would expect flush to behave like close, but without the actual closing
of the stream.

Masayoshi Okutsu wrote on 02/10/12 00:31:
> I tend to agree with Sherman that the real problem is the OutputStreamWriter API
> which isn't good enough to handle various encodings. My understanding is that
> the charset API was introduced in 1.4 to deal with the limitations of the
> java.io and other encoding handling issues.
>
> I still don't think it's correct to change the flush semantics. The flush method
> should just flush out any outstanding data to the given output stream as defined
> in java.io.Writer. What if Writer.write(int) is used to write UTF-16 data with any
> stateful encoding? Suppose the stateful encoding can support supplementary
> characters which require other G0 designations, does the following calling
> sequence work?
>
> writer.write(highSurrogate);
> writer.flush();
> writer.write(lowSurrogate);
> writer.flush();
>
> Of course, this isn't a problem with iso-2022-jp, though.
>
> I think it's a correct fix, not a workaround, to create a filter stream to deal
> with stateful encodings with the java.io API. If it's OK to support only 1.4 and
> later, the java.nio.charset API should be used.
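
A rough sketch of the java.nio.charset route mentioned above, for the
1.4-and-later case. The class and method names (EncodeWithoutClosing,
writeEncoded, demo) are made up for illustration; only Charset,
CharsetEncoder, CharBuffer and ByteBuffer are real API.
CharsetEncoder.encode(CharBuffer) performs a complete encoding operation,
including the final flush of the encoder, so the text is written as one
consistent unit without closing anything:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

// Hypothetical helper, not part of the JDK or JavaMail.
final class EncodeWithoutClosing {

    // Encodes text as one complete unit and writes the bytes to out.
    // The stream is neither flushed nor closed here.
    static void writeEncoded(OutputStream out, String text, Charset cs)
            throws IOException {
        // A fresh encoder starts in its initial state; encode(CharBuffer)
        // runs the whole encode-then-flush sequence and throws
        // CharacterCodingException on malformed or unmappable input.
        ByteBuffer bytes = cs.newEncoder().encode(CharBuffer.wrap(text));
        out.write(bytes.array(), bytes.arrayOffset() + bytes.position(),
                  bytes.remaining());
    }

    // Writes one Japanese character, then keeps using the same stream.
    static byte[] demo() throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        writeEncoded(bos, "\u6700", Charset.forName("ISO-2022-JP"));
        bos.write('!');   // plain ASCII byte written after the encoded text
        return bos.toByteArray();
    }
}
```

This assumes the runtime includes the ISO-2022-JP charset. Each call uses a
new encoder, so it only suits text that can be encoded in self-contained
pieces.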
>
> Thanks,
> Masayoshi
>
> On 2/10/2012 4:12 AM, Xueming Shen wrote:
>> CCed Bill Shannon.
>>
>> On 02/09/2012 11:10 AM, Xueming Shen wrote:
>>>
>>> CharsetEncoder has a flush() method as the last step (of a series of
>>> "encoding" steps) to flush any internal state out to the output buffer.
>>> The issue here is that the upper-level wrapper class, OutputStreamWriter
>>> in our case, doesn't provide an explicit mechanism to let the user request
>>> a flush of the underlying encoder. The only guaranteed mechanism is the
>>> close() method, which is not appropriate to invoke in some use scenarios,
>>> such as the JavaMail writeTo() case.
>>>
>>> It appears we lack a "close this stream, but not the underlying stream"
>>> mechanism in our layered/chained streams. I have had similar requests for
>>> this kind of mechanism in other areas, such as the zip/gzip streams: an app
>>> wraps an output stream with zip/gzip and wants to release the zip/gzip
>>> layer after use (to release the native resources, for example) while
>>> keeping the underlying stream open. The only workaround now is to wrap the
>>> underlying stream with a subclass that overrides the close() method, which
>>> is really not desirable.
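
For reference, the workaround described here might look roughly like the
following. NonClosingOutputStream is a made-up name, not a JDK class; the
sketch only assumes the standard java.io.FilterOutputStream.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical wrapper illustrating the workaround; not a JDK class.
final class NonClosingOutputStream extends FilterOutputStream {

    NonClosingOutputStream(OutputStream out) {
        super(out);
    }

    // Pass bulk writes straight through; FilterOutputStream's default
    // implementation would write one byte at a time.
    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len);
    }

    // Closing this layer only flushes; the wrapped stream stays open.
    @Override
    public void close() throws IOException {
        flush();
    }
}
```

An OutputStreamWriter created over new NonClosingOutputStream(out) can then
be closed once the iso-2022-jp part is done, which lets the encoder emit its
final escape sequence, while out itself remains usable for further writing.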
>>>
>>> The API doc of OutputStreamWriter.flush() does not explicitly specify
>>> whether it should actually flush the underlying charset encoder (so I would
>>> not argue strongly that this IS an SE bug), but given that it flushes its
>>> buffer (internal state) and then the underlying "out" stream, it's
>>> reasonable to consider that the internal state of its encoder also needs
>>> to be flushed, especially since this was the behavior in releases earlier
>>> than 1.4.2.
>>>
>>> As I said, I have been hesitant to "fix" this problem for a while (one, it
>>> has been here for 3 major releases; two, the API does not explicitly say
>>> so), but as long as we don't have a reasonable "close-ME-only" mechanism
>>> for those layered streams, it appears to be a reasonable approach to solve
>>> the problem, without any obvious negative impact.
>>>
>>> -Sherman
>>>
>>> PS: There is another implementation "detail": the original iso-2022-jp
>>> c2b converter actually restores the state back to ASCII mode at the end of
>>> its "convert" method. This makes the analysis a little complicated, but
>>> should not change the issue we are discussing.
>>>
>>>
>>> On 02/09/2012 12:26 AM, Masayoshi Okutsu wrote:
>>>> First of all, is this really a Java SE bug? The usage of OutputStreamWriter
>>>> in JavaMail seems to be wrong to me. The writeTo method in the bug report
>>>> doesn't seem to be able to deal with any stateful encodings.
>>>>
>>>> Masayoshi
>>>>
>>>> On 2/9/2012 3:26 PM, Xueming Shen wrote:
>>>>> Hi
>>>>>
>>>>> This is a long-standing "regression" from 1.3.1 in how
>>>>> OutputStreamWriter.flush()/flushBuffer() handles the escape or shift
>>>>> sequences of some charsets/encodings, for example ISO-2022-JP.
>>>>>
>>>>> ISO-2022-JP is an encoding that starts in ASCII mode and then switches
>>>>> between ASCII and Japanese characters through escape sequences. For
>>>>> example, the escape sequence ESC $ B (0x1b 0x24 0x42) indicates that the
>>>>> following bytes are Japanese (switch from ASCII mode to Japanese mode),
>>>>> and ESC ( B (0x1b 0x28 0x42) is used to switch back to ASCII.
>>>>>
>>>>> Java's sun.io.CharToByteConverter (the old-generation charset converter)
>>>>> and java.nio.charset.CharsetEncoder usually switch back and forth between
>>>>> ASCII and Japanese modes based on the input character sequence. For
>>>>> example, if you are in ASCII mode and your next input character is
>>>>> Japanese, you add ESC $ B to the output first, followed by the converted
>>>>> input character; if you are in Japanese mode and your next input is
>>>>> ASCII, you output ESC ( B first to switch the mode and then the ASCII
>>>>> character. They switch back to ASCII mode (if the last mode is not ASCII)
>>>>> when either the encoding is ending or the flush() method gets invoked.
>>>>>
>>>>> In JDK 1.3.1, OutputStreamWriter.flushBuffer() explicitly invokes the
>>>>> sun.io c2b converter's flushAny() to switch back to ASCII mode every time
>>>>> flush() or flushBuffer() (from PrintStream) gets invoked, as shown at the
>>>>> end of this email. For example, the code below uses OutputStreamWriter to
>>>>> "write" a Japanese character \u6700 to the underlying stream with
>>>>> iso-2022-jp:
>>>>>
>>>>> ByteArrayOutputStream bos = new ByteArrayOutputStream();
>>>>> String str = "\u6700";
>>>>> OutputStreamWriter osw = new OutputStreamWriter(bos, "iso-2022-jp");
>>>>> osw.write(str, 0, str.length());
>>>>>
>>>>> Since iso-2022-jp starts in ASCII mode and we now have a Japanese
>>>>> character, we need to switch into Japanese mode first (the first 3 bytes)
>>>>> and then output the encoded Japanese character (the following 2 bytes):
>>>>>
>>>>> 0x1b 0x24 0x42 0x3a 0x47
>>>>>
>>>>> and then the code invokes
>>>>>
>>>>> osw.flush();
>>>>>
>>>>> since we are now in Japanese mode, the writer continues to write out
>>>>>
>>>>> 0x1b 0x28 0x42
>>>>>
>>>>> to switch back to ASCII mode. The total output is 8 bytes after write() and
>>>>> flush().
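
The 5 + 3 byte split can be observed at the encoder level, independently of
what OutputStreamWriter.flush() does in a given release. A small sketch using
java.nio.charset.CharsetEncoder directly (the class name Iso2022JpByteCount
is made up, and the runtime is assumed to include the ISO-2022-JP charset):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Illustrative class; the name is made up.
final class Iso2022JpByteCount {

    // Returns {bytes produced by encoding, bytes added by flush()}.
    static int[] count() {
        CharsetEncoder enc = Charset.forName("ISO-2022-JP").newEncoder();
        ByteBuffer out = ByteBuffer.allocate(16);

        // Encode the single character; 'true' marks the end of input.
        enc.encode(CharBuffer.wrap("\u6700"), out, true);
        int encoded = out.position();

        // Flush the encoder's internal state (the current shift mode).
        enc.flush(out);
        int flushed = out.position() - encoded;

        return new int[] { encoded, flushed };
    }

    public static void main(String[] args) {
        int[] n = count();
        System.out.println(n[0] + " + " + n[1] + " bytes");
    }
}
```

If the encoder behaves as described in this thread, the first count covers
ESC $ B plus the two-byte character, and the second covers ESC ( B.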
>>>>>
>>>>> However, when all encoding/charset-related code was migrated from 1.3.1's
>>>>> sun.io-based implementation to 1.4's java.nio.charset-based one (we
>>>>> migrated gradually over 1.4, 1.4.1 and 1.4.2), the "c2b.flushAny()"
>>>>> invocation was evidently dropped in sun.nio.cs.StreamEncoder. As a result,
>>>>> the "switch back to ASCII mode" sequence is no longer output when
>>>>> OutputStreamWriter.flush() or PrintStream.write(String) is invoked.
>>>>>
>>>>> This does not cause a problem in most use scenarios, provided the
>>>>> "stream" finally gets closed (in which case the StreamEncoder does invoke
>>>>> the encoder's flush() to output the escape sequence that switches back to
>>>>> ASCII) or PrintStream.println(String) is used (it outputs a \n character,
>>>>> and since \n is in the ASCII range, it "accidentally" switches the mode
>>>>> back to ASCII).
>>>>>
>>>>> But it obviously causes a problem when you can't close the
>>>>> OutputStreamWriter after you're done with your iso-2022-jp writing (for
>>>>> example, you need to continue using the underlying OutputStream for other
>>>>> writing, but not "this" osw). With 1.3.1, these apps invoke osw.flush() to
>>>>> force the output to switch back to ASCII; this no longer works since
>>>>> iso-2022-jp was migrated to java.nio.charset in jdk1.4.2. This is what
>>>>> happened in JavaMail, as described in the bug report.
>>>>>
>>>>> The solution is to restore the "flush the encoder" mechanism in
>>>>> StreamEncoder's flushBuffer().
>>>>>
>>>>> I have been hesitant to make this change for a while, mostly because this
>>>>> regressed behavior has been there for 3 releases, and the change triggers
>>>>> yet another "behavior change". But given that there is no obvious
>>>>> workaround and it only changes the behavior of the charsets with this
>>>>> shift in/out mechanism, mainly the iso-2022 family and those IBM
>>>>> EBCDIC_DBCS charsets, I decided to give it a try.
>>>>>
>>>>> Here is the webreview
>>>>>
>>>>> http://cr.openjdk.java.net/~sherman/6995537/webrev
>>>>>
>>>>> Sherman
>>>>>
>>>>>
>>>>> ---------------------------------1.3.1 OutputStreamWriter-----------------------
>>>>>     /**
>>>>>      * Flush the output buffer to the underlying byte stream, without flushing
>>>>>      * the byte stream itself. This method is non-private only so that it may
>>>>>      * be invoked by PrintStream.
>>>>>      */
>>>>>     void flushBuffer() throws IOException {
>>>>>         synchronized (lock) {
>>>>>             ensureOpen();
>>>>>
>>>>>             for (;;) {
>>>>>                 try {
>>>>>                     nextByte += ctb.flushAny(bb, nextByte, nBytes);
>>>>>                 }
>>>>>                 catch (ConversionBufferFullException x) {
>>>>>                     nextByte = ctb.nextByteIndex();
>>>>>                 }
>>>>>                 if (nextByte == 0)
>>>>>                     break;
>>>>>                 if (nextByte > 0) {
>>>>>                     out.write(bb, 0, nextByte);
>>>>>                     nextByte = 0;
>>>>>                 }
>>>>>             }
>>>>>         }
>>>>>     }
>>>>>
>>>>>     /**
>>>>>      * Flush the stream.
>>>>>      *
>>>>>      * @exception IOException If an I/O error occurs
>>>>>      */
>>>>>     public void flush() throws IOException {
>>>>>         synchronized (lock) {
>>>>>             flushBuffer();
>>>>>             out.flush();
>>>>>         }
>>>>>     }
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>