<i18n dev> Codereview request for 6995537: different behavior in iso-2022-jp encoding between jdk131/140/141 and jdk142/5/6/7
Xueming Shen
xueming.shen at oracle.com
Fri Feb 10 11:06:28 PST 2012
On 2/10/2012 12:31 AM, Masayoshi Okutsu wrote:
> I tend to agree with Sherman that the real problem is the
> OutputStreamWriter API which isn't good enough to handle various
> encodings. My understanding is that the charset API was introduced in
> 1.4 to deal with the limitations of the java.io and other encoding
> handling issues.
>
> I still don't think it's correct to change the flush semantics. The
> flush method should just flush out any outstanding data to the given
> output stream as defined in java.io.Writer. What if Writer.write(int)
> is used to write UTF-16 data with any stateful encoding? Suppose the
> stateful encoding can support supplementary characters which require
> other G0 designations, does the following calling sequence work?
>
> writer.write(highSurrogate);
> writer.flush();
> writer.write(lowSurrogate);
> writer.flush();
No, it does not work. But I would be less concerned about such a charset,
which we don't have anywhere around yet.
The real concern is that if you invoke the above sequence, the
implementation actually "buffers" the high surrogate in its internal
field "leftoverChar", and you will get an "incompatible" result for the
above invocation (for a charset that can handle surrogates, such as
UTF-8): "leftoverChar" would be processed as a single surrogate if you
"flush" the osw before writing the low surrogate. But saving the
"leftover" as internal state is kind of against OutputStreamWriter's
class spec, "Note that the characters passed to the write() methods are
not buffered", though I don't see any better solution for this scenario
(you really don't want OutputStreamWriter to have an explicit interface
to handle CoderResult...).
That said, the spec also specifies that
"A /malformed surrogate element/ is a high surrogate that is not
followed by a low surrogate or a low surrogate that is not preceded by
a high surrogate."
So arguably, based on the spec, you are not supposed to invoke "flush()"
between two paired surrogates if you want them to be treated as a
surrogate pair for a supplementary character.
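
To make the surrogate case concrete, here is a minimal sketch that drives
a UTF-8 CharsetEncoder directly. It is an illustration only, not code from
the JDK or the webrev; the class name LoneSurrogateDemo and its helper are
made up for the example. It shows that a lone high surrogate is merely an
underflow while more input may still follow, but is reported as malformed
once the encoder is told the input has ended, which is roughly the
situation an encoder-level flush between the two halves would create.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the JDK.
class LoneSurrogateDemo {
    // Feeds a single high surrogate to a fresh UTF-8 encoder and returns
    // the CoderResult. endOfInput tells the encoder whether more chars
    // may still arrive.
    static CoderResult encodeLoneHigh(boolean endOfInput) {
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
        CharBuffer in = CharBuffer.wrap(new char[] { '\uD83D' });
        ByteBuffer out = ByteBuffer.allocate(16);
        return enc.encode(in, out, endOfInput);
    }

    public static void main(String[] args) {
        System.out.println("more input may follow: " + encodeLoneHigh(false));
        System.out.println("input has ended:       " + encodeLoneHigh(true));
    }
}
```

A caller that keeps the pending high surrogate around (as "leftoverChar"
does) can pair it with the next char; a caller that ends the input there
cannot.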
This is what I have been debating with myself for months. As I said in
my previous email, one alternative is to have a "close ME only" method
for layered streams, but that is not going to solve the problem for any
previous releases; we are talking about 1.4.x, 1.5, 6, and 7. Another,
ugly, one is to have a "system property" to switch the behavior.
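
For reference, the only workaround available today (wrapping the
underlying stream in a subclass that overrides close(), as mentioned in
the quoted mail below) might look like the following sketch. The names
CloseMeOnlyDemo and NonClosingOutputStream are hypothetical, and the
ByteArrayOutputStream only stands in for a real stream such as a socket,
since its own close() has no effect anyway.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

// Hypothetical demo class, not part of the JDK or JavaMail.
class CloseMeOnlyDemo {
    // Shields the wrapped stream from close(): closing this layer only
    // flushes, so the underlying stream stays open for further writing.
    static final class NonClosingOutputStream extends FilterOutputStream {
        NonClosingOutputStream(OutputStream out) {
            super(out);
        }

        @Override
        public void write(byte[] b, int off, int len) throws IOException {
            out.write(b, off, len); // skip the default byte-at-a-time loop
        }

        @Override
        public void close() throws IOException {
            flush(); // deliberately do not call out.close()
        }
    }

    // Writes one Japanese character through a throwaway writer, closes
    // the writer, then keeps using the underlying stream.
    static byte[] writeJapaneseThenAscii() {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            Writer osw = new OutputStreamWriter(
                    new NonClosingOutputStream(bos), "iso-2022-jp");
            osw.write("\u6700");
            osw.close();    // close() makes the encoder emit its final escape
            bos.write('A'); // the underlying stream is still usable
            return bos.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        StringBuilder hex = new StringBuilder();
        for (byte b : writeJapaneseThenAscii()) {
            hex.append(String.format("%02x ", b & 0xff));
        }
        System.out.println(hex.toString().trim());
    }
}
```

This works on every release under discussion, but it pushes the burden
onto each application, which is why it is not a desirable answer.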
I'm not sure I understand your suggestion of "create a filter...". Are
you suggesting a new filter stream class in java.io to handle
"stateful encodings", or are you suggesting that an app like JavaMail
should write the filter stream subclass to deal with this issue?
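
For comparison, the java.nio.charset route mentioned in the quoted mail
below could look roughly like this for an application: encode with a
CharsetEncoder and call its flush() explicitly, so the switch back to
ASCII does not depend on closing any stream. This is a sketch under the
assumption that the ISO-2022-JP charset is available in the runtime; the
class EncoderFlushDemo is made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical demo class, not part of the JDK or JavaMail.
class EncoderFlushDemo {
    // Encodes one Japanese character and returns the number of bytes
    // produced before and after the explicit encoder flush.
    static int[] encodeAndFlush() {
        CharsetEncoder enc = Charset.forName("ISO-2022-JP").newEncoder();
        ByteBuffer out = ByteBuffer.allocate(32);

        // endOfInput = true: no more characters will follow.
        enc.encode(CharBuffer.wrap("\u6700"), out, true);
        int afterEncode = out.position();

        // flush() lets a stateful encoder write out any pending state,
        // such as the escape sequence that returns to ASCII mode.
        enc.flush(out);
        int afterFlush = out.position();

        return new int[] { afterEncode, afterFlush };
    }

    public static void main(String[] args) {
        int[] n = encodeAndFlush();
        System.out.println("bytes after encode: " + n[0]);
        System.out.println("bytes after flush:  " + n[1]);
    }
}
```

The two counts correspond to the escape-plus-character bytes and the
trailing escape sequence described in the original mail.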
-Sherman
>
> Of course, this isn't a problem with iso-2022-jp, though.
>
> I think it's a correct fix, not a workaround, to create a filter
> stream to deal with stateful encodings with the java.io API. If it's
> OK to support only 1.4 and later, the java.nio.charset API should be
> used.
>
> Thanks,
> Masayoshi
>
> On 2/10/2012 4:12 AM, Xueming Shen wrote:
>> CCed Bill Shannon.
>>
>> On 02/09/2012 11:10 AM, Xueming Shen wrote:
>>>
>>> CharsetEncoder has the "flush()" method as the last step (of a
>>> series of "encoding" steps) to flush out any internal state to the
>>> output buffer. The issue here is that the upper level wrapper class,
>>> OutputStreamWriter in our case, doesn't provide an "explicit"
>>> mechanism to let the user request a "flush" on the underlying
>>> encoder. The only "guaranteed" mechanism is the "close()" method,
>>> which appears not to be appropriate to invoke in some use scenarios,
>>> such as the JavaMail.writeTo() case.
>>>
>>> It appears we are lacking a "close this stream, but not the
>>> underlying stream" mechanism in our layered/chained streams. I have
>>> had similar requests for this kind of mechanism in other areas, such
>>> as zip/gzip streams: an app wraps an "outputstream" with zip/gzip and
>>> wants to release the zip/gzip layer after use (to release the native
>>> resource, for example) but keep the underlying stream unclosed. The
>>> only workaround now is to wrap the underlying stream with a subclass
>>> that overrides the "close()" method, which is really not desirable.
>>>
>>> OutputStreamWriter.flush() does not explicitly specify in its API
>>> doc whether it should actually flush the underlying charset encoder
>>> (so I would not argue strongly that this IS an SE bug), but given
>>> that it is flushing its buffer (internal state) and then the
>>> underlying "out" stream, it's reasonable to consider that the
>>> "internal state" of its encoder also needs to be flushed, especially
>>> since this was the behavior in releases earlier than 1.4.2.
>>>
>>> As I said, I have been hesitant to "fix" this problem for a while
>>> (one, it has been here for 3 major releases; two, the API does not
>>> explicitly say so), but as long as we don't have a reasonable
>>> "close-ME-only" mechanism for those layered streams, it appears to
>>> be a reasonable approach to solve the problem, without having an
>>> obvious negative impact.
>>>
>>> -Sherman
>>>
>>> PS: There is another implementation "detail": the original
>>> iso-2022-jp c2b converter actually restores the state back to ASCII
>>> mode at the end of its "convert" method. This makes the analysis a
>>> little complicated, but should not change the issue we are
>>> discussing.
>>>
>>>
>>> On 02/09/2012 12:26 AM, Masayoshi Okutsu wrote:
>>>> First of all, is this really a Java SE bug? The usage of
>>>> OutputStreamWriter in JavaMail seems to be wrong to me. The writeTo
>>>> method in the bug report doesn't seem to be able to deal with any
>>>> stateful encodings.
>>>>
>>>> Masayoshi
>>>>
>>>> On 2/9/2012 3:26 PM, Xueming Shen wrote:
>>>>> Hi
>>>>>
>>>>> This is a long-standing "regression" from 1.3.1 in how
>>>>> OutputStreamWriter.flush()/flushBuffer() handles escape or shift
>>>>> sequences in some charsets/encodings, for example ISO-2022-JP.
>>>>>
>>>>> ISO-2022-JP is an encoding that starts in ASCII mode and then
>>>>> switches between ASCII and Japanese characters through escape
>>>>> sequences. For example, the escape sequence ESC $ B (0x1b 0x24 0x42)
>>>>> is used to indicate that the following bytes are Japanese (switch
>>>>> from ASCII mode to Japanese mode), and ESC ( B (0x1b 0x28 0x42) is
>>>>> used to switch back to ASCII.
>>>>>
>>>>> Java's sun.io.CharToByteConverter (the old generation charset
>>>>> converter) and java.nio.charset.CharsetEncoder usually switch back
>>>>> and forth between ASCII and Japanese modes based on the input
>>>>> character sequence (for example, if you are in ASCII mode and your
>>>>> next input character is Japanese, you add ESC $ B to the output
>>>>> first, followed by the converted input character; or if you are in
>>>>> Japanese mode and your next input is ASCII, you output ESC ( B first
>>>>> to switch the mode and then the ASCII character), and switch back to
>>>>> ASCII mode (if the last mode is non-ASCII) when either the encoding
>>>>> is ending or the flush() method gets invoked.
>>>>>
>>>>> In JDK 1.3.1, OutputStreamWriter.flushBuffer() explicitly invokes
>>>>> sun.io c2b's flushAny() to switch back to ASCII mode every time
>>>>> flush() or flushBuffer() (from PrintStream) gets invoked, as shown
>>>>> at the end of this email. For example, the code below uses
>>>>> OutputStreamWriter to "write" a Japanese character \u6700 to the
>>>>> underlying stream with iso-2022-jp:
>>>>>
>>>>> ByteArrayOutputStream bos = new ByteArrayOutputStream();
>>>>> String str = "\u6700";
>>>>> OutputStreamWriter osw = new OutputStreamWriter(bos,
>>>>> "iso-2022-jp");
>>>>> osw.write(str, 0, str.length());
>>>>>
>>>>> Since iso-2022-jp starts in ASCII mode and we now have a Japanese
>>>>> character, we need to switch into Japanese mode first (the first 3
>>>>> bytes) and then write the encoded Japanese character (the following
>>>>> 2 bytes):
>>>>>
>>>>> 0x1b 0x24 0x42 0x3a 0x47
>>>>>
>>>>> and then the code invokes
>>>>>
>>>>> osw.flush();
>>>>>
>>>>> Since we are now in Japanese mode, the writer continues to write out
>>>>>
>>>>> 0x1b 0x28 0x42
>>>>>
>>>>> to switch back to ASCII mode. The total output is 8 bytes after
>>>>> write() and flush().
>>>>>
>>>>> However, when all encoding/charset related code was migrated from
>>>>> 1.3.1's sun.io based implementation to 1.4's java.nio.charset based
>>>>> implementation (over 1.4, 1.4.1 and 1.4.2, we gradually migrated
>>>>> from sun.io to java.nio.charset), the "c2b.flushAny()" invocation
>>>>> was evidently dropped in sun.nio.cs.StreamEncoder. As a result, the
>>>>> "switch back to ASCII mode" sequence is no longer output when
>>>>> OutputStreamWriter.flush() or PrintStream.write(String) is invoked.
>>>>>
>>>>> This does not cause a problem in most use scenarios, as long as the
>>>>> "stream" finally gets closed (in which case the StreamEncoder does
>>>>> invoke the encoder's flush() to output the escape sequence that
>>>>> switches back to ASCII) or PrintStream.println(String) is used
>>>>> (which outputs a \n character; since \n is in the ASCII range, it
>>>>> "accidentally" switches the mode back to ASCII).
>>>>>
>>>>> But it obviously causes a problem when you can't close the
>>>>> OutputStreamWriter after you're done with your iso-2022-jp writing
>>>>> (for example, you need to continue to use the underlying
>>>>> OutputStream for other writing, but not "this" osw). With 1.3.1,
>>>>> these apps invoke osw.flush() to force the output to switch back to
>>>>> ASCII; this no longer works since we switched to java.nio.charset
>>>>> in JDK 1.4.2 (we migrated iso-2022-jp to nio.charset in 1.4.2).
>>>>> This is what happened in JavaMail, as described in the bug report.
>>>>>
>>>>> The solution is to restore the "flush the encoder" mechanism in
>>>>> StreamEncoder's flushBuffer().
>>>>>
>>>>> I have been hesitant to make this change for a while, mostly
>>>>> because this regressed behavior has been there for 3 releases, and
>>>>> the change triggers yet another "behavior change". But given that
>>>>> there is no obvious workaround and it only changes the behavior of
>>>>> the charsets with this shift in/out mechanism, mainly the iso-2022
>>>>> family and those IBM EBCDIC_DBCS charsets, I decided to give it a
>>>>> try.
>>>>>
>>>>> Here is the webrev:
>>>>>
>>>>> http://cr.openjdk.java.net/~sherman/6995537/webrev
>>>>>
>>>>> Sherman
>>>>>
>>>>>
>>>>> ---------------------1.3.1 OutputStreamWriter---------------------
>>>>>     /**
>>>>>      * Flush the output buffer to the underlying byte stream,
>>>>>      * without flushing the byte stream itself. This method is
>>>>>      * non-private only so that it may be invoked by PrintStream.
>>>>>      */
>>>>>     void flushBuffer() throws IOException {
>>>>>         synchronized (lock) {
>>>>>             ensureOpen();
>>>>>
>>>>>             for (;;) {
>>>>>                 try {
>>>>>                     nextByte += ctb.flushAny(bb, nextByte, nBytes);
>>>>>                 }
>>>>>                 catch (ConversionBufferFullException x) {
>>>>>                     nextByte = ctb.nextByteIndex();
>>>>>                 }
>>>>>                 if (nextByte == 0)
>>>>>                     break;
>>>>>                 if (nextByte > 0) {
>>>>>                     out.write(bb, 0, nextByte);
>>>>>                     nextByte = 0;
>>>>>                 }
>>>>>             }
>>>>>         }
>>>>>     }
>>>>>
>>>>>     /**
>>>>>      * Flush the stream.
>>>>>      *
>>>>>      * @exception IOException If an I/O error occurs
>>>>>      */
>>>>>     public void flush() throws IOException {
>>>>>         synchronized (lock) {
>>>>>             flushBuffer();
>>>>>             out.flush();
>>>>>         }
>>>>>     }
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>