<i18n dev> Codereview request for 7082884: Incorrect UTF8 conversion for sequence ED 31
Xueming Shen
xueming.shen at oracle.com
Mon Sep 19 13:21:33 PDT 2011
Hi,
The Unicode Standard added "Additional Constraints on conversion of ill-formed
UTF-8" in version 5.1 [1] and updated it again in 6.0 with further
"clarification" [2] regarding how a conformant implementation should handle
an ill-formed UTF-8 byte sequence. Basically it says
(1) the conversion process should not interpret any ill-formed code
unit sequence,
(2) such a process must not treat any adjacent well-formed code unit
sequences as being part of those ill-formed code unit sequences,
(3) and it recommends the "best practice" of replacing each "maximal
valid subpart" (a short illustration follows).
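To make the three rules concrete, here is a small test using the input
the standard itself uses when describing the "maximal subpart" practice
(illustrative snippet only; the expected output assumes the JDK 7 decoder):

    public class MaxSubpartTest {
        public static void main(String[] args) throws Exception {
            // 61 | F1 80 80 | E1 80 | C2 | 62 : the three middle groups
            // are each a "maximal subpart" of an ill-formed sequence,
            // so each one is replaced by exactly one U+FFFD
            byte[] ba = { 0x61, (byte)0xf1, (byte)0x80, (byte)0x80,
                          (byte)0xe1, (byte)0x80, (byte)0xc2, 0x62 };
            String s = new String(ba, "UTF8");
            System.out.println("a\ufffd\ufffd\ufffdb".equals(s)); // true
        }
    }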
The new UTF-8 charset implementation we put into JDK 7 (and have
back-ported to previous releases since then) follows the new constraints
in most cases, except for 2 corner cases we are aware of so far.
7082884 is one of them.
The current implementation decodes

new String(new byte[]{(byte)0xed, 31}, "UTF8")

into the single char "\ufffd", while it should be "\ufffd\u001f" instead,
according to the new constraints (the first byte 0xed is ill-formed, and
the second byte 0x1f is well-formed, so it should not be consumed, even
though the first byte 0xed is the leading byte of a three-byte UTF-8
sequence).
The reason I call this a "corner case" is that the new UTF-8
implementation handles it correctly in most situations; for example

new String(new byte[]{(byte)0xed, 31, 'a'}, "UTF8");

does return the expected result "\ufffd\u001f\u0061".
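For anyone who wants to reproduce both cases, a quick way to see exactly
which chars come back (test code for illustration, not part of the webrev):

    public class Bug7082884Repro {
        public static void main(String[] args) throws Exception {
            print(new String(new byte[]{(byte)0xed, 31}, "UTF8"));
            // with the bug : fffd            (one char)
            // expected     : fffd 001f       (two chars)
            print(new String(new byte[]{(byte)0xed, 31, 'a'}, "UTF8"));
            // already right: fffd 001f 0061
        }
        static void print(String s) {
            for (char c : s.toCharArray())
                System.out.printf("%04x ", (int)c);
            System.out.println();
        }
    }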
The corner case here is that 0xed is the leading byte of a three-byte
UTF-8 sequence, but we actually have only 2 bytes total in the pipe. The
current UTF-8 decoder implementation will not even look into the
following bytes when it has the leading byte of a 3-byte UTF-8 sequence
and fewer than 3 bytes to work on. In this case it simply returns
"underflow", meaning "I need more bytes". Unfortunately its upper-level
implementation, CharsetDecoder, simply treats this "underflow" status as
one malformed byte sequence of length 2 (it is reasonable for
CharsetDecoder to make such a decision: as it sees it, the decoder does
not have enough bytes to handle these 2 bytes, and there are no more
bytes following, so the rest must all be "malformed").
The fix is to look further into the following bytes when we have a
leading byte, even when there are not enough bytes to complete the
conversion.
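In other words, something along these lines (a hypothetical sketch of
the idea only, not the actual webrev code; isContinuation/checkTail are
made-up names):

    import java.nio.ByteBuffer;
    import java.nio.charset.CoderResult;

    class LookAheadSketch {
        // is b a 10xxxxxx continuation byte?
        static boolean isContinuation(int b) {
            return (b & 0xc0) == 0x80;
        }
        // Called when the leading byte of a multi-byte sequence has
        // been consumed but fewer continuation bytes remain in src
        // than the sequence needs.  Instead of returning UNDERFLOW
        // right away, scan the bytes we do have: if one of them is not
        // a continuation byte, the malformed sequence ends there and
        // we can report its real length.
        static CoderResult checkTail(ByteBuffer src) {
            int pos = src.position();
            for (int i = 0; i < src.remaining(); i++) {
                if (!isContinuation(src.get(pos + i)))
                    // leading byte + i valid continuations
                    return CoderResult.malformedForLength(i + 1);
            }
            return CoderResult.UNDERFLOW; // all continuations so far;
                                          // genuinely need more input
        }
    }

For "ed 1f" this reports a malformed length of 1, so CharsetDecoder
replaces only the 0xed and then decodes 0x1f normally.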
The webrev is at
http://cr.openjdk.java.net/~sherman/7082884/webrev
Another corner case is how to deal with the old 5/6-byte sequences, such
as "fc 80 80 8f bf bf". We currently treat such a sequence as 1
malformed UTF-8 byte sequence, so any of these "old-form" 5/6-byte
sequences is treated as one malformed character and replaced by one
"\ufffd". But according to the new "best practice" recommendation, it
should probably be replaced by 6 \ufffd, if I understand the
recommendation correctly. Personally I feel the existing implementation
is the more reasonable approach; opinions?
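For concreteness (assuming the current JDK 7 decoder):

    byte[] old = { (byte)0xfc, (byte)0x80, (byte)0x80,
                   (byte)0x8f, (byte)0xbf, (byte)0xbf };
    String s = new String(old, "UTF8");
    System.out.println(s.length());
    // current implementation: 1  (the whole run is one malformed
    //                             sequence -> one \ufffd)
    // strict "maximal subpart" reading: 6  (0xfc is not a legal
    //                             leading byte, so each byte is its
    //                             own maximal subpart)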
Thanks
-Sherman
[1] http://www.unicode.org/versions/Unicode5.1.0/#Notable_Changes
[2] http://www.unicode.org/versions/Unicode6.0.0/#Conformance_Changes
[3] http://blogs.oracle.com/xuemingshen/entry/the_big_overhaul_of_java