Codereview request for 7082884: Incorrect UTF8 conversion for sequence ED 31

Mon Sep 19 20:21:33 UTC 2011

Hi,

Unicode Standard added "Addition Constraints on conversion of ill-formed 
UTF-8"
in version 5.1 [1] and updated in 6.0 again with further "clarification" 
[2] regarding
how a "conformance" implementation should handle ill-formed UTF-8 byte
sequence. Basically it says

  (1) the conversion process should not interprets any ill-formed code 
unit sequence
*(2) such process must not treat any adjacent well-formed code unit 
sequences
      as being part of those ill-formed code unit sequences
  (3) and recommend a "best practice" of "maximal valid subpart" for 
replacement

The new UTF-8 charset implementation we put in JDK7 (and back-ported  to 
previous
release since then) follows the new constraints in most cases. Except 2 
corner cases
(we are aware of so far for now"). 7082884 is one of them.

The current implementation decode

new String(new byte[]{(byte)0xed, 31}, "UTF8")

into one single char "\ufffd".

while it should be "\ufffd\u001f" instead, according to the new 
constraints (the first
0xed is ill-formed, and the second 0x1f is still valid non-ill-formed, 
so it should not
be consumed, even the first byte 0xed is a leading byte a three-byte 
utf-8 byte
sequence).

The reason I called it a "corner case" is because then new UTF-8 
implementation handles
it correctly in most cases, for example

new String(new byte[]{(byte)0xed, 31, 'a'}, "UTF8");

does return the expected result "\ufffd\u001f\u0061"

The corner case here is that the 0xed is the leading byte of a 
three-byte utf-8 byte sequence,
but we actually only 2 bytes total in pipe. The current UTF-8 decoder 
implementation
will not even look into the following bytes when it has a leading byte 
of a 3-byte utf-8
sequence and it has less than 3 bytes to work on. In this case it simply 
returns "underflow",
means "I need more bytes". Unfortunately its upper level implementation, 
CharsetDecoder,
will simply treat this "underflow" status as a malformed byte sequence 
of "2" (it's reasonable
for CharsetDecoder to make such a decision as well, see the decoder does 
not have enough
bytes to handle these 2 bytes, and we don't have any more bytes 
following, so the rest are
all "malformed").

The fix is to further look into the following bytes when we have a 
leading byte, even don't
have enough bytes to complete the conversion.

The webrev is at

http://cr.openjdk.java.net/~sherman/7082884/webrev

Another corner case is how to deal with the old 5-6 bytes byte sequence, 
such as
"fc 80 80 8f bf bf", we are now treating them as 1 malformed utf-8 byte 
sequence, so any
of these 5-6 bytes "old" formed will be treated one malformed character 
and then replaced
by one "\ufffd". But according to the new "best practice" 
recommendation, it probably should
be replaced by 6 \ufffd. if I understand the recommendation correctly.  
Personally I feel the
existing implementation is a more reasonable approach, opinion?

Thanks
-Sherman

[1] http://www.unicode.org/versions/Unicode5.1.0/#Notable_Changes
[2] http://www.unicode.org/versions/Unicode6.0.0/#Conformance_Changes
[3] http://blogs.oracle.com/xuemingshen/entry/the_big_overhaul_of_java