Different error decoding Shift-JIS sequence in JDK8

Thu Nov 28 09:31:50 UTC 2013

What incantation is required to get Sherman to offer his perspective? :-)

I accept that it may be after Thanksgiving, but I need to know the
situation since we have tests for JRuby that depended on this
character being valid Shift-JIS.

- Charlie

On Mon, Nov 25, 2013 at 4:08 AM, Seán Coffey <sean.coffey at oracle.com> wrote:
> Sherman can answer this best. The 8008386 fix for 8 differs from earlier
> updates since a lot of the code in this area was rewritten. The initial
> report was identified as a regression in JDK6. Back in 2005, the 6227339 fix
> changed behaviour so that invalid single byte characters were treated
> incorrectly when decoding Shift_JIS encoded bytes: two bytes were decoded
> to a "?" character rather than one. The valid single byte characters were
> lost as a result, and I believe this was all unintended when the 6227339
> fix was made.
>
> Changes made in 8008386 mean that the case of a malformed character (legal
> leading byte) followed by a valid single byte should now return a
> replacement character for the first malformed byte and a correctly decoded
> single byte character.
>
> regards,
> Sean.
>
>
> On 22/11/2013 13:20, Alan Bateman wrote:
>>
>> On 22/11/2013 11:02, Charles Oliver Nutter wrote:
>>>
>>> Apologies if this is not the correct place to post this, but i18n
>>> seemed more focused on languages and localization than the mechanics
>>> of transcoding.
>>>
>>> I have noticed a behavioral difference in JDK8 decoding a two-byte
>>> Shift-JIS sequence. Specifically, JDK8 appears to report malformed
>>> input for what should be a valid Shift-JIS sequence, where JDK7
>>> reported that the character was unmappable.
>>
>> I assume this is related to JDK-8008386 [1] and I'm sure Sherman or Sean
>> will jump in to explain this (which seems to be related to a long standing
>> regression).
>>
>> -Alan
>>
>> [1] https://bugs.openjdk.java.net/browse/JDK-8008386
>
>
>> Apologies if this is not the correct place to post this, but i18n
>> seemed more focused on languages and localization than the mechanics
>> of transcoding.
>>
>> I have noticed a behavioral difference in JDK8 decoding a two-byte
>> Shift-JIS sequence. Specifically, JDK8 appears to report malformed
>> input for what should be a valid Shift-JIS sequence, where JDK7
>> reported that the character was unmappable.
>>
>> The code to reproduce is fairly simple:
>>
>> import java.nio.*;
>> import java.nio.charset.*;
>>
>> byte[] bytes = {(byte)0xEF, 0x40};
>> CharsetDecoder decoder = Charset.forName("Shift-JIS").newDecoder();
>> System.out.println(decoder.decode(ByteBuffer.wrap(bytes),
>>     CharBuffer.allocate(2), false));
>>
>> Note that this is pumping the decoder directly and specifying partial
>> input (false). We use this mechanism in JRuby for transcoding
>> arbitrary byte[] from one encoding to another.
>>
>> The result of running this on JDK7 is "UNMAPPABLE[2]" while the result
>> on JDK8 is "MALFORMED[1]".
>>
>> Information online is spotty as to whether this sequence is valid. It
>> does appear on the table for [JIS X
>> 0213](http://x0213.org/codetable/sjis-0213-2004-std.txt) and several
>> articles on Shift-JIS claim that it is at worst undefined and at best
>> valid. So I'm leaning toward this being a bug in JDK8's Shift-JIS
>> decoder.
>>
>> Note that on JDK7 it is "unmappable", which may mean this code
>> represents a character with no equivalent in Unicode.
>>
>> I have uploaded my code to github here:
>> https://github.com/headius/jdk8_utf8_decoding_bug
>>
>> Thoughts?
>>
>> - Charlie
>
>
>
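
For anyone who wants to check both behaviours discussed in this thread on
their own JDK, here is a minimal sketch. The class name ShiftJisProbe and
the helpers report() and replace() are made up for illustration and are not
taken from the linked repository. report() repeats the reproduction from the
original message (default REPORT actions, endOfInput = false) and returns
the CoderResult as text; the thread gives "UNMAPPABLE[2]" for JDK7 and
"MALFORMED[1]" for JDK8. replace() uses Charset.decode, which substitutes
the charset's replacement string for bad input, so it shows whether the
trailing 0x40 survives as a separate character, as Sean describes for the
8008386 fix. It assumes the Shift_JIS charset is available in the runtime.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

// Hypothetical probe class, not part of JRuby or the JDK.
class ShiftJisProbe {

    // Runs a single decode step with the decoder's default REPORT actions
    // and returns the CoderResult as text, e.g. "UNDERFLOW" or "MALFORMED[1]".
    static String report(byte[] bytes) {
        CharsetDecoder decoder = Charset.forName("Shift_JIS").newDecoder();
        CoderResult result = decoder.decode(
                ByteBuffer.wrap(bytes),
                CharBuffer.allocate(bytes.length),
                false); // endOfInput = false, as in the original report
        return result.toString();
    }

    // Decodes the whole array with Charset.decode, which replaces malformed
    // and unmappable input instead of reporting it.
    static String replace(byte[] bytes) {
        return Charset.forName("Shift_JIS").decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xEF, 0x40};

        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("report:  " + report(bytes));

        // Print code points in hex so the replacement character is visible.
        StringBuilder hex = new StringBuilder();
        for (char c : replace(bytes).toCharArray()) {
            hex.append(String.format("U+%04X ", (int) c));
        }
        System.out.println("replace: " + hex.toString().trim());
    }
}
```

The output is expected to differ between JDK releases, which is the point of
the probe, so no fixed output is listed here.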