Different error decoding Shift-JIS sequence in JDK8

Mon Nov 25 10:08:34 UTC 2013

Sherman can answer this best. The 8008386 fix for 8 differs from earlier 
updates since a lot of the code in this area was rewritten. The initial 
report was identified as a regression in JDK6. Back in 2005, the 6227339 
fix changed behaviour so that invalid single byte characters were 
treated incorrectly when decoding Shift_JIS encoded bytes: two bytes 
were decoded to a single "?" character rather than just one byte. The 
valid single byte character that followed was lost as a result, and I 
believe this was all unintended when the 6227339 fix was made.

Changes made in 8008386 mean that the case of a malformed character 
(legal leading byte) followed by a valid single byte should now return a 
replacement character for the first malformed byte and a correctly 
decoded single byte character.
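
As a rough illustration of that case, here is a minimal sketch. The class 
name, the helper names and the byte pair 0x81 0x30 are my own choices for 
illustration and are not taken from the bug report. 0x81 is a legal 
Shift_JIS leading byte, while 0x30 ('0') is not a legal trailing byte but 
is a valid single byte. With the 8008386 behaviour I would expect the raw 
decode to report one malformed byte, and the REPLACE variant to produce 
the replacement character followed by '0'.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

// Hypothetical sketch class, not part of the JDK or of the bug report.
class ShiftJisMalformedSketch {

    // Assumed sample input: legal leading byte 0x81, then '0' (0x30),
    // which is a valid single byte but not a valid trailing byte.
    static final byte[] SAMPLE = {(byte) 0x81, 0x30};

    // Pump the decoder directly (default REPORT action, partial input)
    // and return the CoderResult it reports for the first problem.
    static CoderResult rawResult(byte[] bytes) {
        CharsetDecoder decoder = Charset.forName("Shift_JIS").newDecoder();
        return decoder.decode(ByteBuffer.wrap(bytes),
                CharBuffer.allocate(bytes.length), false);
    }

    // Decode with REPLACE so bad input becomes the replacement string
    // instead of stopping the decode.
    static String decodeReplacing(byte[] bytes) {
        CharsetDecoder decoder = Charset.forName("Shift_JIS").newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not expected with REPLACE, but the API declares it.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rawResult(SAMPLE));
        // Print each decoded char as a code point for easy comparison.
        for (char c : decodeReplacing(SAMPLE).toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

Running this on a pre-fix update and on 8 should make the difference 
visible: on the older builds the trailing '0' is swallowed, on 8 it 
should survive.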

regards,
Sean.

On 22/11/2013 13:20, Alan Bateman wrote:
> On 22/11/2013 11:02, Charles Oliver Nutter wrote:
>> Apologies if this is not the correct place to post this, but i18n
>> seemed more focused on languages and localization than the mechanics
>> of transcoding.
>>
>> I have noticed a behavioral difference in JDK8 decoding a two-byte
>> Shift-JIS sequence. Specifically, JDK8 appears to report malformed
>> input for what should be a valid Shift-JIS sequence, where JDK7
>> reported that the character was unmappable.
> I assume this is related to JDK-8008386 [1] and I'm sure Sherman or 
> Sean will jump in to explain this (which seems to be related to a long 
> standing regression).
>
> -Alan
>
> [1] https://bugs.openjdk.java.net/browse/JDK-8008386

> Apologies if this is not the correct place to post this, but i18n
> seemed more focused on languages and localization than the mechanics
> of transcoding.
>
> I have noticed a behavioral difference in JDK8 decoding a two-byte
> Shift-JIS sequence. Specifically, JDK8 appears to report malformed
> input for what should be a valid Shift-JIS sequence, where JDK7
> reported that the character was unmappable.
>
> The code to reproduce is fairly simple:
>
> import java.nio.ByteBuffer;
> import java.nio.CharBuffer;
> import java.nio.charset.*;
>
> byte[] bytes = {(byte)0xEF, 0x40};
> CharsetDecoder decoder = Charset.forName("Shift-JIS").newDecoder();
> System.out.println(decoder.decode(ByteBuffer.wrap(bytes),
>         CharBuffer.allocate(2), false));
>
> Note that this is pumping the decoder directly and specifying partial
> input (false). We use this mechanism in JRuby for transcoding
> arbitrary byte[] from one encoding to another.
>
> The result of running this on JDK7 is "UNMAPPABLE[2]" while the result
> on JDK8 is "MALFORMED[1]".
>
> Information online is spotty as to whether this sequence is valid. It
> does appear on the table for [JIS X
> 0213](http://x0213.org/codetable/sjis-0213-2004-std.txt) and several
> articles on Shift-JIS claim that it is at worst undefined and at best
> valid. So I'm leaning toward this being a bug in JDK8's Shift-JIS
> decoder.
>
> Note that on JDK7 it is "unmappable", which may mean this code
> represents a character with no equivalent in Unicode.
>
> I have uploaded my code to github here:
> https://github.com/headius/jdk8_utf8_decoding_bug
>
> Thoughts?
>
> - Charlie