Different error decoding Shift-JIS sequence in JDK8
Xueming Shen
xueming.shen at oracle.com
Fri Nov 29 19:25:39 UTC 2013
Hi Charles,
My apologies for the late response. I was on vacation the past week and
did not have full email access.
As Sean pointed out, this is triggered by the change we recently put in
for 8008386, in which we tried to address a strong request that a case
like 'fe' '40' be treated as 1 malformed byte + a mappable ASCII 40. The
reasoning appears to be that in a case like this, the decoder should
assume the first byte "fe" was incorrectly transferred during
communication; treating the two bytes as a pair causes valuable
information, the next byte, to be dropped. And this was a regression in
jdk6 (from jdk5).
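To make the two possible outcomes concrete, here is a minimal sketch that decodes the two bytes from the original report once and shows how to tell a malformed result from an unmappable one. The class and method names (`SjisProbe`, `probe`, `describe`) are made up for illustration; only the `java.nio.charset` calls are real API, and the result you see depends on the JDK you run it on.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

// Hypothetical helper class, not part of the JDK or JRuby.
public class SjisProbe {
    /** Decodes the bytes once as partial input (endOfInput=false) and returns the raw result. */
    public static CoderResult probe(byte[] bytes) {
        CharsetDecoder decoder = Charset.forName("Shift_JIS").newDecoder();
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer out = CharBuffer.allocate(bytes.length);
        return decoder.decode(in, out, false);
    }

    /** Turns a CoderResult into a short description; length() is only valid for errors. */
    public static String describe(CoderResult r) {
        if (r.isMalformed()) {
            return "malformed, length " + r.length();
        }
        if (r.isUnmappable()) {
            return "unmappable, length " + r.length();
        }
        return r.toString(); // UNDERFLOW or OVERFLOW
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xEF, 0x40}; // the sequence from the original report
        System.out.println(describe(probe(bytes)));
    }
}
```

An error length of 1 means the decoder rejected only the leading byte and will decode the following 0x40 separately; a length of 2 means it consumed both bytes as one undecodable pair.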
As a matter of fact, the reason we made the change in jdk6 was a case
similar to your use scenario :-( So it appears we are between a rock and
a hard place...
That said, I have to admit that in the case of fe 40, it might be more
reasonable to treat it as an unmappable 2-byte sequence, instead of a
malformed leading byte followed by a mappable ASCII byte.
I need to take a little more time to review the whole situation and see
if we can have some
compromise here.
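For callers that pump the decoder by hand, the choice between the two results changes what survives in the output. The following is only a sketch of such a caller, not JRuby's actual code: `LenientSjisDecode` is a hypothetical name, and it assumes the caller substitutes U+FFFD and skips exactly `CoderResult.length()` bytes on every error. Unlike the original report it passes `true` for end of input, so that a trailing lone lead byte is also reported as an error.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

// Hypothetical caller, for illustration only.
public class LenientSjisDecode {
    public static String decode(byte[] bytes) {
        CharsetDecoder decoder = Charset.forName("Shift_JIS").newDecoder();
        ByteBuffer in = ByteBuffer.wrap(bytes);
        // Every decoded char or replacement consumes at least one byte,
        // so a buffer of this size should not overflow.
        CharBuffer out = CharBuffer.allocate(bytes.length + 1);
        while (true) {
            CoderResult r = decoder.decode(in, out, true);
            if (!r.isError()) {
                break; // UNDERFLOW: all input has been consumed
            }
            // Substitute one replacement char, then skip exactly the bytes
            // the decoder reported as malformed or unmappable.
            out.put('\uFFFD');
            in.position(in.position() + r.length());
        }
        out.flip();
        return out.toString();
    }
}
```

With the input `'A', 0xEF, 0x40`, a decoder that reports a malformed result of length 1 leaves the trailing `@` in the output, while one that reports an unmappable result of length 2 drops it along with the lead byte.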
Btw, it would be helpful if you could provide a few more details
regarding your use scenario, as you mentioned in your email:
"We use this mechanism in JRuby for transcoding arbitrary byte[] from one
encoding to another."
Thanks!
-Sherman
On 11/28/13 1:31 AM, Charles Oliver Nutter wrote:
> What incantation is required to get Sherman to offer his perspective? :-)
>
> I accept that it may be after Thanksgiving, but I need to know the
> situation since we have tests for JRuby that depended on this
> character being valid Shift-JIS.
>
> - Charlie
>
> On Mon, Nov 25, 2013 at 4:08 AM, Seán Coffey <sean.coffey at oracle.com> wrote:
>> Sherman can answer this best. The 8008386 fix for 8 differs from earlier
>> updates since a lot of the code was rewritten in this area. The initial
>> report was identified as a regression in JDK6. Back in 2005, the 6227339 fix
>> changed behaviour which meant that invalid single byte characters were
>> treated incorrectly when decoding Shift_JIS encoded bytes. It meant that two
>> bytes are decoded to a "?" character rather than one. The valid single byte
>> characters are lost as a result and I believe this was all unintended when
>> the 6227339 fix was made.
>>
>> Changes made in 8008386 mean that the case of a malformed character (legal
>> leading byte) followed by a valid single byte should now return a
>> replacement character for the first malformed byte and a correctly decoded
>> single byte character.
>>
>> regards,
>> Sean.
>>
>>
>> On 22/11/2013 13:20, Alan Bateman wrote:
>>> On 22/11/2013 11:02, Charles Oliver Nutter wrote:
>>>> Apologies if this is not the correct place to post this, but i18n
>>>> seemed more focused on languages and localization than the mechanics
>>>> of transcoding.
>>>>
>>>> I have noticed a behavioral difference in JDK8 decoding a two-byte
>>>> Shift-JIS sequence. Specifically, JDK8 appears to report malformed
>>>> input for what should be a valid Shift-JIS sequence, where JDK7
>>>> reported that the character was unmappable.
>>> I assume this is related to JDK-8008386 [1] and I'm sure Sherman or Sean
>>> will jump in to explain this (which seems to be related to a long standing
>>> regression).
>>>
>>> -Alan
>>>
>>> [1] https://bugs.openjdk.java.net/browse/JDK-8008386
>>
>>> Apologies if this is not the correct place to post this, but i18n
>>> seemed more focused on languages and localization than the mechanics
>>> of transcoding.
>>>
>>> I have noticed a behavioral difference in JDK8 decoding a two-byte
>>> Shift-JIS sequence. Specifically, JDK8 appears to report malformed
>>> input for what should be a valid Shift-JIS sequence, where JDK7
>>> reported that the character was unmappable.
>>>
>>> The code to reproduce is fairly simple:
>>>
>>> byte[] bytes = {(byte)0xEF, 0x40};
>>> CharsetDecoder decoder = Charset.forName("Shift-JIS").newDecoder();
>>> System.out.println(decoder.decode(ByteBuffer.wrap(bytes),
>>> CharBuffer.allocate(2), false));
>>>
>>> Note that this is pumping the decoder directly and specifying partial
>>> input (false). We use this mechanism in JRuby for transcoding
>>> arbitrary byte[] from one encoding to another.
>>>
>>> The result of running this on JDK7 is "UNMAPPABLE[2]" while the result
>>> on JDK8 is "MALFORMED[1]".
>>>
>>> Information online is spotty as to whether this sequence is valid. It
>>> does appear on the table for [JIS X
>>> 0213](http://x0213.org/codetable/sjis-0213-2004-std.txt) and several
>>> articles on Shift-JIS claim that it is at worst undefined and at best
>>> valid. So I'm leaning toward this being a bug in JDK8's Shift-JIS
>>> decoder.
>>>
>>> Note that on JDK7 it is "unmappable", which may mean this code
>>> represents a character with no equivalent in Unicode.
>>>
>>> I have uploaded my code to github here:
>>> https://github.com/headius/jdk8_utf8_decoding_bug
>>>
>>> Thoughts?
>>>
>>> - Charlie
>>
>>