JNI UTF-8 encoding bug with some characters

Tue Jun 5 17:49:47 UTC 2012

Hi Ariel,

The Java UTF-8 charset (sun.nio.cs.UTF_8) was updated back in JDK 7 to
follow Unicode Corrigendum #1 [1] (CR#4486841) and was further updated in
JDK 8 (#7096080) to fully conform with the Standard. As a result, the Java
UTF-8 charset now encodes and decodes a supplementary character only as a
4-byte UTF-8 sequence. However, we did not do the same for the VM's JNI UTF-8
implementation, which still encodes/decodes a supplementary character as
6 bytes (a pair of surrogates, 3 bytes each). This was the decision we made
back then, on the assumption that the JNI UTF-8 is mainly for "internal"
information exchange (you are not supposed to use the result to exchange
information with an "external" system); as long as it provides a round-trip
conversion, it should not be an issue. The character you are using here is a
supplementary character, which is why you are seeing the difference.
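
To make the difference concrete, here is a minimal sketch that prints both
encodings of U+1F032 from plain Java, with no JNI involved. It assumes that
DataOutputStream.writeUTF, which is documented to write modified UTF-8, is a
fair stand-in for the 6-byte form that GetStringUTFChars produces. The class
and method names are made up for illustration, and it needs Java 8 or later.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, not part of the JDK or of the original test case.
class Utf8Comparison {

    // Lowercase hex dump of a byte array, no separators.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Standard UTF-8: a supplementary character becomes one 4-byte sequence.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8 as written by DataOutputStream.writeUTF: each surrogate
    // of the pair is encoded separately as 3 bytes. Used here as a stand-in
    // for the JNI encoding (an assumption, see the text above).
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        // writeUTF prepends a 2-byte length field; drop it.
        return Arrays.copyOfRange(all, 2, all.length);
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x1F032));
        System.out.println(hex(standardUtf8(s))); // f09f80b2
        System.out.println(hex(modifiedUtf8(s))); // eda0bcedb0b2
    }
}
```

The first line matches the f09f80b2 you get from String.getBytes("UTF-8"),
and the second matches the eda0bcedb0b2 in your GetStringUTFChars output.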

-Sherman

[1] http://www.unicode.org/versions/corrigendum1.html

On 06/05/2012 09:06 AM, Ariel Weisberg wrote:
> Hi,
>
> Here is a link to an updated test case that simplifies the string being
> tested to just the problem character, and fixes a bug in determining the
> length of the array returned by GetStringUTFChars.
>
> https://s3.amazonaws.com/com.voltdb.aweisberg/utf8_encoding_bug2.tgz
>
> Thanks,
> Ariel
>
> On Tue, Jun 5, 2012, at 11:38 AM, Ariel Weisberg wrote:
>> Hi all,
>>
>> Not sure what list this should go to.
>>
>> I found an issue with JNI's GetStringUTFChars which is supposed to
>> return a Java string in UTF-8 encoding. There is an attached test case.
>> I tested on Ubuntu 12.04 (Linux aweisberg-desktop 2.6.32-41-generic
>> #89-Ubuntu SMP Fri Apr 27 22:18:56 UTC 2012 x86_64 GNU/Linux) and CentOS
>> 5 (Linux volt3b 2.6.18-308.4.1.el5 #1 SMP Tue Apr 17 17:08:00 EDT 2012
>> x86_64 x86_64 x86_64 GNU/Linux) with JDK 6 update 32 and JDK 7 update 4.
>>
>> For the following string "â🀲x一xxéyyԱ" I find that the first character is
>> encoded correctly, but the second character
>> (http://www.fileformat.info/info/unicode/char/1f032/index.htm) comes out
>> with an invalid code point.
>>
>> The result of String.getBytes("UTF-8") is
>> c3a2f09f80b278e4b8807878c3a97979d4b1 and this matches the output I get
>> from defining the string as a constant in C++.
>>
>> The result of GetStringUTFChars is c3a2eda0bcedb0b278e4b8.
>>
>> See this test case
>> (https://s3.amazonaws.com/com.voltdb.aweisberg/utf8_encoding_bug.tgz)
>> for a reproducer and how I displayed the values.
>>
>> Thanks,
>> Ariel