possible problem with JNI GetStringUTFChars

Sat Jan 26 02:40:53 UTC 2019

On 26/01/2019 3:29 am, Alan Snyder wrote:
> My question was not about why it does what it does, but why it still does that. Is there a valid use of this primitive that depends upon it returning something other than true UTF-8?

It still does what it does because that was how it was specified 20+ 
years ago and there's been no reason to change.

> It may not have been an issue to you, but it was to me when I discovered my program could not handle certain file names. I’ll bet I’m not the last person to assume that a primitive named GetStringUTFChars returns UTF.

It does return chars in a UTF (Unicode transformation format) - that 
format is a modified UTF-8 format. It isn't named GetStringUTF8Chars.
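
To make the difference concrete, here is a minimal Java sketch, not part of the original exchange. It assumes that DataOutputStream.writeUTF is an acceptable stand-in for the encoding JNI produces: it writes the modified UTF-8 described in the java.io.DataInput documentation, preceded by a 2-byte length that the sketch strips. The class and method names are made up for illustration. The test character is U+1F37B, which is what the four standard UTF-8 bytes quoted in the original report decode to.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; the names are illustrative only.
public class ModifiedUtf8Demo {

    // Standard UTF-8: a supplementary character becomes one 4-byte sequence.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8 as written by DataOutputStream.writeUTF: each UTF-16
    // surrogate is encoded separately as 3 bytes, and U+0000 becomes 0xC0 0x80.
    // writeUTF rejects strings whose encoded form exceeds 65535 bytes.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] framed = buf.toByteArray();
        // Drop the 2-byte length prefix that writeUTF adds.
        return Arrays.copyOfRange(framed, 2, framed.length);
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x1F37B));
        System.out.println(Arrays.toString(standardUtf8(s)));
        // prints [-16, -97, -115, -69]
        System.out.println(Arrays.toString(modifiedUtf8(s)));
        // prints [-19, -96, -68, -19, -67, -69]
    }
}
```

The two printed arrays match the byte values reported at the start of this thread for String.getBytes() and GetStringUTFChars respectively.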

The documentation is quite clear:

GetStringUTFChars

const char * GetStringUTFChars(JNIEnv *env, jstring string, jboolean *isCopy);

Returns a pointer to an array of bytes representing the string in 
modified UTF-8 encoding.
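
For code that needs true UTF-8 on the native side, one of the paths mentioned in this thread is to do the conversion in Java and hand the bytes to native code. A minimal sketch of that approach follows, assuming the native method is declared to take a byte[] (a jbyteArray in JNI) instead of a String. The class and method names are hypothetical, and the trailing NUL and the embedded-NUL check are design assumptions so the native side can treat the buffer as a C string.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper; not part of the JDK or of JNI.
public class NativePathBytes {

    // Encode as standard UTF-8 and append a terminating zero byte.
    static byte[] toNulTerminatedUtf8(String s) {
        // Standard UTF-8 encodes U+0000 as a single zero byte, which would
        // truncate a C string, so reject it up front.
        if (s.indexOf('\0') >= 0) {
            throw new IllegalArgumentException("string contains an embedded NUL");
        }
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        // Arrays.copyOf pads the extra element with zero, giving the terminator.
        return Arrays.copyOf(utf8, utf8.length + 1);
    }
}
```

On the native side the array contents can then be read with the JNI functions GetByteArrayRegion or GetByteArrayElements and passed to the system call unchanged.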

---

David
-----

> I have fixed my code, so it's not an issue for me any more, but it seems like an unnecessary tarpit awaiting the unwary.
> 
> Just my 2c.
> 
>    Alan
> 
> 
>> On Jan 24, 2019, at 10:04 PM, David Holmes <david.holmes at oracle.com> wrote:
>>
>> On 25/01/2019 4:39 am, Alan Snyder wrote:
>>> Thank you. That post does explain what is happening, but leaves open the question of whether GetStringUTFChars should be changed.
>>> What is the value of the current implementation of GetStringUTFChars versus one that returns true UTF-8?
>>
>> Well, that's really a HotSpot question as it concerns JNI, but this is ancient history. There's little point musing over the "why" of decisions made back in the late 1990s. But I suspect the main reason is the avoidance of embedded NUL characters.
>>
>> The only bug report I can see on this (basically the same issue you are reporting) was back in 2004:
>>
>> https://bugs.openjdk.java.net/browse/JDK-5030776
>>
>> so it simply has not been an issue. As per the SO article that Claes referenced, anyone needing true UTF-8 has a couple of paths to achieve that.
>>
>> Cheers,
>> David
>> -----
>>
>>
>>>    Alan
>>>> On Jan 24, 2019, at 10:32 AM, Claes Redestad <claes.redestad at oracle.com> wrote:
>>>>
>>>> Hi Alan,
>>>>
>>>> GetStringUTFChars unfortunately doesn't give you true UTF-8, but a modified UTF-8 sequence
>>>> as used by the VM internally for historical reasons.
>>>>
>>>> See answers to this related question on SO (which contains links to official docs):
>>>> https://stackoverflow.com/questions/32205446/getting-true-utf-8-characters-in-java-jni
>>>>
>>>> HTH
>>>>
>>>> /Claes
>>>>
>>>> On 2019-01-24 19:23, Alan Snyder wrote:
>>>>> I am having a problem with file names that contain emojis when passed to a macOS system call.
>>>>>
>>>>> Things work when I convert the path to bytes in Java, but fail (file not found) when I convert the path to bytes in native code using GetStringUTFChars.
>>>>>
>>>>> For example, where String.getBytes() returns
>>>>>
>>>>> -16 -97 -115 -69
>>>>>
>>>>> GetStringUTFChars returns:
>>>>>
>>>>> -19 -96 -68 -19 -67 -69
>>>>>
>>>>> I’m not a UTF expert, so can someone say whether I should file a bug report?
>>>>>
>>>>> (Tested in JDK 9, 11, and a fairly recent 12)
>>>>>
>>>>
>>
>