possible problem with JNI GetStringUTFChars

Sat Jan 26 15:30:06 UTC 2019

Modified UTF-8 goes way back in terms of internal use in Java and its
JVMs. It's the format used to store strings in class files, and it's
used as an internal representation in the HotSpot VM: various internal
string tables, constant pools, etc.

So any Java code that interacts with the VM needs to know how to convert
back and forth between Java Strings and the VM's flavor of modified
UTF-8. As long as the JVM speaks modified UTF-8 internally, we'll need
the utilities to convert back and forth. Changing this fundamental
design is likely to be way more trouble than it's ever worth.

As to "why does the VM do this!?", I'm too young to really know for sure,
but it's fun to speculate.[1]

I think we all welcome constructive suggestions on how to help
developers notice that the "UTF" JNI methods don't do what intuition
might suggest. I've been there myself and learned about modified UTF-8
the hard way.
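
For anyone who needs true UTF-8 on the native side today, one of the
paths mentioned in this thread is to do the conversion in Java and hand
a byte[] to native code. A minimal sketch of that idea follows. The
class name, the helper name and the native method nativeOpen are
hypothetical and only there for illustration; the native side would
read the array with the usual JNI byte-array accessors instead of
GetStringUTFChars.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class NativeUtf8 {
    // Hypothetical native entry point (not a real API); declared only to
    // show where the byte[] would go. It is never called in this sketch.
    private static native int nativeOpen(byte[] nulTerminatedUtf8Path);

    // Encode as standard UTF-8 and append a single NUL terminator so the
    // bytes can be used as a C string on the native side.
    static byte[] toNulTerminatedUtf8(String s) {
        if (s.indexOf('\0') >= 0) {
            // A C string cannot carry an embedded NUL, so reject it up front.
            throw new IllegalArgumentException("embedded NUL in string");
        }
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        // copyOf pads the extra slot with 0, which is the terminator.
        return Arrays.copyOf(utf8, utf8.length + 1);
    }

    public static void main(String[] args) {
        // U+1F37B, written as a surrogate pair.
        byte[] bytes = toNulTerminatedUtf8("\uD83C\uDF7B");
        System.out.println(Arrays.toString(bytes));
    }
}
```

The explicit check for '\0' is a design choice of this sketch: failing
loudly seemed preferable to silently truncating the string in C code.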

/Claes

[1]

It turns out there are a few obvious technical difficulties with UTF-8,
especially when dealing with strings that contain '\0' characters
(a.k.a. NUL) in the context of C/C++ code. C strings (char*) are
NUL-terminated, and there's a lot of code and utilities that'd break or
behave weirdly if you give them char*s with embedded NULs in them...

But UTF-8 is still mostly an attractive, compact encoding for the kind
of strings JVMs care about: most of them are ASCII string literals for
methods and fields encoded into class files, and UTF-8 encodes ASCII
without any overhead!

But it allows NUL chars, and to support that you need to encode the
length... Ugh, overhead! Can't have that! What to do?!
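
To make the NUL problem concrete, here is a small sketch comparing the
two encodings of a string with an embedded '\0'. It uses
DataOutputStream.writeUTF as a convenient way to get modified UTF-8
from plain Java; the class and helper names are made up for this
example, and the two-byte length prefix that writeUTF emits is
stripped so only the encoded characters remain.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EmbeddedNulDemo {
    // Standard UTF-8: U+0000 is encoded as a single 0x00 byte.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8 via DataOutputStream.writeUTF, minus the 2-byte
    // length prefix that writeUTF puts in front of the data.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Render bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "a\0b";
        System.out.println("UTF-8:          " + hex(standardUtf8(s))); // 61 00 62
        System.out.println("modified UTF-8: " + hex(modifiedUtf8(s))); // 61 C0 80 62
    }
}
```

The modified form spends two bytes (C0 80) on U+0000, so the encoded
data never contains a 0x00 byte and can be NUL-terminated safely.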

The designers likely thought it'd be less trouble to modify this new,
shiny UTF-8 encoding to get something similar that disallows embedded
NULs. And why not: it's only for Java/JVM-internal stuff no one on the
outside needs to know about, right? And it's *mostly* compatible. And
no one uses real UTF-8, anyhow!

The context here is that Unicode and UTF-8 were still relatively new
(UTF-8 was first presented in 1993 and published as RFC 2044 in 1996).
The fact that it'd eventually become the de facto encoding standard was
not something anyone could have known back then.

As it happens, "modified UTF-8" took root in the emerging world of JVMs,
and spread to a number of surprising places throughout the Java SE
libraries, like java.io.DataInput/Output:
https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/io/DataInput.html#modified-utf-8 
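
As a sketch of what that looks like from plain Java, the snippet below
encodes the character from the original report (U+1F37B) both ways. The
class and method names are invented for this example; writeUTF's
two-byte length prefix is dropped so the output lines up with the byte
values quoted further down in the thread.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SurrogateDemo {
    // Standard UTF-8: one supplementary code point becomes 4 bytes.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8 via DataOutputStream.writeUTF: each UTF-16
    // surrogate is encoded separately as 3 bytes, 6 bytes in total.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        // Skip the 2-byte length prefix written by writeUTF.
        return Arrays.copyOfRange(all, 2, all.length);
    }

    public static void main(String[] args) {
        String s = "\uD83C\uDF7B"; // U+1F37B as a surrogate pair
        // [-16, -97, -115, -69]
        System.out.println(Arrays.toString(standardUtf8(s)));
        // [-19, -96, -68, -19, -67, -69]
        System.out.println(Arrays.toString(modifiedUtf8(s)));
    }
}
```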

Today, in C++14, std::string supports embedding '\0' values, and is thus
much more UTF-8 friendly than good old C strings. I find it unlikely
that "modified UTF-8" would be a thing if the JVM were designed from
scratch today ("C++ in 2019!?").

On 2019-01-26 05:24, Alan Snyder wrote:
> The reason to change is that returning UTF-8 is useful and returning “modified UTF-8” is apparently not (as no one has explained why it is useful).
> 
> Why not deprecate it?
> 
> It would be nice to get a warning.
> 
>    Alan
> 
> 
>> On Jan 25, 2019, at 6:40 PM, David Holmes <david.holmes at oracle.com> wrote:
>>
>> On 26/01/2019 3:29 am, Alan Snyder wrote:
>>> My question was not about why it does what it does, but why it still does that. Is there a valid use of this primitive that depends upon it returning something other than true UTF-8?
>>
>> It still does what it does because that was how it was specified 20+ years ago and there's been no reason to change.
>>
>>> It may not have been an issue to you, but it was to me when I discovered my program could not handle certain file names. I’ll bet I’m not the last person to assume that a primitive named GetStringUTFChars returns UTF.
>>
>> It does return chars in a UTF (Unicode transformation format) - that format is a modified UTF-8 format. It isn't named GetStringUTF8Chars.
>>
>> The documentation is quite clear:
>>
>> GetStringUTFChars
>>
>> const char * GetStringUTFChars(JNIEnv *env, jstring string, jboolean *isCopy);
>>
>> Returns a pointer to an array of bytes representing the string in modified UTF-8 encoding.
>>
>> ---
>>
>> David
>> -----
>>
>>> I have fixed my code, so it's not an issue for me any more, but it seems like an unnecessary tarpit awaiting the unwary.
>>> Just my 2c.
>>>    Alan
>>>> On Jan 24, 2019, at 10:04 PM, David Holmes <david.holmes at oracle.com> wrote:
>>>>
>>>> On 25/01/2019 4:39 am, Alan Snyder wrote:
>>>>> Thank you. That post does explain what is happening, but leaves open the question of whether GetStringUTFChars should be changed.
>>>>> What is the value of the current implementation of GetStringUTFChars versus one that returns true UTF-8?
>>>>
>>>> Well that's really a HotSpot question as it concerns JNI, but this is ancient history. There's little point musing over the "why" of decisions made back in the late 1990s. But I suspect the main reason is the avoidance of embedded NUL characters.
>>>>
>>>> The only bug report I can see on this (basically the same issue you are reporting) was back in 2004:
>>>>
>>>> https://bugs.openjdk.java.net/browse/JDK-5030776
>>>>
>>>> so it simply has not been an issue. As per the SO article that Claes referenced anyone needing true UTF8 has a couple of paths to achieve that.
>>>>
>>>> Cheers,
>>>> David
>>>> -----
>>>>
>>>>
>>>>>    Alan
>>>>>> On Jan 24, 2019, at 10:32 AM, Claes Redestad <claes.redestad at oracle.com> wrote:
>>>>>>
>>>>>> Hi Alan,
>>>>>>
>>>>>> GetStringUTFChars unfortunately doesn't give you true UTF-8, but a modified UTF-8 sequence
>>>>>> as used by the VM internally for historical reasons.
>>>>>>
>>>>>> See answers to this related question on SO (which contains links to official docs):
>>>>>> https://stackoverflow.com/questions/32205446/getting-true-utf-8-characters-in-java-jni
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> /Claes
>>>>>>
>>>>>> On 2019-01-24 19:23, Alan Snyder wrote:
>>>>>>> I am having a problem with file names that contain emojis when passed to a macOS system call.
>>>>>>>
>>>>>>> Things work when I convert the path to bytes in Java, but fail (file not found) when I convert the path to bytes in native code using GetStringUTFChars.
>>>>>>>
>>>>>>> For example, where String.getBytes() returns
>>>>>>>
>>>>>>> -16 -97 -115 -69
>>>>>>>
>>>>>>> GetStringUTFChars returns:
>>>>>>>
>>>>>>> -19 -96 -68 -19 -67 -69
>>>>>>>
>>>>>>> I’m not a UTF expert, so can someone say whether I should file a bug report?
>>>>>>>
>>>>>>> (Tested in JDK 9, 11, and a fairly recent 12)
>>>>>>>
>>>>>>
>>>>
>>
>