possible problem with JNI GetStringUTFChars

Sat Jan 26 19:38:51 UTC 2019

My usage of GetStringUTFChars was to pass a String as a parameter to a system call that takes a NUL-terminated UTF-8 string (a file path). Obviously, the system call does not accept strings containing NUL. I suspect this is a common case.
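
To illustrate the mismatch, here is a minimal Java sketch that reproduces the two byte sequences quoted at the bottom of this thread. Those bytes decode to U+1F37B. The class name ModifiedUtf8Demo and the helper modifiedUtf8 are made up for illustration, and the sketch assumes that the modified UTF-8 written by DataOutputStream.writeUTF matches what GetStringUTFChars returns, as the DataInput documentation linked in Claes's message suggests.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ModifiedUtf8Demo {
    /**
     * Hypothetical helper: the modified UTF-8 bytes that
     * DataOutputStream.writeUTF produces, without its 2-byte length prefix.
     */
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    public static void main(String[] args) {
        String s = "\uD83C\uDF7B"; // U+1F37B as a surrogate pair

        // Standard UTF-8: one 4-byte sequence for the code point.
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));
        // prints [-16, -97, -115, -69]

        // Modified UTF-8: each surrogate is encoded separately in 3 bytes.
        System.out.println(Arrays.toString(modifiedUtf8(s)));
        // prints [-19, -96, -68, -19, -67, -69]

        // Modified UTF-8 also encodes U+0000 as two bytes, never as a 0 byte.
        System.out.println(Arrays.toString(modifiedUtf8("\0")));
        // prints [-64, -128]
    }
}
```

A system call that expects standard UTF-8 does not recognize the six-byte form as the same file name, which is consistent with the "file not found" failure.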

Therefore, my needs would be met by a (new) primitive that returns UTF-8 and fails if the String contains NUL.
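
Until something like that exists in JNI, the same contract can be approximated on the Java side and the resulting byte[] handed to native code. The following is only a sketch under that assumption: StrictUtf8 and toCString are hypothetical names, not an existing or proposed API.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class StrictUtf8 {
    /**
     * Hypothetical helper: returns the standard UTF-8 encoding of s followed
     * by a single terminating 0 byte. Throws IllegalArgumentException if s
     * contains U+0000 or cannot be encoded (for example, an unpaired surrogate).
     */
    static byte[] toCString(String s) {
        if (s.indexOf('\0') >= 0) {
            throw new IllegalArgumentException("string contains NUL");
        }
        ByteBuffer encoded;
        try {
            encoded = StandardCharsets.UTF_8.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(s));
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("string is not encodable as UTF-8", e);
        }
        int n = encoded.remaining();
        byte[] out = new byte[n + 1]; // the extra last byte stays 0
        encoded.get(out, 0, n);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(toCString("a")));
        // prints [97, 0]
        try {
            toCString("a\0b");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
            // prints rejected: string contains NUL
        }
    }
}
```

On the native side, the array contents could then be read with GetByteArrayElements or GetByteArrayRegion and passed to the system call unchanged.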

In addition, I would suggest either of these options:

(1) Document GetStringUTFChars as deprecated, introduce a new primitive GetStringCharsInternalRepresentationModifiedUTF, and use C support for deprecated members where available to provide compile-time warnings when GetStringUTFChars is used.

(2) Rename GetStringUTFChars to GetStringCharsInternalRepresentationModifiedUTF. I believe this is a binary-compatible change, but new builds of code that still uses the old name will fail, forcing developers to choose which behavior they really want.

  Alan

> On Jan 26, 2019, at 7:30 AM, Claes Redestad <claes.redestad at oracle.com> wrote:
> 
> Modified UTF-8 goes way back in terms of internal use in Java and its
> JVMs. It's the format used to store strings in class-files, and used as
> an internal representation in the HotSpot VM: various internal string
> tables, constant pools etc.
> 
> So any Java code that interacts with the VM needs to know how to convert
> back and forth between Java Strings and the VM's flavor of modified
> UTF-8. As long as the JVM speaks modified UTF-8 internally, we'll need
> the utilities to convert back and forth. Changing this fundamental
> design is likely to be way more trouble than it's ever worth.
> 
> As to "why does the VM do this!?", I'm too young to really know for sure,
> but it's fun to speculate.[1]
> 
> I think we all welcome constructive suggestions on how to help
> developers notice that the "UTF" JNI methods aren't what your intuition
> might tell you. I've been there myself and learned about modified UTF-8
> the hard way.
> 
> /Claes
> 
> [1]
> 
> It turns out there are a few obvious technical difficulties with UTF-8,
> especially dealing with strings that encode '\0' characters (a.k.a.
> null) in the context of C/C++ code. C-strings (char*) are null-
> terminated, and there's a lot of code and utilities that'd break or
> behave weirdly if you give them char*s with embedded nulls in them...
> 
> But UTF-8 is still mostly an attractive, compact encoding for the kind
> of strings JVMs care about: most of them are ASCII String literals for
> methods and fields encoded into classfiles, and UTF-8 encodes ASCII
> without any overhead!
> 
> But it allows null chars, and to support that you need to encode the
> length... Ugh, overhead! Can't have that! What to do?!
> 
> The designers likely thought it'd be less trouble modifying this new,
> shiny UTF-8 encoding to get something similar to it that disallows
> embedded nulls. And why not: it's only for Java/JVM-internal stuff
> no-one on the outside needs to know about, right? And it's *mostly*
> compatible. And no-one uses real UTF-8, anyhow!
> 
> The context here is that Unicode and UTF-8 were still relatively new
> (RFCs filed 1993 and 1996, respectively). The fact that it'd eventually
> become the de facto encoding standard was not something anyone could
> have known back then.
> 
> As it happens, "modified UTF-8" took root in the emerging world of JVMs,
> and spread to a number of surprising places throughout the Java SE
> libraries, like java.io.DataInput/Output:
> https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/io/DataInput.html#modified-utf-8 
> 
> Today, in C++14, std::string supports embedding '\0' values, and is thus
> much more UTF-8 friendly than good old C-strings. I find it unlikely
> that "modified UTF-8" would be a thing if the JVM was designed from
> scratch today ("C++ in 2019!?").
> 
> On 2019-01-26 05:24, Alan Snyder wrote:
>> The reason to change is that returning UTF-8 is useful and returning “modified UTF-8” is apparently not (as no one has explained why it is useful).
>> Why not deprecate it?
>> It would be nice to get a warning.
>>   Alan
>>> On Jan 25, 2019, at 6:40 PM, David Holmes <david.holmes at oracle.com> wrote:
>>> 
>>> On 26/01/2019 3:29 am, Alan Snyder wrote:
>>>> My question was not about why it does what it does, but why it still does that. Is there a valid use of this primitive that depends upon it returning something other than true UTF-8?
>>> 
>>> It still does what it does because that was how it was specified 20+ years ago and there's been no reason to change.
>>> 
>>>> It may not have been an issue to you, but it was to me when I discovered my program could not handle certain file names. I’ll bet I’m not the last person to assume that a primitive named GetStringUTFChars returns UTF.
>>> 
>>> It does return chars in a UTF (Unicode transformation format) - that format is a modified UTF-8 format. It isn't named GetStringUTF8Chars.
>>> 
>>> The documentation is quite clear:
>>> 
>>> GetStringUTFChars
>>> 
>>> const char * GetStringUTFChars(JNIEnv *env, jstring string, jboolean *isCopy);
>>> 
>>> Returns a pointer to an array of bytes representing the string in modified UTF-8 encoding.
>>> 
>>> ---
>>> 
>>> David
>>> -----
>>> 
>>>> I have fixed my code, so it's not an issue for me any more, but it seems like an unnecessary tarpit awaiting the unwary.
>>>> Just my 2c.
>>>>   Alan
>>>>> On Jan 24, 2019, at 10:04 PM, David Holmes <david.holmes at oracle.com> wrote:
>>>>> 
>>>>> On 25/01/2019 4:39 am, Alan Snyder wrote:
>>>>>> Thank you. That post does explain what is happening, but leaves open the question of whether GetStringUTFChars should be changed.
>>>>>> What is the value of the current implementation of GetStringUTFChars versus one that returns true UTF-8?
>>>>> 
>>>>> Well, that's really a HotSpot question as it concerns JNI, but this is ancient history. There's little point musing over the "why" of decisions made back in the late 1990s. But I suspect the main reason is the avoidance of embedded NUL characters.
>>>>> 
>>>>> The only bug report I can see on this (basically the same issue you are reporting) was back in 2004:
>>>>> 
>>>>> https://bugs.openjdk.java.net/browse/JDK-5030776
>>>>> 
>>>>> so it simply has not been an issue. As per the SO article that Claes referenced, anyone needing true UTF-8 has a couple of paths to achieve that.
>>>>> 
>>>>> Cheers,
>>>>> David
>>>>> -----
>>>>> 
>>>>> 
>>>>>>   Alan
>>>>>>> On Jan 24, 2019, at 10:32 AM, Claes Redestad <claes.redestad at oracle.com> wrote:
>>>>>>> 
>>>>>>> Hi Alan,
>>>>>>> 
>>>>>>> GetStringUTFChars unfortunately doesn't give you true UTF-8, but a modified UTF-8 sequence
>>>>>>> as used by the VM internally for historical reasons.
>>>>>>> 
>>>>>>> See answers to this related question on SO (which contains links to official docs):
>>>>>>> https://stackoverflow.com/questions/32205446/getting-true-utf-8-characters-in-java-jni
>>>>>>> 
>>>>>>> HTH
>>>>>>> 
>>>>>>> /Claes
>>>>>>> 
>>>>>>> On 2019-01-24 19:23, Alan Snyder wrote:
>>>>>>>> I am having a problem with file names that contain emojis when passed to a macOS system call.
>>>>>>>> 
>>>>>>>> Things work when I convert the path to bytes in Java, but fail (file not found) when I convert the path to bytes in native code using GetStringUTFChars.
>>>>>>>> 
>>>>>>>> For example, where String.getBytes() returns
>>>>>>>> 
>>>>>>>> -16 -97 -115 -69
>>>>>>>> 
>>>>>>>> GetStringUTFChars returns:
>>>>>>>> 
>>>>>>>> -19 -96 -68 -19 -67 -69
>>>>>>>> 
>>>>>>>> I’m not a UTF expert, so can someone say whether I should file a bug report?
>>>>>>>> 
>>>>>>>> (Tested in JDK 9, 11, and a fairly recent 12)
>>>>>>>> 
>>>>>>> 
>>>>> 
>>> 
>