possible problem with JNI GetStringUTFChars

Mon Jan 28 22:10:07 UTC 2019

On 1/26/19 3:19 PM, David Holmes wrote:
> On 27/01/2019 3:08 am, Martin Buchholz wrote:
>> It's a pet peeve that the name GetStringUTFChars is deeply misleading -
>> there are many "UTF"s, and this encoding is meant for use with the JVM
>> only.  The documentation should make it clearer that this is NOT the UTF-8
>> you might expect.
> 
> It does!
> 
> GetStringUTFChars
> 
> const char * GetStringUTFChars(JNIEnv *env, jstring string, jboolean *isCopy);
> 
> Returns a pointer to an array of bytes representing the string in modified UTF-8 
> encoding.

This is pretty easy to miss, especially if you're not aware that the JVM and the 
JDK have this special concept of "modified UTF-8". Perhaps emphasis should be 
added. Or maybe occurrences of "modified UTF-8" should be changed to be links to 
the section in chapter 3 of the JNI spec where "modified UTF-8" is defined. 
(Making the occurrences be links might be emphasis enough.)

I think it would be far too troublesome to try to migrate the JNI methods to 
process real UTF-8 instead of modified UTF-8. That raises the question, though: 
is there a use case for processing real UTF-8 within JNI? For example, for 
interoperating with external components that expect real UTF-8. If so, perhaps 
some conversion methods could be added.
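
As a rough illustration of what such a conversion could look like, here is a sketch done on the Java side rather than in JNI itself. The class and method names are hypothetical, not an existing or proposed API. It assumes the input is raw modified UTF-8 (for example, bytes copied from GetStringUTFChars) and is at most 65535 bytes long, because it borrows the DataInputStream.readUTF decoder, which expects a 2-byte length prefix.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; not part of the JDK or JNI.
class ModifiedUtf8Converter {

    /** Converts raw modified UTF-8 bytes to standard UTF-8 bytes. */
    static byte[] toStandardUtf8(byte[] modified) {
        if (modified.length > 0xFFFF) {
            // readUTF carries the length in an unsigned 16-bit prefix.
            throw new IllegalArgumentException("input longer than 65535 bytes");
        }
        // Prepend the big-endian 2-byte length that readUTF expects.
        byte[] framed = new byte[modified.length + 2];
        framed[0] = (byte) (modified.length >>> 8);
        framed[1] = (byte) modified.length;
        System.arraycopy(modified, 0, framed, 2, modified.length);

        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(framed))) {
            // Decode modified UTF-8 to a String, then re-encode as standard UTF-8.
            return in.readUTF().getBytes(StandardCharsets.UTF_8);
        } catch (IOException e) {
            // Includes UTFDataFormatException for malformed input.
            throw new UncheckedIOException(e);
        }
    }
}
```

Native code that needs real UTF-8 today can get a similar effect by calling String.getBytes with the UTF-8 charset through JNI and reading the resulting byte array.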

(From Java code, the Charset encoders/decoders handle real UTF-8, which seems to 
cover most cases. Modified UTF-8 occurs only within serialization and 
Data{Input,Output}Stream.)
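
To make the difference concrete, here is a small sketch that encodes the same string both ways. The class and method names are made up for the example. It uses U+1F37B and U+0000 as inputs, since supplementary characters and NUL are the two cases where the encodings disagree.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only; the names here are not from the JDK.
class Utf8Comparison {

    /** Standard UTF-8, as produced by the Charset encoders. */
    static byte[] realUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Modified UTF-8, as written by DataOutputStream.writeUTF, without its 2-byte length prefix. */
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    public static void main(String[] args) {
        String mugs = new String(Character.toChars(0x1F37B));
        System.out.println(Arrays.toString(realUtf8(mugs)));       // [-16, -97, -115, -69]
        System.out.println(Arrays.toString(modifiedUtf8(mugs)));   // [-19, -96, -68, -19, -67, -69]
        System.out.println(Arrays.toString(realUtf8("\0")));       // [0]
        System.out.println(Arrays.toString(modifiedUtf8("\0")));   // [-64, -128]
    }
}
```

Real UTF-8 encodes the supplementary character as one 4-byte sequence, while modified UTF-8 encodes each half of the surrogate pair as its own 3-byte sequence and writes NUL as the two bytes C0 80.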

Alan Snyder wrote:
> -16 -97 -115 -69

I'll drink to that!

s'marks