possible problem with JNI GetStringUTFChars

Tue Jan 29 23:24:06 UTC 2019

> In case you missed my previous message, there is a use case for file paths using macOS APIs.

Hm, Martin had mentioned that macOS uses something more restrictive than UTF-8. 
It seems to me that a filesystem-specific encoding is called for here.

> If you search the JDK repo for GetStringUTFChars, you will find several uses 
> that do not appear to involve serialization or data input/output.

To clarify, I was talking about uses of modified UTF-8 from *Java* code. The 
only places modified UTF-8 should appear in Java code are (I think) in 
serialization and in Data*Stream.

Native code needs to use modified UTF-8 because it's required for various JVM 
interfaces.

> It is not obvious whether these uses are correct or not.
> 
> Consider test/jdk/java/nio/channels/FileChannel/directio/libDirectIO.c
> 
> where GetStringUTFChars is used to convert a file path to pass to open().
> 
> At the very least, anyone using GetStringUTFChars as a short cut for true UTF-8 
> conversion should document why this short cut is correct, as is done in 
> awt_InputMethod, for example.

Correct. If there are places where GetStringUTFChars is used but real UTF-8 
is required, then that's quite possibly a bug.

The use in libDirectIO.c is certainly suspicious. Note that this is test code, 
and the only strings that are passed to it are temp file names from 
Files.createTempFile(). It seems likely that such strings contain only non-null 
BMP characters, for which modified UTF-8 and real UTF-8 are the same, so this is 
unlikely to be a problem in practice.
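
To illustrate where the two encodings agree and where they differ, here is a 
minimal sketch. It assumes that the bytes written by DataOutputStream.writeUTF 
(after its 2-byte length prefix) are the same modified UTF-8 that 
GetStringUTFChars would hand to native code. The class and method names are 
made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of the JDK.
class ModifiedUtf8Demo {
    // Modified UTF-8 as produced by DataOutputStream.writeUTF,
    // with the leading 2-byte length prefix stripped off.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = bytes.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Standard ("real") UTF-8 from the Charset API.
    static byte[] realUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static boolean sameInBothEncodings(String s) {
        return Arrays.equals(modifiedUtf8(s), realUtf8(s));
    }

    public static void main(String[] args) {
        // Typical temp file name: ASCII only, so both encodings agree.
        System.out.println(sameInBothEncodings("/tmp/test1234.tmp")); // true
        // Embedded NUL: modified UTF-8 uses two bytes (C0 80), real UTF-8 uses one (00).
        System.out.println(sameInBothEncodings("a\u0000b"));          // false
        // Supplementary character: 6 bytes in modified UTF-8, 4 bytes in real UTF-8.
        System.out.println(sameInBothEncodings("\uD83D\uDE00"));      // false
    }
}
```

So the path strings in the test would only hit a difference if a temp file 
name contained a NUL or a character outside the BMP.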

Still, you're right, if there are places where the JDK uses GetStringUTFChars 
where real UTF-8 is required, those would be bugs.

**

Anyway, I think it's unfortunate, but in the JNI world we're saddled with 
modified UTF-8. If you need real UTF-8, I recommend you do the conversion in 
Java before you get down to native. The reason is that there are some edge cases 
with codeset conversion (e.g., malformed sequences such as unpaired surrogates) 
that would require a bunch of additional facilities that aren't readily 
available from native code, as far as I know.
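
A rough sketch of what doing the conversion in Java might look like, using a 
CharsetEncoder configured to report malformed input instead of silently 
replacing it. The class name, the encode helper, and the commented-out native 
method are all hypothetical; the idea is just that the native side receives a 
byte[] of real UTF-8 rather than calling GetStringUTFChars on a String.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; not part of the JDK.
class StrictUtf8 {
    // Hypothetical native entry point that would take real UTF-8 bytes.
    // The native side would still need to add a terminating NUL itself.
    // static native int openPath(byte[] utf8Path);

    // Encodes to real UTF-8 and rejects malformed input such as
    // unpaired surrogates, rather than substituting a replacement byte.
    static byte[] encode(String s) {
        try {
            ByteBuffer buf = StandardCharsets.UTF_8.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(s));
            byte[] out = new byte[buf.remaining()];
            buf.get(out);
            return out;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("not encodable as UTF-8", e);
        }
    }
}
```

By contrast, String.getBytes(StandardCharsets.UTF_8) would replace an unpaired 
surrogate with a substitute byte instead of failing, which may or may not be 
what the caller wants. Either way the policy decision is made in Java, where 
the encoder facilities are available.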

s'marks

> 
>    Alan
> 
> 
> 
>> On Jan 28, 2019, at 2:10 PM, Stuart Marks <stuart.marks at oracle.com 
>> <mailto:stuart.marks at oracle.com>> wrote:
>>
>> (From Java code, the Charset encoders/decoders handle real UTF-8, which seems 
>> to cover most cases. Modified UTF-8 occurs only within serialization and 
>> Data{Input,Output}Stream.)
>