RFR: 8195129: System.load() fails to load from unicode paths [v3]

Fri Jun 4 17:35:58 UTC 2021

On Fri, 4 Jun 2021 14:00:25 GMT, Maxim Kartashev <github.com+28651297+mkartashev at openjdk.org> wrote:

>> Not an expert, but my understanding is that the VM only deals with modified UTF-8, as does JNI. So the incoming string should be modified UTF-8 IMO and then converted to UTF-16.
>> 
>> That said, this is shared code being modified on the JDK side so you can't just change the type of string being passed in without updating all the implementations of os::dll_load to support that!
>
> I think we need to establish some common ground before proceeding further with this fix. It's a bit of a long read; please, bear with me.
> 
> The path name starts its life as a `jstring` in `Java_jdk_internal_loader_NativeLibraries_load()`; its encoding is irrelevant at this point.
> 
> Next, the name has to be passed down to `JVM_LoadLibrary()`, which takes `char*`. So we need to convert from `jstring` to `char*` (point (a)). Following that, `os::dll_load()`, which actually performs the loading in a platform-specific manner, also receives `char*`. All platform implementations of `os::dll_load()` pass the path name down to their respective platform's APIs unmodified, but I think that's just incidental, and here we have another possible point of conversion (point (b)). Other consumers of the path name are exception (c) and logging (d) messages; they also take `char*`, but potentially in a different encoding.
> 
> Let me try to enumerate all conceivably valid conversions for `JVM_LoadLibrary()` consumption (point (a)):
> 1. jstring -> platform-specific encoding (status quo meaning possibly lossy encoding on Windows and UTF-8 elsewhere AFAICT),
> 2. jstring -> modified UTF-8,
> 3. jstring -> UTF-8.
> 
> This bug [8195129](https://bugs.openjdk.java.net/browse/JDK-8195129) occurs because conversion (1) may lose information on Windows if the platform encoding is not UTF-8 (which is often, or even always, the case). So that's a no-go and we are left with either (2) or (3).
> 
> On macOS and Linux, the "platform" encoding already is UTF-8, and since all the platform APIs happily consume UTF-8, no further conversion is necessary (neither for actual library loading nor for log or exception messages; the latter have to convert to UTF-16, but do that under the hood).
> 
> On Windows, we require at least these variants of the path name:
> 1. UTF-16 for library loading (Unicode Windows API),
> 2. "platform" encoding for logging (yes, losing information here, but that's tolerable),
> 3. "platform" (lossy) or UTF-8 (lossless) encoding for exception messages (prefer lossless).
> 
> This is what's behind my choice of UTF-8 for the path name encoding as it gets passed down to `JVM_LoadLibrary()`. We can go with modified UTF-8, of course, in which case all platforms - not just Windows - will have to do the conversion on their own, losing the benefit of the knowledge about the original string encoding (the String.coder field of jstring).

I am hesitant to change the JVM interface from modified UTF-8 to standard UTF-8, as it would be the only location in the JNI/JVM interface that uses standard UTF-8. Instead, I would implement `convert_UTF8_to_UTF16`, or rather `convert_mUTF8_to_UTF16`, with fairly simple arithmetic logic.
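
To give a rough idea of the arithmetic involved, here is a minimal standalone sketch. It is not HotSpot code: the signature, the use of `std::u16string`, and the error handling are assumptions made purely for illustration. It relies on the fact that a modified UTF-8 sequence is at most three bytes long and always maps to exactly one UTF-16 code unit (supplementary characters arrive as two separately encoded surrogates, and U+0000 as `0xC0 0x80`).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (illustration only, not the HotSpot implementation).
// Decodes a NUL-terminated modified UTF-8 string into UTF-16 code units.
// Returns false on malformed input, including standard 4-byte UTF-8 forms,
// which modified UTF-8 never uses.
bool convert_mUTF8_to_UTF16(const char* src, std::u16string& dst) {
  assert(src != nullptr);
  dst.clear();
  const unsigned char* p = reinterpret_cast<const unsigned char*>(src);
  while (*p != 0) {
    const unsigned char b0 = *p++;
    if (b0 < 0x80) {
      // 1 byte: 0xxxxxxx
      dst.push_back(static_cast<char16_t>(b0));
    } else if ((b0 & 0xE0) == 0xC0) {
      // 2 bytes: 110xxxxx 10xxxxxx (also covers 0xC0 0x80 for U+0000)
      const unsigned char b1 = p[0];
      if ((b1 & 0xC0) != 0x80) return false;  // also catches early NUL
      p += 1;
      dst.push_back(static_cast<char16_t>(((b0 & 0x1F) << 6) | (b1 & 0x3F)));
    } else if ((b0 & 0xF0) == 0xE0) {
      // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx (surrogates are encoded this way)
      const unsigned char b1 = p[0];
      if ((b1 & 0xC0) != 0x80) return false;  // checked before reading p[1]
      const unsigned char b2 = p[1];
      if ((b2 & 0xC0) != 0x80) return false;
      p += 2;
      dst.push_back(static_cast<char16_t>(
          ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)));
    } else {
      // Stray continuation byte or a 4-byte lead byte: not modified UTF-8.
      return false;
    }
  }
  return true;
}
```

On Windows, the resulting UTF-16 buffer could then be handed to the wide-character loading API, while the original modified UTF-8 `char*` stays available for the other consumers.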

-------------

PR: https://git.openjdk.java.net/jdk/pull/4169