RFR: 8195129: System.load() fails to load from unicode paths [v3]

Mon Jun 7 11:03:04 UTC 2021

On Sun, 6 Jun 2021 22:25:44 GMT, David Holmes <dholmes at openjdk.org> wrote:

>> I think we need to establish some common ground before proceeding further with this fix. It's a bit of a long read; please, bear with me.
>> 
>> The path name starts its life as a `jstring` in `Java_jdk_internal_loader_NativeLibraries_load()`, its encoding is irrelevant at this point.
>> 
>> Next, the name has to be passed down to `JVM_LoadLibrary()` that takes `char*`. So we need to convert form `jstring` to `char*` (point (a)). Following that, `os::dll_load()` that actually performs loading in a platform-specific manner also receives `char*`. All platform implementations of `os::dll_load()` pass the path name down to their respective platform's APIs unmodified, but I think that's just incidental and here we have another possible point of conversion (point (b)). Other consumers of the path name are exception(c) and logging(d) messages; they also take `char*`, but potentially of a different encoding.
>> 
>> Let me try to enumerate all conceivably valid conversions for `JVM_LoadLibrary()` consumption (point (a)):
>> 1. jstring -> platform-specific encoding (status quo meaning possibly lossy encoding on Windows and UTF-8 elsewhere AFAICT),
>> 2. jstring -> modified UTF-8,
>> 3. jstring -> UTF-8.
>> 
>> This bug [8195129](https://bugs.openjdk.java.net/browse/JDK-8195129) occurs because conversion (1) may loose information on Windows if the platform encoding happens to be NOT UTF-8 (which it often - or even always - is). So that's a no-go and we are left with either (2) or (3).
>> 
>> On MacOS and Linux, "platform" encoding already is UTF-8 and since all the platform APIs happily consume UTF-8, no further conversion is necessary (neither for actual library loading, nor for log or exception messages; the latter have to convert to UTF-16, but do that under the hood).
>> 
>> On Windows, we require at least these variants of the path name:
>> 1. UTF16 for library loading (Unicode Windows API),
>> 2. "platform" encoding for logging (yes, loosing information here, but that's tolerable),
>> 3. "platform" (lossy) or UTF8 (lossless) encoding for exception messages (prefer lossless).
>> 
>> This is what's behind my choice of UTF-8 for the path name encoding as it gets passed down to `JVM_LoadLibrary()`. We can go with modified UTF-8, of course, in which case all platforms - not just Windows - will have to do the conversion on their own, loosing the benefit of the knowledge about the original string encoding (the String.coder field of jstring).
>
> @mkartashev  thank you for the detailed explanation.
> 
> It is not clear to me that the JDK's conformance to being a Unicode application has significantly changed since the evaluation of JDK-8017274 - @naotoj  can you comment on that and related discussion from the CCC for JDK-4958170 ? In particular I'm not sure that using the platform encoding is wrong, nor how we can have a path that cannot be represented by the platform encoding?
> 
> Not being an expert in this area I cannot evaluate the affects of these shared code changes on other platforms, and so am reluctant to introduce any change that affects any non-Windows platforms. Also the JVM and JNI work with modified-UTF8 so I do not think we should diverge from that.
> I would hate to see windows specific code introduced into the JDK or JVM's shared code for these APIs, but that may be the only choice to avoid potential disruption to other platforms. Though perhaps we could push the initial conversion down into the JVM?

> I think I am hesitant to change the JVM interface from modified UTF-8 to standard UTF-8, 

AFAICT all platforms except Windows already use standard UTF-8 on that path (from `Java_jdk_internal_loader_NativeLibraries_load()` to `JVM_LoadLibrary()`) because the "platform" encoding for those happens to be "UTF-8". So at the current stage this patch actually maintains status quo for all platforms except Windows, the only platform where the bug exists.

But I am not against changing the encoding to modified UTF-8 and updating os::dll_load() for all platforms. Just wanted to have some consensus before proceeding with that change.

-------------

PR: https://git.openjdk.java.net/jdk/pull/4169