RFR: 8195129: System.load() fails to load from unicode paths [v3]
Naoto Sato
naoto at openjdk.java.net
Mon Jun 7 18:49:20 UTC 2021
On Sun, 6 Jun 2021 22:25:44 GMT, David Holmes <dholmes at openjdk.org> wrote:
>> I think we need to establish some common ground before proceeding further with this fix. It's a bit of a long read; please, bear with me.
>>
>> The path name starts its life as a `jstring` in `Java_jdk_internal_loader_NativeLibraries_load()`, its encoding is irrelevant at this point.
>>
>> Next, the name has to be passed down to `JVM_LoadLibrary()` that takes `char*`. So we need to convert form `jstring` to `char*` (point (a)). Following that, `os::dll_load()` that actually performs loading in a platform-specific manner also receives `char*`. All platform implementations of `os::dll_load()` pass the path name down to their respective platform's APIs unmodified, but I think that's just incidental and here we have another possible point of conversion (point (b)). Other consumers of the path name are exception(c) and logging(d) messages; they also take `char*`, but potentially of a different encoding.
>>
>> Let me try to enumerate all conceivably valid conversions for `JVM_LoadLibrary()` consumption (point (a)):
>> 1. jstring -> platform-specific encoding (status quo meaning possibly lossy encoding on Windows and UTF-8 elsewhere AFAICT),
>> 2. jstring -> modified UTF-8,
>> 3. jstring -> UTF-8.
>>
>> This bug [8195129](https://bugs.openjdk.java.net/browse/JDK-8195129) occurs because conversion (1) may loose information on Windows if the platform encoding happens to be NOT UTF-8 (which it often - or even always - is). So that's a no-go and we are left with either (2) or (3).
>>
>> On MacOS and Linux, "platform" encoding already is UTF-8 and since all the platform APIs happily consume UTF-8, no further conversion is necessary (neither for actual library loading, nor for log or exception messages; the latter have to convert to UTF-16, but do that under the hood).
>>
>> On Windows, we require at least these variants of the path name:
>> 1. UTF16 for library loading (Unicode Windows API),
>> 2. "platform" encoding for logging (yes, loosing information here, but that's tolerable),
>> 3. "platform" (lossy) or UTF8 (lossless) encoding for exception messages (prefer lossless).
>>
>> This is what's behind my choice of UTF-8 for the path name encoding as it gets passed down to `JVM_LoadLibrary()`. We can go with modified UTF-8, of course, in which case all platforms - not just Windows - will have to do the conversion on their own, loosing the benefit of the knowledge about the original string encoding (the String.coder field of jstring).
>
> @mkartashev thank you for the detailed explanation.
>
> It is not clear to me that the JDK's conformance to being a Unicode application has significantly changed since the evaluation of JDK-8017274 - @naotoj can you comment on that and related discussion from the CCC for JDK-4958170 ? In particular I'm not sure that using the platform encoding is wrong, nor how we can have a path that cannot be represented by the platform encoding?
>
> Not being an expert in this area I cannot evaluate the affects of these shared code changes on other platforms, and so am reluctant to introduce any change that affects any non-Windows platforms. Also the JVM and JNI work with modified-UTF8 so I do not think we should diverge from that.
> I would hate to see windows specific code introduced into the JDK or JVM's shared code for these APIs, but that may be the only choice to avoid potential disruption to other platforms. Though perhaps we could push the initial conversion down into the JVM?
@dholmes-ora Sorry, I don't think anything has changed as to the encoding as of JDK-8017274. For some reason, I had the impression that JVM_LoadLibrary() accepts UTF-8 (either modified or standard), but that was not correct. It is using the platform encoded string for the pathname.
@mkartashev As you mentioned in another comment, the only way to fix this issue is to pass UTF-8 down to JVM_LoadLibray, but I don't think it is feasible. One reason is the effort is too great, and the other is that all VM implementations would need to be modified.
-------------
PR: https://git.openjdk.java.net/jdk/pull/4169
More information about the core-libs-dev
mailing list