RFR: 8240197: Cannot start JVM when $JAVA_HOME includes CJK characters

Fri Feb 28 07:39:08 UTC 2020

Hi Thomas,

Thanks for your review!
I refactored entire of this function as result :)

Please review again this webrev:

   http://cr.openjdk.java.net/~ysuenaga/JDK-8240197/webrev.01/

On 2020/02/28 14:43, Thomas Stüfe wrote:
> Hi Yasumasa,
> 
> thanks for fixing this. CC'ing Ralf since this was his change.
> 
> -----
> 
> Not sure MB_ERR_INVALID_CHARS with a subsequent return NULL is a good option for conversion errors? Would it not better to use the default behavior of replacing invalid code points with U+FFFD?

I think we should use MB_ERR_INVALID_CHARS at this point because invalid chars should not be used in the path.
If invalid char(s) are converted to special char (e.g. U+FFFD), it is not mapped to original path.

> -----
> 
> Can the result buffer overflow? UTF16 is a variable sized char set, code points can overflow into the upper plane and use 32bit. Result buffer size is calculated in terms of sizeof(wchar_t). Not sure what size wchar_t is; if it is 16bit the result buffer could be too small. If it is 32bit I'd be really confused too :) since we would have had 32bit-wide characters before and now we deal with UTF16?
> 
> Should it be too small, there are two solutions:
> 
> 1) either over-allocate for the worst case: for each input char allocate space for two 16bit code points.
> 2) or, call MultiByteToWideChar twice, first time with 0 as output buffer size. Then, function just returns output buffer size, which you can allocate and repeat the call.
> 
> I would at least like to see an explicit assert as toward (sizeof(uint_16) * 2 * num output chars) < result buffer size.

We can get required buffer size when we pass zero to cchWideChar.

   MultiByteToWideChar (Microsoft Docs)
     https://docs.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-multibytetowidechar

New webrev passed zero to cchWideChar at first, and it would allocate buffer with its value.
So buffer overflow would not occur.

> -----
> 
> Come to think of it, the intermixing of wchar_t* with LPWSTR is a bit confusing. It feeds a wchar_t* as result buffer to MultiByteToWideChar. Can you make path_start a LPWSTR and cast it explicitely?

I changed wchar_t to WCHAR, and wchar_t* to LPWSTR.

> ----
> 
> I assume you call ZeroMemory to make the result string zero terminated? A cheaper and simpler way would be to either pass -1 as input buffer length or to fix up the terminating zero after the call. I'd probably do the former.

If we pass -1 to cbMultiByte, MultiByteToWideChar() processes the entire input string, including the terminating null character (See Microsoft Doc).
So I use it in new webrev.

> ----
> 
> Not a must do, but regression tests would be nice. Easiest thing I could think of would be testing wide_abs_unc_path() in a gtest with some CJK literals. Now that I think of it, regression tests may be good for this function in general.
> 
> For that wide_abs_unc_path would have to be exported (the static removed) and declared as a prototype in the associated gtest, if we do not want to pollute a header with it.

It is nice idea, but MultiByteToWideChar() would perform with available locale. If we check wide_abs_unc_path() with Japanese characters, Japanese language need to be installed (See Microsoft Doc).

So it is difficult to test on all platform.

Thanks,

Yasumasa

> -----
> 
> Cheers, Thomas
> 
> 
> On Fri, Feb 28, 2020 at 2:55 AM Yasumasa Suenaga <suenaga at oss.nttdata.com <mailto:suenaga at oss.nttdata.com>> wrote:
> 
>     Hi all,
> 
>     Please review this change:
> 
>         JBS: https://bugs.openjdk.java.net/browse/JDK-8240197
>         webrev: http://cr.openjdk.java.net/~ysuenaga/JDK-8240197/webrev.00/
> 
>     JVM cannot start as below when CJK characters are included in $JAVA_HOME.
> 
>     ```
>     PS > .\build\windows-x86_64-server-fastdebug\images\jdk日本語\bin\java.exe --version
>     Error occurred during initialization of VM
>     Failed setting boot class path.
>     ```
> 
>     I think this bug has been occurred since JDK-8191521.
>     It uses mbstowcs_s() to convert char* to wchar_t*. But Windows API for wchar_t* (they are named *W()) requires Unicode chars for their arguments. mbstowcs_s() would not convert to Unicode. We need to use MultiByteToWideChar().
> 
>     This issue has been also confirmed in AdoptOpenJDK 11.0.6 [1]. So we also need to change jdk11u.
> 
>     I tested this change on submit repo, then it was failed (mach5-one-ysuenaga-JDK-8240197-20200228-0138-9049982).
>     However the error was not caused by this change because it was occurred in all platforms in spite of I changed for Windows only.
>     (Of course, it works well on my Windows 10 laptop.)
> 
>     I'm happy if you help to investigate cause of error.
> 
> 
>     Thanks,
> 
>     Yasumasa
> 
> 
>     [1] https://github.com/AdoptOpenJDK/openjdk-build/issues/1496
>