[foreign-memaccess+abi] RFR: 8268230: Foreign Linker API & Windows user32/kernel32: String conversion seems broken [v3]

Tue Jun 15 15:33:20 UTC 2021

On Tue, 15 Jun 2021 15:12:16 GMT, Jorn Vernee <jvernee at openjdk.org> wrote:

>> The problem is that we only add a single 0 byte as a null terminator, regardless of the charset used. For wider char sets, more 0 bytes need to be added. For instance, for UTF_16LE two 0 bytes need to be added.
>> 
>> This patch fixes the issue by adding the null terminator to the Java string, and only then encoding it as a `byte[]`.
>
> Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Fix tests after merge

While looking into `toJavaString`, the problems turned out to be much greater. The core issue is that there is no `strlen` function that works for an arbitrary `Charset`, so we can only support the subset of `Charset`s for which we know our current way of determining the native string's length works.
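
As a rough illustration of the `strlen` problem (this is not code from the patch; the class and method names below are made up for the example), the following sketch uses a plain `byte[]` as a stand-in for native memory. It appends the terminator to the Java string before encoding, and then runs a `strlen`-style scan for the first single 0 byte:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; a byte[] stands in for native memory.
final class NaiveStrlenDemo {

    // Mimics C strlen: counts the bytes before the first single 0 byte.
    static int naiveStrlen(byte[] bytes) {
        int i = 0;
        while (i < bytes.length && bytes[i] != 0) {
            i++;
        }
        return i;
    }

    // Appends the terminator to the Java string and only then encodes it,
    // so the terminator is as wide as the charset requires.
    static byte[] encodeWithTerminator(String s, Charset cs) {
        return (s + "\0").getBytes(cs);
    }

    public static void main(String[] args) {
        byte[] utf8 = encodeWithTerminator("abc", StandardCharsets.UTF_8);
        byte[] utf16 = encodeWithTerminator("abc", StandardCharsets.UTF_16LE);

        // UTF-8: 61 62 63 00, the scan stops at the terminator.
        System.out.println(naiveStrlen(utf8));  // prints 3
        // UTF-16LE: 61 00 62 00 63 00 00 00, the scan stops inside 'a'.
        System.out.println(naiveStrlen(utf16)); // prints 1
    }
}
```

For UTF-16LE the byte-wise scan stops after the first byte, because the high byte of every ASCII character is 0. A length function that is correct for one charset is therefore not necessarily correct for another.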

We discussed this at length among the team members, and arrived at the intermediate conclusion to only support the 'platform native' Charset. This turns out to be tricky as well, though, because this charset, which is also called the 'execution character set' in C lingo, is determined by a compiler setting at build time of the native code. With GCC and Clang the default is UTF-8, while on Windows it depends on the current code page. While there is a way to get the current code page of the runtime system and determine the character set from that, we would not be able to avoid issues with code page mismatches between the build environment and the runtime environment on Windows.

While it would still technically be possible to support different character sets as long as they work with `strlen`, at present there is no way to detect this for an arbitrary character set. So, if we kept the `Charset` parameter, we would not be able to sanity check it, which doesn't seem great either.

As a result of all this, for now we have arrived at the decision to only support the UTF-8 Charset for the toCString and toJavaString methods, and to leave encoding and decoding using other character sets (including determining the length of a native string) to be implemented manually.
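
To give an idea of what such manual decoding could look like, here is a minimal sketch for UTF-16LE. It is only an illustration: it works on a `byte[]` that stands in for the native memory rather than on the foreign memory API, and the class and method names are hypothetical. The length is determined by scanning for a 16-bit zero code unit instead of a single 0 byte:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; a byte[] stands in for native memory.
final class Utf16ManualDecode {

    // Returns the byte offset of the first 16-bit zero code unit,
    // checking only aligned (even) offsets, or -1 if there is none.
    static int utf16Strlen(byte[] mem) {
        for (int i = 0; i + 1 < mem.length; i += 2) {
            if (mem[i] == 0 && mem[i + 1] == 0) {
                return i;
            }
        }
        return -1;
    }

    // Decodes the bytes before the terminator as UTF-16LE.
    static String decodeUtf16Le(byte[] mem) {
        int len = utf16Strlen(mem);
        if (len < 0) {
            throw new IllegalArgumentException("no UTF-16 null terminator found");
        }
        return new String(mem, 0, len, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        byte[] mem = "snowman \u26C4\0".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(decodeUtf16Le(mem).equals("snowman \u26C4")); // prints true
    }
}
```

Scanning in aligned two-byte steps matters here: a character such as U+0100 encodes as `00 01` in UTF-16LE, so a scan at arbitrary byte offsets could mistake adjacent bytes of two different code units for a terminator.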

I've updated this PR to remove the overloads that accept a `Charset`, and updated the implementation to always use UTF-8. I've also added several test cases that cover Unicode characters which are encoded with different numbers of bytes in UTF-8.
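
For reference, the byte counts such test cases rely on can be reproduced with plain `String.getBytes`. The sketch below is not the test itself; the class and method names are hypothetical, and it assumes that the expected size counts the UTF-8 bytes plus one byte for the null terminator:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the patch or the test.
final class Utf8SizeDemo {

    // Number of bytes a null-terminated UTF-8 C string needs:
    // the encoded bytes plus one terminating 0 byte (an assumption
    // about how the expected sizes are counted).
    static int cStringSize(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length + 1;
    }

    public static void main(String[] args) {
        System.out.println(cStringSize("yen \u00A5"));            // prints 7
        System.out.println(cStringSize("snowman \u26C4"));        // prints 12
        System.out.println(cStringSize("rainbow \uD83C\uDF08"));  // prints 13
    }
}
```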

Note that the primary focus of this patch is stabilization (for JDK 17 as well). Perhaps in the future these APIs could be expanded to support more character sets again.

test/jdk/java/foreign/TestStringEncoding.java line 60:

> 58:             { "yen \u00A5",            7 }, // in UTF-8 2 bytes: 0xC2 0xA5
> 59:             { "snowman \u26C4",       12 }, // in UTF-8 three bytes: 0xE2 0x9B 0x84
> 60:             { "rainbow \uD83C\uDF08", 13 }  // int UTF-8 four bytes: 0xF0 0x9F 0x8C 0x88

Suggestion:

            { "rainbow \uD83C\uDF08", 13 }  // in UTF-8 four bytes: 0xF0 0x9F 0x8C 0x88

-------------

PR: https://git.openjdk.java.net/panama-foreign/pull/554