[foreign-memaccess+abi] RFR: 8308858: FFM API and strings [v3]

Tue Jun 13 17:16:34 UTC 2023

On Tue, 13 Jun 2023 16:59:38 GMT, Maurizio Cimadamore <mcimadamore at openjdk.org> wrote:

>> This patch is an attempt to generalize string support in FFM API. Currently we only support UTF-8 encoding for strings. While this is a sensible default, on Windows strings are often encoded as wide strings, using UTF-16, so it would be nice to support more charsets.
>> 
>> As a result, this patch adds a Charset-accepting variant to `MemorySegment::getString`, `MemorySegment::setString` and `MemorySegment::allocateString` (the methods have also been renamed to drop their `Utf8` suffix).
>> 
>> However, not all charsets can be supported. In fact, we only support the charsets defined in the `StandardCharsets` class. While we tried to use charset encoder/decoder to support *any* charset, it seems like the charset interface is too general, and there is no guarantee that what comes out would be interoperable with a null-terminated C string.
>> 
>> In C (and C++), strings are either expressed as `char*` or `wchar_t*`. In the former case, no matter the encoding, the terminator char is expected to be `0x00` (i.e. the size of a `char` value), whereas in the latter case the terminator is expected to be `0x0000` or `0x00000000` (the size of a `wchar_t` value, depending on platform).
>> 
>> Moreover, C more or less requires that a string, whether expressed as `char*` or `wchar_t*`, cannot contain any element in the array that is a terminator char. That is, in case of extended representations, surrogates must *not* use the reserved terminator values.
>> 
>> There is really no way to map these restrictions to Java charsets. A Java charset can do pretty much anything it wants, even adding extra characters at the end of a string (past the terminator) using the encoder's `flush` method. As a result, we can only really work with charsets we know of (and the list in `StandardCharsets` seems a good starting set).
>> 
>> The subject of strings is quite tricky, and we noticed that other frameworks tend to get it [wrong](https://github.com/jnr/jnr-ffi/blob/master/src/main/java/jnr/ffi/util/BufferUtil.java#L63), often assuming that a wide string only has a single byte terminator.
>> 
>> An alternative would be to do what LWJGL does, and provide multiple methods, for different encodings - e.g.
>> 
>> https://javadoc.lwjgl.org/org/lwjgl/system/MemoryUtil.html#memASCII(long)
>> https://javadoc.lwjgl.org/org/lwjgl/system/MemoryUtil.html#memUTF16(long)
>> https://javadoc.lwjgl.org/org/lwjgl/system/MemoryUtil.html#memUTF8(long)
>> 
>> Anyway, I thought it would be a good idea to propose this PR and see what kind ...
>
> Maurizio Cimadamore has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains six commits:
> 
>  - Merge branch 'foreign-memaccess+abi' into strings
>  - * make UTF-8 the default
>    * make Charset parameter be the last
>  - Merge branch 'master' into strings
>  - Address review comments
>  - Remove unused code
>    Cleanup code
>  - Initial push

Following some offline discussions, I have decided to:
* Keep the general shape of the PR, as that gives us the most flexibility (while `getString`/`getWideString` is a possibility, we did not like that `getString` would need platform-dependent semantics, based on the selected native encoder, which is something the JDK as a whole is trying to get away from)
* Use UTF-8 instead of `Charset.defaultCharset` for `getString`/`setString`/`allocateString`
* Tweak the API so that the `Charset` parameter comes last

We will consider whether to add support for UTF-32 in the `StandardCharsets` class as a separate change (which would allow wide strings on *nix platforms).
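
To make the terminator rules quoted above a bit more concrete, here is a minimal sketch in plain Java that works on byte arrays only and does not use the FFM API. The class `TerminatorSketch` and its methods `terminatorSize`, `toNative`, `strlen` and `fromNative` are hypothetical names made up for illustration; they are not part of this PR. The 4-byte UTF-32 case is an assumption based on the charset name, since UTF-32 is not in `StandardCharsets` at this point.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical helper, for illustration only: not the code in this PR.
final class TerminatorSketch {

    // Width in bytes of the NUL terminator implied by the charset's code unit.
    static int terminatorSize(Charset cs) {
        String name = cs.name();
        if (name.equals("US-ASCII") || name.equals("ISO-8859-1") || name.equals("UTF-8")) {
            return 1; // char* strings: terminator is 0x00
        }
        if (name.startsWith("UTF-16")) {
            return 2; // 2-byte wchar_t: terminator is 0x0000
        }
        if (name.startsWith("UTF-32")) {
            return 4; // assumption: 4-byte wchar_t, terminator is 0x00000000
        }
        throw new IllegalArgumentException("Unsupported charset: " + cs);
    }

    // Encodes the string and appends a terminator of the right width.
    static byte[] toNative(String s, Charset cs) {
        if (s.indexOf('\0') >= 0) {
            // A C string cannot contain the terminator as one of its elements.
            throw new IllegalArgumentException("String contains an embedded NUL");
        }
        byte[] encoded = s.getBytes(cs);
        // Arrays.copyOf pads with zero bytes, which form the terminator.
        return Arrays.copyOf(encoded, encoded.length + terminatorSize(cs));
    }

    // Byte length up to (not including) the terminator, scanning whole code units.
    static int strlen(byte[] buf, Charset cs) {
        int width = terminatorSize(cs);
        for (int i = 0; i + width <= buf.length; i += width) {
            boolean allZero = true;
            for (int j = 0; j < width; j++) {
                if (buf[i + j] != 0) {
                    allZero = false;
                    break;
                }
            }
            if (allZero) {
                return i;
            }
        }
        throw new IllegalArgumentException("No terminator found");
    }

    // Decodes the bytes that precede the terminator.
    static String fromNative(byte[] buf, Charset cs) {
        return new String(buf, 0, strlen(buf, cs), cs);
    }
}
```

With UTF-16LE, `"hi"` encodes to the bytes `68 00 69 00` followed by the two-byte terminator `00 00`. A scan that looks for a single zero byte would stop at index 1, in the middle of the first character, which is the mistake mentioned above; scanning in steps of the code unit width avoids it.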

-------------

PR Comment: https://git.openjdk.org/panama-foreign/pull/836#issuecomment-1589714722