[foreign-memaccess+abi] RFR: 8308858: FFM API and strings [v2]

Wed Jun 7 15:33:12 UTC 2023

On Wed, 7 Jun 2023 14:56:17 GMT, Maurizio Cimadamore <mcimadamore at openjdk.org> wrote:

>> This patch is an attempt to generalize string support in FFM API. Currently we only support UTF-8 encoding for strings. While this is a sensible default, on Windows strings are often encoded as wide strings, using UTF-16, so it would be nice to support more charsets.
>> 
>> As a result, this patch adds a Charset-accepting variant to `MemorySegment::getString`, `MemorySegment::setString` and `MemorySegment::allocateString` (the methods have also been renamed to drop their `Utf8` suffix).
>> 
>> However, not all charsets can be supported. In fact, we only support the charsets defined in the `StandardCharsets` class. While we tried to use charset encoder/decoder to support *any* charset, it seems like the charset interface is too general, and there is no guarantee that what comes out would be interoperable with a null-terminated C string.
>> 
>> In C (and C++), strings are either expressed as `char*` or `wchar_t*`. In the former case, no matter the encoding, the terminator char is expected to be `0x00` (i.e. the size of a `char` value), whereas in the latter case the terminator is expected to be `0x0000` or `0x00000000` (the size of a `wchar_t` value, which is 2 or 4 bytes depending on platform).
>> 
>> Moreover, C more or less requires that a string, whether expressed as `char*` or `wchar_t*`, cannot contain any element in the array that is a terminator char. That is, in case of extended representations, surrogates must *not* use the reserved terminator values.
>> 
>> There is really no way to map these restrictions to Java charsets. A Java charset can do pretty much anything it wants, even adding extra characters at the end of a string (past the terminator) using the encoder's `flush` method. As a result, we can only really work with charsets we know of (and the list in `StandardCharsets` seems a good starting set).
>> 
>> The subject of strings is quite tricky, and we noticed that other frameworks tend to get it [wrong](https://github.com/jnr/jnr-ffi/blob/master/src/main/java/jnr/ffi/util/BufferUtil.java#L63), often assuming that a wide string only has a single byte terminator.
>> 
>> An alternative would be to do what LWJGL does, and provide multiple methods, for different encodings - e.g.
>> 
>> https://javadoc.lwjgl.org/org/lwjgl/system/MemoryUtil.html#memASCII(long)
>> https://javadoc.lwjgl.org/org/lwjgl/system/MemoryUtil.html#memUTF16(long)
>> https://javadoc.lwjgl.org/org/lwjgl/system/MemoryUtil.html#memUTF8(long)
>> 
>> Anyway, I thought it would be a good idea to propose this PR and see what kind ...
>
> Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Address review comments

> > Moreover, C more or less requires that a string, whether expressed as char* or wchar_t* cannot contain any element in the array that is a terminator char.
> 
> Why not then use two methods:
> 
>     * get/set `CString`
> 
>     * get/set `WString`
> 
> 
> that both accept a generic charset and let the user choose a suitable one, as the _name_ of the charset does not really matter much (as long as both sides use the same byte -> char mapping).

Not sure about this. Yes, we could have a getSingleByteString and a getWideString. But if we also support charsets, then we have to worry about the compatibility of the provided charset with the given string method. E.g. what if I pass the UTF-8 charset to getWideString? Or the UTF-16 charset to getSingleByteString? The approach described here has a single parameter, the charset, which controls everything else, including the terminator size.
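
To make the "single parameter" point concrete, here is a minimal sketch in plain Java. It is not the code in the patch: it works on a `byte[]` rather than a `MemorySegment`, and `terminatorSize`, `writeString` and `readString` are hypothetical helper names made up for illustration. The assumption is simply that the charset alone determines the terminator width, so there is no method/charset combination to get wrong.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class TerminatorSketch {

    // Hypothetical helper: terminator width in bytes, derived from the charset alone.
    // Only a known subset of StandardCharsets is accepted; anything else is rejected.
    public static int terminatorSize(Charset cs) {
        if (cs.equals(StandardCharsets.UTF_8)
                || cs.equals(StandardCharsets.ISO_8859_1)
                || cs.equals(StandardCharsets.US_ASCII)) {
            return 1;
        }
        if (cs.equals(StandardCharsets.UTF_16)
                || cs.equals(StandardCharsets.UTF_16LE)
                || cs.equals(StandardCharsets.UTF_16BE)) {
            return 2;
        }
        throw new IllegalArgumentException("Unsupported charset: " + cs);
    }

    // Encodes the string and appends a zero terminator of the width the charset requires.
    public static byte[] writeString(String s, Charset cs) {
        byte[] encoded = s.getBytes(cs);
        // Arrays.copyOf pads the extra bytes with zeros, which forms the terminator.
        return Arrays.copyOf(encoded, encoded.length + terminatorSize(cs));
    }

    // Scans in steps of the terminator width for an all-zero unit, then decodes what precedes it.
    public static String readString(byte[] bytes, Charset cs) {
        int width = terminatorSize(cs);
        for (int i = 0; i + width <= bytes.length; i += width) {
            boolean allZero = true;
            for (int j = 0; j < width; j++) {
                if (bytes[i + j] != 0) {
                    allZero = false;
                    break;
                }
            }
            if (allZero) {
                return new String(bytes, 0, i, cs);
            }
        }
        throw new IllegalArgumentException("No terminator found");
    }

    public static void main(String[] args) {
        byte[] wide = writeString("Hi", StandardCharsets.UTF_16LE);
        System.out.println(wide.length);                                  // prints 6
        System.out.println(readString(wide, StandardCharsets.UTF_16LE));  // prints Hi
        // Note that wide[1] is already 0x00: a reader that assumes a single-byte
        // terminator would stop after the first byte of 'H'.
    }
}
```

With separate single-byte and wide methods that also took a charset, a check equivalent to `terminatorSize` would still be needed to reject mismatched pairs, which is the extra failure mode I'd rather avoid.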

-------------

PR Comment: https://git.openjdk.org/panama-foreign/pull/836#issuecomment-1581065831