[foreign-memaccess+abi] RFR: 8308858: FFM API and strings [v2]

Thu Jun 8 10:17:15 UTC 2023

On Wed, 7 Jun 2023 14:56:17 GMT, Maurizio Cimadamore <mcimadamore at openjdk.org> wrote:

>> This patch is an attempt to generalize string support in FFM API. Currently we only support UTF-8 encoding for strings. While this is a sensible default, on Windows strings are often encoded as wide strings, using UTF-16, so it would be nice to support more charsets.
>> 
>> As a result, this patch adds a Charset-accepting variant to `MemorySegment::getString`, `MemorySegment::setString` and `MemorySegment::allocateString` (the methods have also been renamed to drop their `Utf8` suffix).
>> 
>> However, not all charsets can be supported. In fact, we only support the charsets defined in the `StandardCharsets` class. While we tried to use charset encoder/decoder to support *any* charset, it seems like the charset interface is too general, and there is no guarantee that what comes out would be interoperable with a null-terminated C string.
>> 
>> In C (and C++), strings are either expressed as `char*` or `wchar_t*`. In the former case, no matter the encoding, the terminator char is expected to be `0x00` (i.e. the size of a `char` value), whereas in the latter case the terminator is expected to be `0x0000` or `0x00000000` (the size of a `wchar_t` value, depending on platform).
>> 
>> Moreover, C more or less requires that a string, whether expressed as `char*` or `wchar_t*`, cannot contain any element in the array that is a terminator char. That is, in case of extended representations, surrogates must *not* use the reserved terminator values.
>> 
>> There is no way really to map these restrictions to Java charsets. A Java charset can do pretty much anything it wants, even adding extra characters at the end of a string (past the terminator) using the encoder's `flush` method. As a result, we can only really work with charsets we know of (and the list in `StandardCharsets` seems a good starting set).
>> 
>> The subject of strings is quite tricky, and we noticed that other frameworks tend to get it [wrong](https://github.com/jnr/jnr-ffi/blob/master/src/main/java/jnr/ffi/util/BufferUtil.java#L63), often assuming that a wide string only has a single byte terminator.
>> 
>> An alternative would be to do what LWJGL does, and provide multiple methods, for different encodings - e.g.
>> 
>> https://javadoc.lwjgl.org/org/lwjgl/system/MemoryUtil.html#memASCII(long)
>> https://javadoc.lwjgl.org/org/lwjgl/system/MemoryUtil.html#memUTF16(long)
>> https://javadoc.lwjgl.org/org/lwjgl/system/MemoryUtil.html#memUTF8(long)
>> 
>> Anyway, I thought it would be a good idea to propose this PR and see what kind ...
>
> Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Address review comments

ASCII and Windows-1252, at least, are quite common in C, so UTF-8 may be superior but is not always what is wanted? Besides that, there is also the "Pascal" string encoding, in 8-bit and 16-bit variants :-D
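
To make the terminator-width point from the quoted description concrete, here is a minimal sketch in plain Java. It does not use the FFM API at all, and `CStringSketch`, `terminatorSize`, `toCString` and `fromCString` are hypothetical names chosen for illustration, not anything from the patch. The charset-to-width mapping is an assumption covering only a few entries of `StandardCharsets`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class CStringSketch {

    // Terminator width in bytes for a handful of known charsets.
    // Assumption: anything not listed here is rejected rather than guessed.
    static int terminatorSize(Charset cs) {
        if (cs.equals(StandardCharsets.UTF_16LE) || cs.equals(StandardCharsets.UTF_16BE)) {
            return 2;
        }
        if (cs.equals(StandardCharsets.US_ASCII)
                || cs.equals(StandardCharsets.ISO_8859_1)
                || cs.equals(StandardCharsets.UTF_8)) {
            return 1;
        }
        throw new IllegalArgumentException("Unsupported charset: " + cs);
    }

    // Encode the string and append a zero terminator of the right width.
    static byte[] toCString(String s, Charset cs) {
        byte[] encoded = s.getBytes(cs);
        // Arrays.copyOf pads the extra bytes with zeros.
        return Arrays.copyOf(encoded, encoded.length + terminatorSize(cs));
    }

    // Scan one code unit at a time, so a lone 0x00 byte inside a
    // UTF-16 code unit is not mistaken for the terminator.
    static String fromCString(byte[] buf, Charset cs) {
        int unit = terminatorSize(cs);
        for (int i = 0; i + unit <= buf.length; i += unit) {
            boolean allZero = true;
            for (int j = 0; j < unit; j++) {
                if (buf[i + j] != 0) {
                    allZero = false;
                    break;
                }
            }
            if (allZero) {
                return new String(buf, 0, i, cs);
            }
        }
        throw new IllegalArgumentException("No terminator found");
    }

    public static void main(String[] args) {
        byte[] wide = toCString("Hi", StandardCharsets.UTF_16LE);
        System.out.println(Arrays.toString(wide));
        System.out.println(fromCString(wide, StandardCharsets.UTF_16LE));
    }
}
```

A byte-by-byte scan over the UTF-16LE buffer would stop at the second byte (the high byte of `H` is zero), which is the single-byte-terminator mistake mentioned in the quoted text. A single-byte encoding such as Windows-1252 would fall under the 1-byte case, but it is not in `StandardCharsets`, so this sketch leaves it out.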

-------------

PR Comment: https://git.openjdk.org/panama-foreign/pull/836#issuecomment-1582303321