[foreign-memaccess+abi] RFR: 8308858: FFM API and strings [v2]

Wed Jun 7 18:18:38 UTC 2023

On Wed, 7 Jun 2023 14:56:17 GMT, Maurizio Cimadamore <mcimadamore at openjdk.org> wrote:

>> This patch is an attempt to generalize string support in FFM API. Currently we only support UTF-8 encoding for strings. While this is a sensible default, on Windows strings are often encoded as wide strings, using UTF-16, so it would be nice to support more charsets.
>> 
>> As a result, this patch adds a Charset-accepting variant to `MemorySegment::getString`, `MemorySegment::setString` and `SegmentAllocator::allocateString` (the methods have also been renamed to drop their `Utf8` suffix).
>> 
>> However, not all charsets can be supported. In fact, we only support the charsets defined in the `StandardCharsets` class. While we tried to use charset encoder/decoder to support *any* charset, it seems like the charset interface is too general, and there is no guarantee that what will come out would be interoperable with a null-terminated C string.
>> 
>> In C (and C++), strings are either expressed as `char*` or `wchar_t*`. In the former case, no matter the encoding, the terminator char is expected to be `0x00` (i.e. the size of a `char` value), whereas in the latter case the terminator is expected to be `0x0000` or `0x00000000` (the size of a `wchar_t` value, depending on platform).
>> 
>> Moreover, C more or less requires that a string, whether expressed as `char*` or `wchar_t*`, cannot contain any element in the array that is a terminator char. That is, in case of extended representations, surrogates must *not* use the reserved terminator values.
>> 
>> There is no way really to map these restrictions to Java charsets. A Java charset can do pretty much anything it wants, even adding extra characters at the end of a string (past the terminator) using the encoder's `flush` method. As a result, we can only really work with charsets we know of (and the list in `StandardCharsets` seems a good starting set).
>> 
>> The subject of strings is quite tricky, and we noticed that other frameworks tend to get it [wrong](https://github.com/jnr/jnr-ffi/blob/master/src/main/java/jnr/ffi/util/BufferUtil.java#L63), often assuming that a wide string only has a single byte terminator.
>> 
>> An alternative would be to do what LWJGL does, and provide multiple methods, for different encodings - e.g.
>> 
>> https://javadoc.lwjgl.org/org/lwjgl/system/MemoryUtil.html#memASCII(long)
>> https://javadoc.lwjgl.org/org/lwjgl/system/MemoryUtil.html#memUTF16(long)
>> https://javadoc.lwjgl.org/org/lwjgl/system/MemoryUtil.html#memUTF8(long)
>> 
>> Anyway, I thought it would be a good idea to propose this PR and see what kind ...
>
> Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Address review comments

I think the second option (this PR) is the most attractive user model, since we do all the work, though doing so limits which Charsets we can support.

I think the third option is the most flexible if we want to support any Charset, but I'm a little disappointed that we end up with potential rough edges (who knows what kind of bug report we might get a few years down the line when somebody uses their custom charset). I also feel like we essentially kick the problem back to the user, since they now have to figure out the number of null terminator bytes. At that point, I think we're better off just letting the user implement the whole thing, since it's not that complicated (which has been the story so far for Charsets other than UTF-8). At the same time, we can keep the 'nice' API (this PR) for the Charsets that we know how to do all the work for.
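To make the "number of null terminator bytes" point concrete, here is a rough sketch of the lookup a user (or we) would need for the charsets in `StandardCharsets`. The class name `TerminatorWidth` and its method `of` are made up for illustration and are not part of the FFM API; the widths simply follow the `char*` vs. wide-string distinction from the PR description, and I'm only covering the six charsets that class has always defined.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only (not part of the FFM API).
final class TerminatorWidth {
    private TerminatorWidth() {}

    /** Returns how many zero bytes terminate a C string encoded in the given charset. */
    static int of(Charset cs) {
        // Single-byte code units: the terminator is one 0x00 byte.
        if (cs.equals(StandardCharsets.UTF_8)
                || cs.equals(StandardCharsets.ISO_8859_1)
                || cs.equals(StandardCharsets.US_ASCII)) {
            return 1;
        }
        // Two-byte code units: the terminator is 0x0000.
        if (cs.equals(StandardCharsets.UTF_16)
                || cs.equals(StandardCharsets.UTF_16LE)
                || cs.equals(StandardCharsets.UTF_16BE)) {
            return 2;
        }
        // Anything else: we cannot know what the encoder produces, so refuse.
        throw new UnsupportedOperationException("Unsupported charset: " + cs);
    }
}
```

For a custom charset there is nothing sensible to return here, which is exactly the rough edge I'm worried about.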

Maybe a fourth option is:
- avoid dealing with the terminator altogether. That would mean that, for `getString`, the user needs to give us an explicit size, in bytes, of the string. For `setString`, the user would have to insert the null terminator (whatever that may be) manually, using another call.

This is also kicking the problem back to the user.
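For illustration, the fourth option might look roughly like the sketch below. This is not the API in this PR: `ExplicitSizeStrings` and its two methods are hypothetical names, and I'm using a plain `ByteBuffer` as a stand-in for a memory segment so the example does not depend on any particular FFM build.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

// Hypothetical sketch of a terminator-agnostic API (not what this PR implements).
final class ExplicitSizeStrings {
    private ExplicitSizeStrings() {}

    /** Decodes exactly byteSize bytes starting at offset; no terminator scanning. */
    static String getString(ByteBuffer buf, int offset, int byteSize, Charset cs) {
        byte[] bytes = new byte[byteSize];
        ByteBuffer view = buf.duplicate(); // leave the caller's position untouched
        view.position(offset);
        view.get(bytes);
        return new String(bytes, cs);
    }

    /** Writes the encoded bytes at offset, without a terminator; returns the byte count. */
    static int setString(ByteBuffer buf, int offset, String s, Charset cs) {
        byte[] bytes = s.getBytes(cs);
        ByteBuffer view = buf.duplicate();
        view.position(offset);
        view.put(bytes);
        return bytes.length;
    }
}
```

With UTF-16LE, for example, the caller would call `setString`, then write two zero bytes at the returned offset, and later pass the byte count (excluding the terminator) back to `getString`. The caller has to know both the terminator width and the string length up front.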

Out of all of these options, I still prefer the approach taken by this PR.

-------------

PR Comment: https://git.openjdk.org/panama-foreign/pull/836#issuecomment-1581296408