[foreign-memaccess+abi] RFR: 8308858: FFM API and strings

Mon Jun 5 07:18:30 UTC 2023

On Fri, 2 Jun 2023 18:25:33 GMT, Maurizio Cimadamore <mcimadamore at openjdk.org> wrote:

> This patch is an attempt to generalize string support in FFM API. Currently we only support UTF-8 encoding for strings. While this is a sensible default, on Windows strings are often encoded as wide strings, using UTF-16, so it would be nice to support more charsets.
> 
> As a result, this patch adds a Charset-accepting variant to `MemorySegment::getString`, `MemorySegment::setString` and `SegmentAllocator::allocateString` (the methods have also been renamed to drop their `Utf8` suffix).
> 
> However, not all charsets can be supported. In fact, we only support the charsets defined in the `StandardCharsets` class. We tried to use charset encoders/decoders to support *any* charset, but the charset interface seems too general: there is no guarantee that the output would be interoperable with a null-terminated C string.
> 
> In C (and C++), strings are expressed as either `char*` or `wchar_t*`. In the former case, no matter the encoding, the terminator is expected to be `0x00` (i.e. the size of a `char` value), whereas in the latter case the terminator is expected to be `0x0000` or `0x00000000` (the size of a `wchar_t` value, which depends on the platform).
> 
> Moreover, C more or less requires that a string, whether expressed as `char*` or `wchar_t*`, cannot contain any array element that is a terminator. That is, in the case of extended representations, surrogates must *not* use the reserved terminator values.
> 
> There is no real way to map these restrictions to Java charsets. A Java charset can do pretty much anything it wants, even adding extra characters at the end of a string (past the terminator) using the encoder's `flush` method. As a result, we can only really work with charsets we know of (and the list in `StandardCharsets` seems a good starting set).
> 
> The subject of strings is quite tricky, and we noticed that other frameworks tend to get it [wrong](https://github.com/jnr/jnr-ffi/blob/master/src/main/java/jnr/ffi/util/BufferUtil.java#L63), often assuming that a wide string has only a single-byte terminator.
> 
> An alternative would be to do what LWJGL does, and provide multiple methods, for different encodings - e.g.
> 
> https://javadoc.lwjgl.org/org/lwjgl/system/MemoryUtil.html#memASCII(long)
> https://javadoc.lwjgl.org/org/lwjgl/system/MemoryUtil.html#memUTF16(long)
> https://javadoc.lwjgl.org/org/lwjgl/system/MemoryUtil.html#memUTF8(long)
> 
> Anyway, I thought it would be a good idea to propose this PR and see what kind of feedback it generates :-)

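To make the terminator rules in the quoted description concrete, here is a minimal sketch using plain byte arrays rather than the FFM API. The class `CStringSketch` and its helpers (`terminatorSize`, `toCString`, `strlen`, `fromCString`) are hypothetical names invented for illustration and are not part of the patch. It assumes only the 1-byte and 2-byte terminators needed for the `StandardCharsets` constants shown, so the 4-byte `wchar_t` case is not covered. It also shows how scanning a wide string for a single-byte terminator stops too early.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CStringSketch {

    // Terminator width in bytes for a few well-known charsets.
    // Hypothetical helper: an assumption for illustration, not the patch's code.
    public static int terminatorSize(Charset cs) {
        if (cs.equals(StandardCharsets.UTF_8)
                || cs.equals(StandardCharsets.US_ASCII)
                || cs.equals(StandardCharsets.ISO_8859_1)) {
            return 1;
        }
        if (cs.equals(StandardCharsets.UTF_16)
                || cs.equals(StandardCharsets.UTF_16LE)
                || cs.equals(StandardCharsets.UTF_16BE)) {
            return 2;
        }
        throw new IllegalArgumentException("Unsupported charset: " + cs);
    }

    // Encode the string and append a zero terminator of the right width.
    public static byte[] toCString(String s, Charset cs) {
        byte[] encoded = s.getBytes(cs);
        // copyOf pads the extra bytes with zeros, which forms the terminator.
        return Arrays.copyOf(encoded, encoded.length + terminatorSize(cs));
    }

    // Byte length up to (not including) the first terminator-sized run of zeros,
    // scanning in steps of the terminator width.
    public static int strlen(byte[] bytes, Charset cs) {
        int width = terminatorSize(cs);
        for (int i = 0; i + width <= bytes.length; i += width) {
            boolean allZero = true;
            for (int j = 0; j < width; j++) {
                if (bytes[i + j] != 0) {
                    allZero = false;
                    break;
                }
            }
            if (allZero) {
                return i;
            }
        }
        throw new IllegalArgumentException("No terminator found");
    }

    // Decode the bytes that precede the terminator.
    public static String fromCString(byte[] bytes, Charset cs) {
        return new String(bytes, 0, strlen(bytes, cs), cs);
    }

    public static void main(String[] args) {
        byte[] wide = toCString("hi", StandardCharsets.UTF_16LE);
        System.out.println(wide.length);                                   // 6
        System.out.println(strlen(wide, StandardCharsets.UTF_16LE));       // 4
        // Treating the same bytes as a single-byte string stops at the
        // high byte of 'h', which is the mistake described above.
        System.out.println(strlen(wide, StandardCharsets.ISO_8859_1));     // 1
        System.out.println(fromCString(wide, StandardCharsets.UTF_16LE));  // hi
    }
}
```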
The default charset will probably be the one used most often. If so, we could provide some kind of optimized version for it in the future (e.g. using a composed `MethodHandle`?). Perhaps the current approach is good enough, but it would be interesting to see some benchmarks.

-------------

PR Comment: https://git.openjdk.org/panama-foreign/pull/836#issuecomment-1576185386