[foreign-memaccess+abi] RFR: 8308858: FFM API and strings

Fri Jun 2 18:32:31 UTC 2023

This patch is an attempt to generalize string support in the FFM API. Currently we only support the UTF-8 encoding for strings. While this is a sensible default, on Windows strings are often encoded as wide strings, using UTF-16, so it would be nice to support more charsets.

As a result, this patch adds a `Charset`-accepting variant to `MemorySegment::getString`, `MemorySegment::setString` and `SegmentAllocator::allocateString` (the methods have also been renamed to drop their `Utf8` suffix).

However, not all charsets can be supported. In fact, we only support the charsets defined in the `StandardCharsets` class. While we tried to use the charset encoder/decoder to support *any* charset, it seems like the charset interface is too general, and there is no guarantee that what comes out would be interoperable with a null-terminated C string.

In C (and C++), strings are either expressed as `char*` or `wchar_t*`. In the former case, no matter the encoding, the terminator char is expected to be `0x00` (i.e. the size of a `char` value), whereas in the latter case the terminator is expected to be `0x0000` or `0x00000000` (the size of a `wchar_t` value, which is 2 or 4 bytes depending on the platform).
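To make the terminator widths concrete, here is a minimal sketch that uses only `java.nio.charset`, not the FFM API itself. The class `CStringTerminators` and its methods `terminatorSize` and `toCString` are hypothetical names chosen for illustration and are not part of the patch. The sketch assumes a 1-byte terminator for the 8-bit charsets in `StandardCharsets` and a 2-byte terminator for the UTF-16 family.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, for illustration only (not part of the patch).
final class CStringTerminators {

    // Width in bytes of the terminator assumed for each supported charset.
    static int terminatorSize(Charset cs) {
        if (cs.equals(StandardCharsets.UTF_8)
                || cs.equals(StandardCharsets.ISO_8859_1)
                || cs.equals(StandardCharsets.US_ASCII)) {
            return 1; // char* style: a single 0x00 byte
        }
        if (cs.equals(StandardCharsets.UTF_16)
                || cs.equals(StandardCharsets.UTF_16BE)
                || cs.equals(StandardCharsets.UTF_16LE)) {
            return 2; // 16-bit wide string: 0x0000
        }
        // Any other charset is rejected, since its terminator is unknown here.
        throw new IllegalArgumentException("Unsupported charset: " + cs);
    }

    // Encodes the string and appends a zero terminator of the right width.
    static byte[] toCString(String s, Charset cs) {
        byte[] encoded = s.getBytes(cs);
        // Arrays.copyOf pads the extra trailing bytes with zeros.
        return Arrays.copyOf(encoded, encoded.length + terminatorSize(cs));
    }
}
```

With this sketch, `toCString("hi", StandardCharsets.UTF_8)` has length 3, while `toCString("hi", StandardCharsets.UTF_16LE)` has length 6: four bytes of payload plus a two-byte terminator.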

Moreover, C more or less requires that a string, whether expressed as `char*` or `wchar_t*`, cannot contain any array element that is equal to the terminator char. That is, in the case of extended representations, surrogates must *not* use the reserved terminator values.

There is no real way to map these restrictions to Java charsets. A Java charset can do pretty much anything it wants, even adding extra characters at the end of a string (past the terminator) using the encoder's `flush` method. As a result, we can only really work with charsets we know of (and the list in `StandardCharsets` seems a good starting set).

The subject of strings is quite tricky, and we noticed that other frameworks tend to get it [wrong](https://github.com/jnr/jnr-ffi/blob/master/src/main/java/jnr/ffi/util/BufferUtil.java#L63), often assuming that a wide string only has a single byte terminator.
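As an illustration of that failure mode (a sketch of the general problem, not a reproduction of the linked code), the following compares a byte-wise terminator scan with a scan in 2-byte units over a UTF-16LE buffer. The class `WideTerminatorScan` and its methods are hypothetical names introduced here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo, for illustration only.
final class WideTerminatorScan {

    // Naive scan: stops at the first zero byte, as if the data were a char* string.
    static int naiveLength(byte[] buf) {
        for (int i = 0; i < buf.length; i++) {
            if (buf[i] == 0) return i;
        }
        return buf.length;
    }

    // Wide scan: walks the buffer in 2-byte units and stops at the first 0x0000 unit.
    static int wideLength(byte[] buf) {
        for (int i = 0; i + 1 < buf.length; i += 2) {
            if (buf[i] == 0 && buf[i + 1] == 0) return i;
        }
        return buf.length;
    }

    public static void main(String[] args) {
        // "hi" in UTF-16LE is 68 00 69 00; append a 2-byte zero terminator.
        byte[] encoded = "hi".getBytes(StandardCharsets.UTF_16LE);
        byte[] buf = Arrays.copyOf(encoded, encoded.length + 2);

        System.out.println(naiveLength(buf)); // 1: stops inside the first code unit
        System.out.println(wideLength(buf));  // 4: the full payload
        System.out.println(new String(buf, 0, wideLength(buf), StandardCharsets.UTF_16LE)); // hi
    }
}
```

The byte-wise scan stops at the high byte of the first code unit, so any ASCII text encoded as UTF-16 would be cut off after one byte.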

An alternative would be to do what LWJGL does, and provide separate methods for different encodings, e.g.:

https://javadoc.lwjgl.org/org/lwjgl/system/MemoryUtil.html#memASCII(long)
https://javadoc.lwjgl.org/org/lwjgl/system/MemoryUtil.html#memUTF16(long)
https://javadoc.lwjgl.org/org/lwjgl/system/MemoryUtil.html#memUTF8(long)

Anyway, I thought it would be a good idea to propose this PR and see what kind of feedback it generates :-)

-------------

Commit messages:
 - Remove unused code
 - Initial push

Changes: https://git.openjdk.org/panama-foreign/pull/836/files
 Webrev: https://webrevs.openjdk.org/?repo=panama-foreign&pr=836&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8308858
  Stats: 345 lines in 14 files changed: 246 ins; 38 del; 61 mod
  Patch: https://git.openjdk.org/panama-foreign/pull/836.diff
  Fetch: git fetch https://git.openjdk.org/panama-foreign.git pull/836/head:pull/836

PR: https://git.openjdk.org/panama-foreign/pull/836