[foreign-memaccess+abi] RFR: 8308858: FFM API and strings

Fri Jun 2 19:08:32 UTC 2023

On Fri, 2 Jun 2023 18:25:33 GMT, Maurizio Cimadamore <mcimadamore at openjdk.org> wrote:

> This patch is an attempt to generalize string support in FFM API. Currently we only support UTF-8 encoding for strings. While this is a sensible default, on Windows strings are often encoded as wide strings, using UTF-16, so it would be nice to support more charsets.
> 
> As a result, this patch adds a Charset-accepting variant to `MemorySegment::getString`, `MemorySegment::setString` and `MemorySegment::allocateString` (the methods have also been renamed to drop their `Utf8` suffix).
> 
> However, not all charsets can be supported. In fact, we only support the charsets defined in the StandardCharset class. While we tried to use charset encoder/decoder to support *any* charset, it seems like the charset interface is too general, and there is no guarantee that what will come out would be interoperable with a null C string.
> 
> In C (and C++), strings are either expressed as `char*` or `wchar_t*`. In the former case, no matter the encoding, the terminator char is expected to be `0x00` (e.g. the size of a `char` value), whereas in the latter case the terminator is expected to be `0x0000` or `0x0000000000` (the size of a `wchar_t` value, depending on platform).
> 
> Moreover, C more or less requires that a string, whether expressed as `char*` or `wchar_t*` cannot contain any element in the array that is a terminator char. That is, in case of extended representations, surrogates must *not* use the reserved terminator values.
> 
> There is no way really to map these restrictions to Java charsets. A Java charset can do pretty much anything it wants, even adding extra character at the end of a string (past the terminator) using the encoder's `flush` method. As a result, we can only really work with charsets we know of (and the list in `StandardCharsets` seems a good starting set).
> 
> The subject of strings is quite tricky, and we noticed that other frameworks tend to get it [wrong](https://github.com/jnr/jnr-ffi/blob/master/src/main/java/jnr/ffi/util/BufferUtil.java#L63), often assuming that a wide string only has a single byte terminator.
> 
> An alternative would be to do what LWJGL does, and provide multiple methods, for different encodings - e.g.
> 
> https://javadoc.lwjgl.org/org/lwjgl/system/MemoryUtil.html#memASCII(long)
> https://javadoc.lwjgl.org/org/lwjgl/system/MemoryUtil.html#memUTF16(long)
> https://javadoc.lwjgl.org/org/lwjgl/system/MemoryUtil.html#memUTF8(long)
> 
> Anyway, I thought it would be a good idea to propose this PR and see what kind of feedback it generates :-)

The use of `@ForceInline` should be removed I think. Since it messes with the JIT's own heuristics, it should be avoided in general, unless we know that the contents of a method is likely to fold up to almost nothing (such as for MS::get*/set*). However, that is not the case here. Better to not mess with it and let the JIT, which has more information, make the decision IMO. (at the end of the day, we bottom out into an out-of-line call to `Unsafe::copyMemory` any way)

src/java.base/share/classes/java/lang/foreign/SegmentAllocator.java line 105:

> 103:      *     <li>{@code B} is the size, in bytes, of the string encoded using the provided charset
> 104:      *     (e.g. {@code str.getBytes(charset).length});</li>
> 105:      *     <li>{@code P} is the size (in bytes) of the terminator char according to the provided charset. For instance,

I guess you mean `N` here?
Suggestion:

     *     <li>{@code H} is the size (in bytes) of the terminator char according to the provided charset. For instance,

src/java.base/share/classes/java/lang/foreign/SegmentAllocator.java line 120:

> 118:         if (termCharSize == -1) {
> 119:             throw new UnsupportedOperationException("Unsupported charset: " + charset);
> 120:         }

This check seems unnecessary since `CharsetKind::of` should already fail for an unsupported charset.

src/java.base/share/classes/jdk/internal/foreign/StringSupport.java line 1:

> 1: package jdk.internal.foreign;

Missing a copyright header.

-------------

PR Review: https://git.openjdk.org/panama-foreign/pull/836#pullrequestreview-1458298994
PR Review Comment: https://git.openjdk.org/panama-foreign/pull/836#discussion_r1214699419
PR Review Comment: https://git.openjdk.org/panama-foreign/pull/836#discussion_r1214708772
PR Review Comment: https://git.openjdk.org/panama-foreign/pull/836#discussion_r1214700517