RFR: 8369564: Provide a MemorySegment API to read strings with known lengths

Wed Oct 29 15:43:25 UTC 2025

On Wed, 29 Oct 2025 10:50:53 GMT, Liam Miller-Cushon <cushon at openjdk.org> wrote:

> This PR proposes adding a new overload to `MemorySegment::getString` that takes a known byte length of the content.
> 
> This was previously proposed in https://github.com/openjdk/jdk/pull/20725, but the outcome of [JDK-8333843](https://bugs.openjdk.org/browse/JDK-8333843) was to update `MemorySegment#getString` to suggest
> 
> 
>     byte[] bytes = new byte[length];
>     MemorySegment.copy(segment, JAVA_BYTE, offset, bytes, 0, length);
>     return new String(bytes, charset);
> 
> 
> However this is less efficient than what the implementation of getString does after [JDK-8362893](https://bugs.openjdk.org/browse/JDK-8362893), it now uses `JavaLangAccess::uncheckedNewStringNoRepl` to avoid the copy.

Some preliminary comments. I didn't look at the tests yet.

src/java.base/share/classes/java/lang/foreign/MemorySegment.java line 1338:

> 1336:      *                access operation will occur
> 1337:      * @param length  byte length to be used for string conversion (not including any
> 1338:      *                null termination)

I think 'to be used for string conversion' is a bit too vague (used _how_?). I think a more descriptive text could be something like 'length in bytes of the string to read' (matching also the pattern of the existing 'offset in bytes').

Also, what happens if:
- The length _does_ include a null terminator
- The length is not a multiple of the byte size of a character in the given charset.

On that last note, I wonder if this shouldn't be the length in bytes, but the length in characters. Then we can compute the byte length from the charset. That will make it impossible to pass a length that is not a multiple of the character size.

src/java.base/share/classes/java/lang/foreign/MemorySegment.java line 1349:

> 1347:      *         <ul>
> 1348:      *             <li>{@code B} is the size, in bytes, of the string encoded using the
> 1349:      *             provided charset (e.g. {@code str.getBytes(charset).length});</li>

Isn't `B` equal to the `length` argument?

src/java.base/share/classes/java/lang/foreign/MemorySegment.java line 1351:

> 1349:      *             provided charset (e.g. {@code str.getBytes(charset).length});</li>
> 1350:      *             <li>{@code N} is the size (in bytes) of the terminator char according
> 1351:      *             to the provided charset. For instance, this is 1 for

Why is the terminator char important? The segment doesn't necessarily need to have a terminator char, right? I don't see this invariant being checked in the code either.

src/java.base/share/classes/java/lang/foreign/MemorySegment.java line 1363:

> 1361:      */
> 1362:     String getString(long offset, int length, Charset charset);
> 1363: 

I'd suggest putting the `length` parameter at the end, so that this becomes a telescoping overload of the length-less variant.

src/java.base/share/classes/jdk/internal/foreign/StringSupport.java line 66:

> 64:             case SINGLE_BYTE -> readByte(segment, offset, len, charset);
> 65:             case DOUBLE_BYTE -> readShort(segment, offset, len, charset);
> 66:             case QUAD_BYTE -> readInt(segment, offset, len, charset);

These 3 methods appear to be identical

-------------

PR Review: https://git.openjdk.org/jdk/pull/28043#pullrequestreview-3394314370
PR Review Comment: https://git.openjdk.org/jdk/pull/28043#discussion_r2473855397
PR Review Comment: https://git.openjdk.org/jdk/pull/28043#discussion_r2473837599
PR Review Comment: https://git.openjdk.org/jdk/pull/28043#discussion_r2473832085
PR Review Comment: https://git.openjdk.org/jdk/pull/28043#discussion_r2473823655
PR Review Comment: https://git.openjdk.org/jdk/pull/28043#discussion_r2473819485