MemorySegment APIs for reading and writing strings with known lengths
Liam Miller-Cushon
cushon at google.com
Mon Nov 3 17:57:30 UTC 2025
Hello,
I wanted to share some background on a use-case for FFM that could be
better supported by MemorySegment.
This is related to the draft PR in https://github.com/openjdk/jdk/pull/28043,
and the FRs in https://bugs.openjdk.org/browse/JDK-8369564,
https://bugs.openjdk.org/browse/JDK-8370882.
The use-case is reading and writing strings, where the strings have known
lengths, and where the contents may contain \0. \0 is a legal value that
may appear within user data. The specification for getString acknowledges
this, but the API doesn't provide support for working with those strings.
Additionally, data on the native heap may not use \0 terminators.
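
As a minimal sketch of the embedded \0 problem (assuming Java 22 or later, where the FFM API is final; the class and method names are only for illustration), the following writes five bytes with a \0 in the middle and reads them back with getString:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.charset.StandardCharsets;

public class EmbeddedNulDemo {
    // Hypothetical helper: copies the bytes of "ab\0cd" into native memory
    // and reads them back with the terminator-based getString.
    static String readBack() {
        byte[] data = "ab\0cd".getBytes(StandardCharsets.UTF_8); // 5 bytes, one embedded \0
        try (Arena arena = Arena.ofConfined()) {
            // One extra byte; arena memory is zero-initialized, so it acts as a terminator.
            MemorySegment segment = arena.allocate(data.length + 1);
            MemorySegment.copy(data, 0, segment, ValueLayout.JAVA_BYTE, 0, data.length);
            // getString (UTF-8 by default) stops at the first \0 it finds.
            return segment.getString(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(readBack().length());
    }
}
```

The string that comes back contains only the characters before the embedded \0, even though the caller knows the data is five bytes long.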
As a concrete example, when serializing and deserializing protocol buffers,
the binary format encodes strings as a varint-encoded length (in UTF-8 code
units [1]), followed by the string data. Also, when parsing input,
protobuf will use references to string data in the original input buffer,
which doesn't contain a trailing \0.
[1] https://www.unicode.org/glossary/#code_unit
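
To illustrate the layout described above, here is a rough sketch of reading one length-prefixed string from a MemorySegment with today's APIs. It is not protobuf's actual parser: the class and method names are hypothetical, field tags are ignored, and the varint handling is simplified to lengths that fit in 5 bytes.

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.charset.StandardCharsets;

public class LengthPrefixedString {
    // Hypothetical helper: reads a base-128 varint length at `offset`,
    // then decodes exactly that many bytes as UTF-8. No terminator is involved.
    static String read(MemorySegment segment, long offset) {
        long length = 0;
        int shift = 0;
        while (true) {
            if (shift > 28) {
                throw new IllegalArgumentException("length varint too long");
            }
            byte b = segment.get(ValueLayout.JAVA_BYTE, offset++);
            length |= (long) (b & 0x7F) << shift; // low 7 bits carry the payload
            if ((b & 0x80) == 0) {
                break; // high bit clear: last byte of the varint
            }
            shift += 7;
        }
        // Copies the bytes to the Java heap, then copies again into the String.
        byte[] bytes = segment.asSlice(offset, length).toArray(ValueLayout.JAVA_BYTE);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

The byte following the string data belongs to whatever comes next in the buffer, so a terminator-based read cannot be used here.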
Another example of data on the native heap that may not use \0 is
std::string_view in C++, which is not necessarily null-terminated. By
design, you can have one large string and several string_views that point
to ranges of it, so a view can't rely on a \0 following its last character
(the next byte is still other content of the string).
There are some challenges with using the existing MemorySegment APIs for
these use-cases.
MemorySegment#getString and MemorySegment#setString assume null-terminated
strings. If the string data does contain \0, the getString javadoc
specifies [2] that it won't read the entire string. When using setString,
it will always write the null terminator, which there may not be room for
if the output buffer is exactly sized and the string is being written up to
the end of the buffer.
[2]
https://docs.oracle.com/en/java/javase/25/docs/api/java.base/java/lang/foreign/MemorySegment.html#getString(long)
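
A small sketch of the setString problem, assuming Java 22 or later (the class and method names are made up for the example): the segment is sized to hold exactly the encoded bytes, with no room for a terminator.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.charset.StandardCharsets;

public class ExactFitDemo {
    // Hypothetical helper: reports whether setString succeeds when the
    // destination is exactly as large as the UTF-8 encoding of `s`.
    static boolean setStringFits(String s) {
        byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment segment = arena.allocate(encoded.length); // no room for \0
            try {
                segment.setString(0, s); // also tries to write a trailing \0
                return true;
            } catch (IndexOutOfBoundsException e) {
                return false;
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(setStringFits("hello"));
    }
}
```

Per the setString javadoc, the bounds check accounts for the terminator, so the call fails even though the string data itself would fit.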
MemorySegment#copy can be used to express the desired behaviour for strings
of known length that may contain \0, but it doesn't provide equivalent
performance to getString and setString. After
https://bugs.openjdk.org/browse/JDK-8362893 getString can use
JavaLangAccess#uncheckedNewStringOrThrow to avoid a copy (which C2 may be
able to avoid in the future if https://bugs.openjdk.org/browse/JDK-8364418
is implemented, but is not yet possible). setString is able to use
JavaLangAccess#bytesCompatible and JavaLangAccess#copyToSegmentRaw to elide
copies in some cases, compared to using String#getBytes and
MemorySegment#copy.
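
For reference, the MemorySegment#copy approach to reading looks roughly like the sketch below (the class and method names are hypothetical; UTF-8 is assumed). It handles embedded \0 correctly, but goes through an intermediate byte[] that the String constructor then copies again.

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.charset.StandardCharsets;

public class KnownLengthRead {
    // Hypothetical helper: reads `length` bytes starting at `offset` and
    // decodes them as UTF-8, without looking for a terminator.
    static String getString(MemorySegment segment, long offset, int length) {
        byte[] bytes = new byte[length];
        // First copy: segment -> heap array.
        MemorySegment.copy(segment, ValueLayout.JAVA_BYTE, offset, bytes, 0, length);
        // Second copy: the String constructor decodes into its own storage.
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```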
The straw person alternative I would like to propose is to provide a method
like getString that takes an explicit length and doesn't require a null
terminator, and a method like setString that writes the string without a
null terminator.
For the getString method, the draft in
https://github.com/openjdk/jdk/pull/28043 proposes getString(long offset,
Charset charset, int length), where the length is a length in code units.
So for UTF-8 that would be a length in 8-bit code units.
For setString, one approach would be to also have an explicit length, but
this seems undesirable for a few reasons. Perhaps the API could be defined
so that passing the exact length of the string omits the null terminator,
and passing length + terminator size includes one. However that seems
subtle and could be a footgun, and I don't have a use-case for writing a
smaller substring. Additionally, computing the encoded length in bytes for
UTF-8 adds performance overhead: callers would have to make a pass over the
data to compute the length before calling the API.
Instead of passing a length to setString, it might make more sense to allow
requesting that the null terminator is omitted, for example an
`int setStringWithoutTerminator(String s, long offset)` that returns the
number of bytes or code units that were written (*). (I am not proposing a
specific name or shape for that API, the straw person examples are just to
illustrate the idea.)
(*) The API would return the number of bytes written, because computing it
separately requires a pass over the data, which takes time, and is
nontrivial to do correctly because it depends on handling of e.g.
replacement characters.
This issue is somewhat separate from the other topics here, and the
existing setString methods don't do this, so there's a tradeoff between
consistency and the additional functionality.
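
To make the idea concrete, here is a sketch of what such a method does today when written as a user-level helper on top of String#getBytes and MemorySegment#copy. The class name, the static form, and the parameter order are assumptions for illustration only, and this version has exactly the extra copy that an in-JDK implementation could avoid.

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.charset.StandardCharsets;

public class KnownLengthWrite {
    // Hypothetical helper: writes the UTF-8 encoding of `s` at `offset`
    // without a terminator and returns the number of bytes written.
    static int setStringWithoutTerminator(MemorySegment segment, long offset, String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8); // intermediate copy
        MemorySegment.copy(bytes, 0, segment, ValueLayout.JAVA_BYTE, offset, bytes.length);
        return bytes.length;
    }
}
```

Returning the byte count lets the caller advance its write position without a separate pass over the string to compute the encoded length.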
I would appreciate any thoughts on this, and how best to handle this
use-case in the FFM APIs.
Thanks,
Liam
More information about the panama-dev
mailing list