MemorySegment APIs for reading and writing strings with known lengths
Jorn Vernee
jorn.vernee at oracle.com
Mon Nov 3 18:47:37 UTC 2025
Hey Liam,
Thanks for opening this discussion. I have had similar thoughts about
getString in the past, i.e. what if you already know the length of the
string? Figuring out the string's length based on the null terminator is
redundant. Also, as you say, there may not be a null terminator in the
first place. And finally, since getString already accepts a byte offset
to the start of the string, adding a length parameter would allow users
to read substrings from memory segments as well.
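
For reference, here is a minimal sketch of how a string of known length
can be read with the existing API today, at the cost of an intermediate
byte[]. The class and method names are hypothetical and only for
illustration; this is not the API proposed in the PR.

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.charset.Charset;

// Hypothetical helper for illustration; not part of the FFM API.
final class KnownLengthStrings {
    // Reads byteLength bytes starting at offset and decodes them with the
    // given charset. No null terminator is required, and embedded \0
    // bytes are preserved in the result.
    static String readString(MemorySegment segment, long offset,
                             int byteLength, Charset charset) {
        // asSlice bounds-checks the range; toArray copies it into a byte[].
        byte[] bytes = segment.asSlice(offset, byteLength)
                              .toArray(ValueLayout.JAVA_BYTE);
        // The String constructor decodes the bytes, which copies them again.
        return new String(bytes, charset);
    }
}
```

The byte[] created by toArray is the extra copy that a getString
overload with an explicit length could potentially avoid.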
About the copy elision, this is something that we have seen show up in
benchmarks [1]. Were there any benchmarks in which you've
seen this difference show up as well, or is this more of a theoretical
benefit? It would be good to understand how important performance is in
all of this, or if it's more about API usability. Also, I should note
that in the case of setString, we only avoid the extra byte[] when the
source and target encodings are compatible.
With regards to setString, as you say, the most important part is that
we allow writing a string without a null terminator. Even if the method
accepted an explicit length, I think we'd need an offset as well, so a
user could write substrings. It's possible to call `getBytes` on a
string and copy the resulting byte[] into the memory segment as well,
but I suppose you want to avoid that because of the extra copy of the
byte[] (although I think perhaps C2 can elide the extra object in that
case)? Have you tried looping over the string and manually copying each
character using charAt?
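
To make the two workarounds mentioned above concrete, here is a rough
sketch using only existing API. The class and method names are
hypothetical, and the charAt variant assumes a Latin-1 target encoding
to keep the loop simple.

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.charset.Charset;

// Hypothetical helper for illustration; not part of the FFM API.
final class UnterminatedStringWriter {
    // Encodes s into a temporary byte[] and copies it into the segment
    // at offset, without a null terminator. Returns the bytes written.
    static int writeString(MemorySegment segment, long offset,
                           String s, Charset charset) {
        byte[] bytes = s.getBytes(charset); // the intermediate byte[]
        MemorySegment.copy(bytes, 0, segment, ValueLayout.JAVA_BYTE,
                           offset, bytes.length);
        return bytes.length;
    }

    // Copies one char at a time using charAt, assuming every char fits
    // in ISO-8859-1. Chars written before a failure are left in place.
    static int writeLatin1(MemorySegment segment, long offset, String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 0xFF) {
                throw new IllegalArgumentException(
                        "not representable in ISO-8859-1 at index " + i);
            }
            segment.set(ValueLayout.JAVA_BYTE, offset + i, (byte) c);
        }
        return s.length();
    }
}
```

Both methods can fill an exactly sized buffer, since neither writes a
terminator.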
All in all though, I have a bit of a feeling that there's a missing API
primitive somewhere. I was thinking that maybe we can add a way to
create a (read-only) MemorySegment view of a String (e.g.
MemorySegment::ofString), but the internal encoding of String is not
part of the API, so what is the encoding of the string in the memory
segment? JNI's GetStringCritical deals with this by inflating the string
to UTF-16 if a string is not in that encoding already. Alternatively, we
could let the user specify a target Charset, and then we elide a copy if
the encodings are compatible, but that would lead to potentially
unpredictable performance. Also, if you're in the copying case, you'd
really like the string to be written into a memory segment directly with
the right encoding, rather than having an intermediate byte[] again.
I think I'm slowly coming to the conclusion that we should just treat
Strings as another source and destination format for data, with the
caveat that we cannot modify a String in place, so any read operations
will have to create a new String instance instead. This creates some
asymmetry with the existing MemorySegment::copy methods. I think because
of that restriction, we might have to accept some asymmetry between the
read and write APIs for strings as well.
Jorn
[1]: https://github.com/openjdk/jdk/pull/26493
On 3-11-2025 18:57, Liam Miller-Cushon wrote:
> Hello,
>
> I wanted to share some background on a use-case for FFM that could be
> better supported by MemorySegment.
>
> This is related to the draft PR in
> https://github.com/openjdk/jdk/pull/28043, and the FRs in
> https://bugs.openjdk.org/browse/JDK-8369564,
> https://bugs.openjdk.org/browse/JDK-8370882.
>
> The use-case is reading and writing strings, where the strings have
> known lengths, and where the contents may contain \0. \0 is a legal
> value that may appear within user data, which the specification for
> getString acknowledges, but doesn't provide support for working with
> those strings. Additionally, data on the native heap may not use \0
> terminators.
>
> As a concrete example, when serializing and deserializing protocol
> buffers, the binary format encodes strings as a varint encoded length
> (in UTF-8 code units [1]), followed by the string data. Also,
> when parsing input, protobuf will use references to string data in the
> original input buffer which didn't contain a trailing \0.
>
> [1] https://www.unicode.org/glossary/#code_unit
>
> Another example of data on the native heap that may not use \0 is
> std::string_view in C++, which is not necessarily null-terminated; by
> design with string_views you can have one large string and
> string_views which point to ranges of that, which means they can't
> have a \0 as the next character after (next byte is still other
> content of the string).
>
> There are some challenges with using the existing MemorySegment APIs
> for these use-cases.
>
> MemorySegment#getString and MemorySegment#setString assume
> null-terminated strings. If the string data does contain \0, the
> getString javadoc specifies [2] that it won't read the entire string.
> When using setString, it will always write the null terminator, which
> there may not be room for if the output buffer is exactly sized and
> the string is being written up to the end of the buffer.
>
> [2]
> https://docs.oracle.com/en/java/javase/25/docs/api/java.base/java/lang/foreign/MemorySegment.html#getString(long)
>
> MemorySegment#copy can be used to express the desired behaviour for
> strings of known length that may contain \0, but it doesn't provide
> equivalent performance to getString and setString. After
> https://bugs.openjdk.org/browse/JDK-8362893 getString can use
> JavaLangAccess#uncheckedNewStringOrThrow to avoid a copy (which C2 may
> be able to avoid in the future if
> https://bugs.openjdk.org/browse/JDK-8364418 is implemented, but is not
> yet possible). setString is able to use JavaLangAccess#bytesCompatible
> and JavaLangAccess#copyToSegmentRaw to elide copies in some cases,
> compared to using String#getBytes and MemorySegment#copy.
>
> The straw person alternative I would like to propose is to provide a
> method like getString that takes an explicit length and doesn't
> require a null terminator, and a method like setString that writes the
> string without a null terminator.
>
> For the getString method, the draft in
> https://github.com/openjdk/jdk/pull/28043 proposes getString(long
> offset, Charset charset, int length), where the length is a length in
> code units. So for UTF-8 that would be a length in 8 bit code units.
>
> For setString, one approach would be to also have an explicit length,
> but this seems undesirable for a few reasons. Perhaps the API would
> specify that setting the length to the exact length of the string omits
> the null terminator, and length + terminator size includes a
> terminator. However, that seems subtle and could be a footgun, and
> I don't have a use-case for writing a smaller substring.
> Additionally, computing the encoded length in bytes for UTF-8 requires
> computation and adds performance overhead: callers would have to make
> a pass over the data to compute the length before calling the API.
>
> Instead of passing a length to setString, it might make more sense to
> allow requesting that the null terminator is omitted, for example a
> `int setStringWithoutTerminator(String s, long offset)`, that returned
> the number of bytes or code units that were written (*). (I am not
> proposing a specific name or shape for that API, the straw person
> examples are just to illustrate the idea.)
>
> (*) The API returns the number of bytes written, because computing it
> requires a pass over the data, which takes time, and is nontrivial to
> do correctly because it depends on handling of e.g. replacement
> characters. This issue is somewhat separate from the other topics
> here, and the existing setString methods don't do this, so there's a
> tradeoff between consistency and the additional functionality.
>
> I would appreciate any thoughts on this, and how best to handle this
> use-case in the FFM APIs.
>
> Thanks,
> Liam