[External] : Re: MemorySegment APIs for reading and writing strings with known lengths
Jonathan Rosenne
jr at qsm.co.il
Tue Nov 18 18:20:25 UTC 2025
Excuse my ignorance, but why can't fixed length C strings be treated as arrays of bytes? Java has all the necessary options to convert between byte arrays and strings.
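For reference, the byte-array route asked about here can be sketched with the API that already exists (JDK 22 or later assumed). The class and method names below are hypothetical. It works, but every read and write goes through an intermediate byte[] copy, which is the overhead the methods proposed in the quoted message aim to avoid.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: fixed-length strings handled as plain byte arrays.
public class FixedLengthViaBytes {

    // Copies byteLength bytes out of the segment, then decodes the copy.
    static String read(MemorySegment segment, long offset, int byteLength, Charset charset) {
        byte[] bytes = segment.asSlice(offset, byteLength).toArray(ValueLayout.JAVA_BYTE);
        return new String(bytes, charset);
    }

    // Encodes into a temporary byte[], then copies it into the segment.
    // Returns the number of bytes written (no terminator is appended).
    static int write(String s, Charset charset, MemorySegment segment, long offset) {
        byte[] bytes = s.getBytes(charset);
        MemorySegment.copy(bytes, 0, segment, ValueLayout.JAVA_BYTE, offset, bytes.length);
        return bytes.length;
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocate(16);
            int n = write("hello", StandardCharsets.UTF_8, seg, 4);
            System.out.println(read(seg, 4, n, StandardCharsets.UTF_8));
        }
    }
}
```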
Best Regards,
Jonathan Rosenne
From: panama-dev <panama-dev-retn at openjdk.org> On Behalf Of Jorn Vernee
Sent: Tuesday, November 18, 2025 6:52 PM
To: Liam Miller-Cushon <cushon at google.com>; Maurizio Cimadamore <maurizio.cimadamore at oracle.com>
Cc: panama-dev at openjdk.org
Subject: Re: [External] : Re: MemorySegment APIs for reading and writing strings with known lengths
Coming back to this, I think we've settled on the following three methods:
In MemorySegment:
String getString(long offset, Charset charset, long length); // as in Liam's PR
void copy(String src, Charset dstEncoding, int srcIndex, MemorySegment dst, int numChars);
And in SegmentAllocator:
MemorySegment allocateFrom(String src, Charset dstEncoding, int srcIndex, int numChars);
For encoding directly into a memory segment without the need to go to an intermediate buffer, it looks like we can use the internal StringCharBuffer class, in combination with the `CharsetEncoder::encode` method. But of course we can skip encoding altogether when the internal string encoding matches the target, and just do a bulk copy.
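As a rough illustration of encoding straight into a segment, the sketch below uses only public API: `CharBuffer.wrap` stands in for the internal StringCharBuffer, and `MemorySegment::asByteBuffer` provides the encoder's output buffer. The class name, method name, and the `dstOffset` parameter are assumptions made for the example, and the bulk-copy fast path is not shown because the string's internal buffer is not reachable from user code.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not the JDK implementation.
public class EncodeIntoSegment {

    // Encodes src[srcIndex, srcIndex + numChars) into dst starting at dstOffset.
    // Returns the number of bytes written. Throws BufferOverflowException if
    // dst is too small, IllegalArgumentException on unencodable input.
    static int encodeInto(String src, Charset dstEncoding, int srcIndex, int numChars,
                          MemorySegment dst, long dstOffset) {
        CharBuffer in = CharBuffer.wrap(src, srcIndex, srcIndex + numChars);
        // View of the destination from dstOffset to the end of the segment.
        ByteBuffer out = dst.asSlice(dstOffset).asByteBuffer();
        CharsetEncoder encoder = dstEncoding.newEncoder();
        try {
            CoderResult result = encoder.encode(in, out, true);
            if (result.isUnderflow()) {
                result = encoder.flush(out);
            }
            if (!result.isUnderflow()) {
                result.throwException();
            }
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("cannot encode input", e);
        }
        return out.position();
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocate(32);
            int written = encodeInto("hello world", StandardCharsets.UTF_8, 6, 5, seg, 2);
            byte[] back = seg.asSlice(2, written).toArray(ValueLayout.JAVA_BYTE);
            System.out.println(new String(back, StandardCharsets.UTF_8));
        }
    }
}
```

Note that `asByteBuffer` is itself limited to segments no larger than Integer.MAX_VALUE bytes, so a real implementation would likely need a different path for larger destinations.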
For allocateFrom, since we don't yet have a way to determine the encoded length of a String, I think we'd still have to go to an intermediate byte[], and then allocate the result segment based on its length. We can still avoid the intermediate byte[] in most cases where the encoding of the String's internal buffer is compatible with the target encoding, and again just do a bulk copy from the string's internal buffer.
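The intermediate-byte[] fallback described above might look roughly like this when written against the current public API. The helper is a hypothetical static method, not the proposed `SegmentAllocator` method itself, and it always takes the slow path since user code cannot inspect the string's internal encoding.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.SegmentAllocator;
import java.lang.foreign.ValueLayout;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not the JDK implementation.
public class AllocateFromSubstring {

    // Encodes src[srcIndex, srcIndex + numChars) to a temporary byte[], then
    // allocates a segment sized exactly to the encoded length (no terminator).
    static MemorySegment allocateFrom(SegmentAllocator allocator, String src,
                                      Charset dstEncoding, int srcIndex, int numChars) {
        byte[] encoded = src.substring(srcIndex, srcIndex + numChars).getBytes(dstEncoding);
        return allocator.allocateFrom(ValueLayout.JAVA_BYTE, encoded);
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment utf8 = allocateFrom(arena, "hello world", StandardCharsets.UTF_8, 6, 5);
            MemorySegment utf16 = allocateFrom(arena, "hello world", StandardCharsets.UTF_16LE, 6, 5);
            System.out.println(utf8.byteSize() + " " + utf16.byteSize());
        }
    }
}
```

The segment size is only known after encoding, which is why a way to compute the encoded length up front would remove the temporary array.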
Note on the length parameter for getString: we thought that it might be possible to open this up to any charset, not just the standard ones we support now, in which case specifying the length as a byte length would be more flexible, since not every charset might have a notion of 'code unit' (and an associated unit size). For charsets with a code unit size, converting to a byte length would be trivial anyway. (Sorry for the back-and-forth on that.) Right now we can't handle a length > Integer.MAX_VALUE because of limitations of the ByteBuffer used in the decoding (CharsetDecoder::decode takes a ByteBuffer as input), but we wanted to keep this option open for the future, which is why the length is a `long` above.
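To make the byte-length semantics concrete, here is a minimal sketch of such a getString written against today's public API. The static helper and its class are hypothetical. It accepts a `long` length but rejects values above Integer.MAX_VALUE, mirroring the ByteBuffer limitation mentioned above. For a charset with a fixed code unit size the caller converts units to bytes first, e.g. 5 UTF-16 code units become 10 bytes.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not the JDK implementation.
public class GetStringWithLength {

    // Decodes byteLength bytes starting at offset; no terminator is looked for.
    static String getString(MemorySegment segment, long offset, Charset charset, long byteLength) {
        if (byteLength < 0 || byteLength > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("unsupported length: " + byteLength);
        }
        ByteBuffer in = segment.asSlice(offset, byteLength).asByteBuffer();
        // Charset::decode replaces malformed input rather than throwing.
        return charset.decode(in).toString();
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            byte[] bytes = "h\u00e9llo".getBytes(StandardCharsets.UTF_16LE); // 5 units, 10 bytes
            MemorySegment seg = arena.allocate(32);
            MemorySegment.copy(bytes, 0, seg, ValueLayout.JAVA_BYTE, 3, bytes.length);
            System.out.println(getString(seg, 3, StandardCharsets.UTF_16LE, 5 * 2L));
        }
    }
}
```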
Liam, would you be interested in working on these as part of your PR [1]?
Jorn
[1]: https://github.com/openjdk/jdk/pull/28043
On 12-11-2025 15:54, Liam Miller-Cushon wrote:
Thanks. I am convinced :)
On Wed, Nov 12, 2025 at 3:30 PM Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote:
On 12/11/2025 11:40, Liam Miller-Cushon wrote:
For the non-\0 terminated strings, you have the String-based MemorySegment::copy I described - e.g.
void copy(String srcString, Charset srcCharset, int srcIndex, MemorySegment dstSegment, long dstOffset, int length);
With this, we also have two cases:
* if the charset is compatible with the string buffer, we just bulk-copy the string buffer (or a portion of it) into the dest segment
* otherwise we can encode the srcString directly into the dest segment
Thanks! I think I'm caught up now. My misunderstanding was about whether MS::ofString was being suggested instead of the bulk copy, rather than in addition to it.
Ah, gotcha.
I think MS::ofString is a possible add-on. To be fair, since writing the document I think we've grown a little colder on it, as such a view would make for a pretty big footgun, as it would allow a native function (invoked via critical downcall handle) to directly modify the string buffer (at least in some cases). There's also some question about how `MemorySegment::equals` should work in this case, as `equals` for heap segments takes into account the identity of the underlying heap object.
So, if we could get there with the new `getString`/`copy` + maybe some way to determine the length of an encoded string, I think it would be preferable/less risky. We could always add `ofString` later, if we find a way to address and/or mitigate the issues above.
Maurizio