[External] : Re: MemorySegment APIs for reading and writing strings with known lengths
Jorn Vernee
jorn.vernee at oracle.com
Tue Nov 18 16:52:05 UTC 2025
Coming back to this, I think we've settled on the following three methods:
In MemorySegment:
String getString(long offset, Charset charset, long length); // as in Liam's PR
void copy(String src, Charset dstEncoding, int srcIndex,
MemorySegment dst, int numChars);
And in SegmentAllocator:
MemorySegment allocateFrom(String src, Charset dstEncoding, int srcIndex,
int numChars);
For encoding directly into a memory segment without the need to go to an
intermediate buffer, it looks like we can use the internal
StringCharBuffer class, in combination with the `CharsetEncoder::encode`
method. But of course we can skip encoding altogether when the internal
string encoding matches the target, and just do a bulk copy.
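A rough sketch of the direct-encoding path, using only public API: here
`CharBuffer.wrap(CharSequence, int, int)` stands in for the internal
StringCharBuffer, and the helper name `encodeInto` and its `dstOffset`
parameter are assumptions made for illustration, not the proposed
signature. The bulk-copy fast path is left out because it depends on
String internals.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

final class DirectEncodeSketch {

    // Hypothetical helper (not the proposed API): encodes numChars chars of
    // src, starting at srcIndex, straight into dst at dstOffset.
    // Returns the number of bytes written.
    static int encodeInto(String src, Charset charset, int srcIndex,
                          MemorySegment dst, long dstOffset, int numChars)
            throws CharacterCodingException {
        // Read-only char view over the requested part of the string
        CharBuffer in = CharBuffer.wrap(src, srcIndex, srcIndex + numChars);
        // A ByteBuffer view cannot exceed Integer.MAX_VALUE bytes, so cap it
        long avail = Math.min(dst.byteSize() - dstOffset, Integer.MAX_VALUE);
        ByteBuffer out = dst.asSlice(dstOffset, avail).asByteBuffer();

        CharsetEncoder encoder = charset.newEncoder();
        CoderResult result = encoder.encode(in, out, true);
        if (result.isUnderflow()) {
            result = encoder.flush(out);
        }
        if (!result.isUnderflow()) {
            // Overflow (dst too small) or malformed/unmappable input
            result.throwException();
        }
        return out.position();
    }

    public static void main(String[] args) throws Exception {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocate(16);
            int written = encodeInto("h\u00e9llo world",
                    StandardCharsets.UTF_8, 0, seg, 0, 5);
            System.out.println(written); // 6: U+00E9 takes two bytes in UTF-8
        }
    }
}
```

The encoder writes through the ByteBuffer view into the segment's memory,
so no intermediate byte[] is involved.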
For allocateFrom, since we don't yet have a way to determine the encoded
length of a String, I think we'd still have to go to an intermediate
byte[], and then allocate the result segment based on its length. We can
still avoid the intermediate byte[] in most cases where the encoding of
the String's internal buffer is compatible with the target encoding, and
again just do a bulk copy from the string's internal buffer.
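For illustration, the fallback path through an intermediate byte[] could
look roughly like the sketch below. The static helper is a hypothetical
stand-in for the proposed SegmentAllocator method, and it assumes that
`String::getBytes` behaviour (unmappable characters are replaced rather
than reported) is acceptable. The bulk-copy fast path is again omitted.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.SegmentAllocator;
import java.lang.foreign.ValueLayout;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class AllocateFromSketch {

    // Hypothetical stand-in for the proposed
    // SegmentAllocator::allocateFrom(String, Charset, int, int).
    static MemorySegment allocateFrom(SegmentAllocator allocator, String src,
                                      Charset dstEncoding, int srcIndex,
                                      int numChars) {
        // Encode to an intermediate byte[]; its length is the encoded length
        byte[] encoded = src.substring(srcIndex, srcIndex + numChars)
                            .getBytes(dstEncoding);
        // Allocate exactly encoded.length bytes and copy them in
        // (no terminator is appended)
        return allocator.allocateFrom(ValueLayout.JAVA_BYTE, encoded);
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = allocateFrom(arena, "xxab\u00e9yy",
                    StandardCharsets.UTF_8, 2, 3);
            System.out.println(seg.byteSize());
        }
    }
}
```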
Note on the length parameter for getString: we thought that it might be
possible to open this up to any charset, not just the standard ones we
support now, in which case specifying the length in bytes would be more
flexible, since not every charset has a notion of a 'code unit' (and an
associated unit size). For charsets with a code unit size, converting to
a byte length would be trivial anyway (sorry for the back-and-forth on
that). Right now we can't handle a length > Integer.MAX_VALUE because of
limitations of the ByteBuffer used in the decoding
(CharsetDecoder::decode takes a ByteBuffer as input), but we wanted to
keep this option open for the future, which is why the length is a
`long` above.
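As a sketch of the byte-length semantics, the static helper below
approximates the proposed getString with today's public API. The helper
names are hypothetical, and `Charset::decode` is used for brevity, which
replaces malformed input instead of reporting it. It also shows the
code-unit to byte-length conversion and where the ByteBuffer size limit
comes in.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class GetStringSketch {

    // Hypothetical stand-in for the proposed
    // MemorySegment::getString(long offset, Charset charset, long length),
    // where length is a byte length.
    static String getString(MemorySegment segment, long offset,
                            Charset charset, long length) {
        if (length > Integer.MAX_VALUE) {
            // ByteBuffer-based decoding cannot address more than this today
            throw new IllegalArgumentException("length too large: " + length);
        }
        ByteBuffer in = segment.asSlice(offset, length).asByteBuffer();
        return charset.decode(in).toString();
    }

    // For charsets with a fixed code unit size, the byte length is simply
    // the code unit count times the unit size (e.g. 2 for UTF-16).
    static long byteLength(long codeUnits, int unitSize) {
        return Math.multiplyExact(codeUnits, unitSize);
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            byte[] raw = "hi there".getBytes(StandardCharsets.UTF_16LE);
            MemorySegment seg = arena.allocateFrom(ValueLayout.JAVA_BYTE, raw);
            // Read the first two UTF-16 code units, i.e. four bytes
            System.out.println(getString(seg, 0, StandardCharsets.UTF_16LE,
                    byteLength(2, 2)));
        }
    }
}
```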
Liam, would you be interested in working on these as part of your PR [1]?
Jorn
[1]: https://github.com/openjdk/jdk/pull/28043
[2]:
On 12-11-2025 15:54, Liam Miller-Cushon wrote:
> Thanks. I am convinced :)
>
> On Wed, Nov 12, 2025 at 3:30 PM Maurizio Cimadamore
> <maurizio.cimadamore at oracle.com> wrote:
>
>
> On 12/11/2025 11:40, Liam Miller-Cushon wrote:
>>
>> For the non-\0 terminated strings, you have the String-based
>> MemorySegment::copy I described - e.g.
>>
>> void copy(String srcString, Charset srcCharset, int srcIndex,
>> MemorySegment dstSegment, long dstOffset, int length);
>>
>> With this, we also have two cases:
>>
>> * if the charset is compatible with the string buffer, we
>> just bulk-copy the string buffer (or a portion of it) into
>> the dest segment
>> * otherwise we can encode the srcString directly into the
>> dest segment
>>
>> Thanks! I think I'm caught up now. My misunderstanding was
>> whether MS::ofString was being suggested instead of and not in
>> addition to the bulk copy.
>
> Ah, gotcha.
>
> I think MS::ofString is a possible add-on. To be fair, since
> writing the document I think we've grown a little colder on it, as
> such a view would make for a pretty big footgun, as it would allow
> a native function (invoked via critical downcall handle) to
> directly modify the string buffer (at least in some cases).
> There's also some question about how `MemorySegment::equals`
> should work in this case, as `equals` for heap segments takes into
> account the identity of the underlying heap object.
>
> So, if we could get there with the new `getString`/`copy` + maybe
> some way to determine the length of an encoded string, I think it
> would be preferable/less risky. We could always add `ofString`
> later, if we find a way to address and/or mitigate the issues above.
>
> Maurizio
>