[External] : Re: MemorySegment APIs for reading and writing strings with known lengths

Tue Nov 18 18:33:30 UTC 2025

You mean 'arrays of bytes' as in byte[]? The FFM API supports off-heap 
memory as well, which can't just be treated as a byte[]. For one, it's 
missing a Java object header. Even if we added a way for a byte[] to 
just be a pointer to some other memory, which would let us wrap a byte[] 
object around a native pointer, the entire JVM would need to be updated 
to handle that alternative format (the current byte[] layout contains no 
such indirection). In other words: the two in-memory representations are 
incompatible.
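
To make the distinction concrete, here is a minimal sketch (not taken from the PR under discussion) using the FFM API as finalized in Java 22. The class name `HeapVsNativeSketch` and the helper `decodeFixed` are made up for illustration. A heap segment is backed by a real `byte[]`, a native segment is not, so reading a fixed-length string from native memory through a `byte[]` currently means copying:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.charset.StandardCharsets;

final class HeapVsNativeSketch {
    // Hypothetical helper: copies `length` bytes out of any segment (heap or
    // native) into a fresh byte[] and decodes them as UTF-8. The copy is what
    // bridges the two in-memory representations.
    static String decodeFixed(MemorySegment segment, long offset, int length) {
        byte[] bytes = segment.asSlice(offset, length).toArray(ValueLayout.JAVA_BYTE);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);
        MemorySegment heap = MemorySegment.ofArray(data); // view over the byte[] itself
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment nativeSeg = arena.allocate(data.length); // off-heap, no object header
            MemorySegment.copy(heap, 0, nativeSeg, 0, data.length);
            System.out.println(heap.heapBase().isPresent());      // true: backed by a Java array
            System.out.println(nativeSeg.heapBase().isPresent()); // false: no heap object behind it
            System.out.println(decodeFixed(nativeSeg, 0, data.length));
        }
    }
}
```

The proposed `getString` overload with an explicit length is meant to make that intermediate `byte[]` unnecessary for callers.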

Jorn

On 18-11-2025 19:20, Jonathan Rosenne wrote:
>
> Excuse my ignorance, but why can't fixed length C strings be treated 
> as arrays of bytes? Java has all the necessary options to convert 
> between byte arrays and strings.
>
> Best Regards,
>
> Jonathan Rosenne
>
> *From:* panama-dev <panama-dev-retn at openjdk.org> *On Behalf Of* Jorn Vernee
> *Sent:* Tuesday, November 18, 2025 6:52 PM
> *To:* Liam Miller-Cushon <cushon at google.com>; Maurizio Cimadamore 
> <maurizio.cimadamore at oracle.com>
> *Cc:* panama-dev at openjdk.org
> *Subject:* Re: [External] : Re: MemorySegment APIs for reading and 
> writing strings with known lengths
>
> Coming back to this, I think we've settled on the following three methods:
>
> In MemorySegment:
>
>     String getString(long offset, Charset charset, long length); // as in Liam's PR
>     void copy(String src, Charset dstEncoding, int srcIndex, MemorySegment dst, int numChars);
>
> And in SegmentAllocator:
>
>     MemorySegment allocateFrom(String src, Charset dstEncoding, int srcIndex, int numChars);
>
> For encoding directly into a memory segment without the need to go to 
> an intermediate buffer, it looks like we can use the internal 
> StringCharBuffer class, in combination with the 
> `CharsetEncoder::encode` method. But of course we can skip encoding 
> altogether when the internal string encoding matches the target, and 
> just do a bulk copy.
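
A rough approximation of the "encode directly into the segment" idea is already possible with public API only. The sketch below is an assumption-laden illustration, not the JDK-internal approach described above: it uses `CharBuffer.wrap` rather than reaching for the internal class directly, relies on `MemorySegment::asByteBuffer` (so the destination slice must fit in a `ByteBuffer`), and does not include the bulk-copy fast path. The class and method names are hypothetical.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class EncodeIntoSegmentSketch {
    // Hypothetical helper: encodes src[srcIndex, srcIndex + numChars) into dst,
    // starting at dstOffset, and returns the number of bytes written.
    static int encodeInto(String src, Charset charset, int srcIndex, int numChars,
                          MemorySegment dst, long dstOffset) {
        CharBuffer in = CharBuffer.wrap(src, srcIndex, srcIndex + numChars);
        // The buffer view writes through to the segment's memory; no byte[] in between.
        // asByteBuffer() fails if the slice is larger than Integer.MAX_VALUE bytes.
        ByteBuffer out = dst.asSlice(dstOffset).asByteBuffer();
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        CoderResult result = encoder.encode(in, out, true);
        if (result.isUnderflow()) {
            result = encoder.flush(out);
        }
        if (result.isOverflow()) {
            throw new IndexOutOfBoundsException("destination segment too small");
        }
        return out.position();
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment dst = arena.allocate(16);
            int written = encodeInto("h\u00e9llo", StandardCharsets.UTF_8, 1, 3, dst, 0);
            System.out.println(written + " bytes written");
        }
    }
}
```

Note that the caller has to size the destination up front, which is exactly the "encoded length of a String" gap mentioned for `allocateFrom`.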
>
> For allocateFrom, since we don't yet have a way to determine the 
> encoded length of a String, I think we'd still have to go to an 
> intermediate byte[], and then allocate the result segment based on its 
> length. We can still avoid the intermediate byte[] in most cases where 
> the encoding of the String's internal buffer is compatible with the 
> target encoding, and again just do a bulk copy from the string's 
> internal buffer.
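
For reference, the fallback path described here (encode to an intermediate `byte[]`, then size the segment from it) can be sketched with today's API roughly as follows. This is only an illustration under assumptions: `AllocateFromSketch` and its static `allocateFrom` are hypothetical stand-ins for the proposed `SegmentAllocator` method, no terminator is appended, and the bulk-copy fast path is omitted.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.SegmentAllocator;
import java.lang.foreign.ValueLayout;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class AllocateFromSketch {
    // Hypothetical stand-in for the proposed method: encodes
    // src[srcIndex, srcIndex + numChars) and allocates a segment of exactly
    // the encoded size.
    static MemorySegment allocateFrom(SegmentAllocator allocator, String src,
                                      Charset dstEncoding, int srcIndex, int numChars) {
        // Intermediate byte[]: needed because the encoded length is not known up front.
        byte[] encoded = src.substring(srcIndex, srcIndex + numChars).getBytes(dstEncoding);
        return allocator.allocateFrom(ValueLayout.JAVA_BYTE, encoded);
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = allocateFrom(arena, "hello world", StandardCharsets.UTF_16LE, 6, 5);
            System.out.println(seg.byteSize()); // 10: five chars, two bytes each in UTF-16LE
        }
    }
}
```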
>
> Note on the length parameter for getString: we thought that it might 
> be possible to open this up to any charset, not just the standard ones 
> we support now, in which case having the length be specified as a byte 
> length would be more flexible, since not every charset might have a 
> notion of 'code unit' (and associated unit size). For charsets with a 
> code unit size, converting to a byte length would be trivial anyway 
> (Sorry for the back-and-forth on that). Right now we can't handle a 
> length > Integer.MAX_VALUE because of limitations of ByteBuffer used 
> in the decoding (CharsetDecoder::decode takes ByteBuffer as input), 
> but we wanted to keep this option open for the future, so that's why 
> the length is a `long` above.
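
The byte-length convention and the `ByteBuffer` limitation can be illustrated with a small sketch against the current API. Everything named here (`FixedLengthReadSketch`, the static `getString`) is hypothetical and only mimics the shape of the proposed method; it is not the implementation in the PR.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class FixedLengthReadSketch {
    // Hypothetical stand-in: decodes `byteLength` bytes starting at `offset`.
    // The length is a long to match the proposed signature, but decoding goes
    // through a ByteBuffer, so anything above Integer.MAX_VALUE is rejected.
    static String getString(MemorySegment segment, long offset, Charset charset, long byteLength) {
        if (byteLength > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("length exceeds ByteBuffer capacity: " + byteLength);
        }
        ByteBuffer in = segment.asSlice(offset, byteLength).asByteBuffer();
        return charset.decode(in).toString();
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocateFrom("abcdef", StandardCharsets.UTF_16LE);
            // For a charset with a fixed code unit size, the byte length is
            // simply (number of code units) * (unit size): 4 * 2 for UTF-16.
            long byteLength = 4L * 2;
            System.out.println(getString(seg, 0, StandardCharsets.UTF_16LE, byteLength)); // abcd
        }
    }
}
```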
>
> Liam, would you be interested in working on these as part of your PR [1]?
>
> Jorn
>
> [1]: https://github.com/openjdk/jdk/pull/28043
> [2]:
>
> On 12-11-2025 15:54, Liam Miller-Cushon wrote:
>
>     Thanks. I am convinced :)
>
>     On Wed, Nov 12, 2025 at 3:30 PM Maurizio Cimadamore
>     <maurizio.cimadamore at oracle.com> wrote:
>
>         On 12/11/2025 11:40, Liam Miller-Cushon wrote:
>
>                 For the non-\0 terminated strings, you have the
>                 String-based MemorySegment::copy I described - e.g.
>
>                 void copy(String srcString, Charset srcCharset, int srcIndex, MemorySegment dstSegment, long dstOffset, int length);
>
>                 With this, we also have two cases:
>
>                 * if the charset is compatible with the string buffer,
>                 we just bulk-copy the string buffer (or a portion of
>                 it) into the dest segment
>                 * otherwise we can encode the srcString directly into
>                 the dest segment
>
>             Thanks! I think I'm caught up now. My misunderstanding was
>             whether MS::ofString was being suggested instead of, rather
>             than in addition to, the bulk copy.
>
>         Ah, gotcha.
>
>         I think MS::ofString is a possible add-on. To be fair, since
>         writing the document I think we've grown a little colder on
>         it, as such a view would make for a pretty big footgun, as it
>         would allow a native function (invoked via critical downcall
>         handle) to directly modify the string buffer (at least in some
>         cases). There's also some question about how
>         `MemorySegment::equals` should work in this case, as `equals`
>         for heap segments takes into account the identity of the
>         underlying heap object.
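
The `equals` point can be seen with ordinary heap segments today. The sketch below (class name hypothetical) shows that two segments over different arrays with identical contents are not `equals`, while a content comparison via `mismatch` reports no difference. How a string-backed view should behave here is the open question; this only shows the existing behaviour for `byte[]`-backed segments.

```java
import java.lang.foreign.MemorySegment;

final class SegmentEqualitySketch {
    // True when both segments denote the same region: same backing object
    // (or both native), same address, same size.
    static boolean sameRegion(MemorySegment a, MemorySegment b) {
        return a.equals(b);
    }

    // True when both segments have the same size and the same bytes.
    static boolean sameContent(MemorySegment a, MemorySegment b) {
        return a.mismatch(b) == -1;
    }

    public static void main(String[] args) {
        byte[] first = {1, 2, 3};
        byte[] second = {1, 2, 3};
        MemorySegment a = MemorySegment.ofArray(first);
        MemorySegment b = MemorySegment.ofArray(second);
        System.out.println(sameRegion(a, b));                            // false: different arrays
        System.out.println(sameRegion(a, MemorySegment.ofArray(first))); // true: same array
        System.out.println(sameContent(a, b));                           // true: same bytes
    }
}
```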
>
>         So, if we could get there with the new `getString`/`copy` +
>         maybe some way to determine the length of an encoded string, I
>         think it would be preferable/less risky. We could always add
>         `ofString` later, if we find a way to address and/or mitigate
>         the issues above.
>
>         Maurizio
>