[External] : Re: MemorySegment APIs for reading and writing strings with known lengths
Jorn Vernee
jorn.vernee at oracle.com
Tue Nov 18 18:33:30 UTC 2025
You mean 'arrays of bytes' as in byte[]? The FFM API supports off-heap
memory as well, which can't just be treated as a byte[]. For one, it's
missing a Java object header. Even if we added a way for a byte[] to
just be a pointer to some other memory, which would let us wrap a byte[]
object around a native pointer, the entire JVM would need to be updated
to handle that alternative format (the current byte[] layout contains no
such indirection). In other words: the two in-memory representations are
incompatible.
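
To make the difference concrete, here is a minimal sketch using only the existing FFM API (Java 22+). The class and method names are made up for illustration. A heap segment has a backing Java object, a native segment does not, and getting a byte[] out of off-heap memory always means copying:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of any proposal in this thread.
class OffHeapBytesSketch {
    // True if the segment is backed by a Java object (e.g. a byte[]).
    static boolean hasHeapBase(MemorySegment segment) {
        return segment.heapBase().isPresent();
    }

    // Reads `length` bytes at `offset` and decodes them as UTF-8.
    // toArray copies into a new byte[]; the array never aliases the segment.
    static String readFixed(MemorySegment segment, long offset, int length) {
        byte[] copy = segment.asSlice(offset, length).toArray(ValueLayout.JAVA_BYTE);
        return new String(copy, StandardCharsets.UTF_8);
    }

    // Writes "hello" into off-heap memory and reads it back through a copy.
    static String demo() {
        try (Arena arena = Arena.ofConfined()) {
            byte[] src = "hello".getBytes(StandardCharsets.UTF_8);
            MemorySegment offHeap = arena.allocate(src.length);
            MemorySegment.copy(src, 0, offHeap, ValueLayout.JAVA_BYTE, 0, src.length);
            if (hasHeapBase(offHeap)) {
                throw new IllegalStateException("expected a native segment");
            }
            return readFixed(offHeap, 0, src.length);
        }
    }
}
```

The copy in `readFixed` is the intermediate step that the new methods discussed below are meant to avoid.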
Jorn
On 18-11-2025 19:20, Jonathan Rosenne wrote:
>
> Excuse my ignorance, but why can't fixed length C strings be treated
> as arrays of bytes? Java has all the necessary options to convert
> between byte arrays and strings.
>
> Best Regards,
>
> Jonathan Rosenne
>
> *From:*panama-dev <panama-dev-retn at openjdk.org> *On Behalf Of *Jorn Vernee
> *Sent:* Tuesday, November 18, 2025 6:52 PM
> *To:* Liam Miller-Cushon <cushon at google.com>; Maurizio Cimadamore
> <maurizio.cimadamore at oracle.com>
> *Cc:* panama-dev at openjdk.org
> *Subject:* Re: [External] : Re: MemorySegment APIs for reading and
> writing strings with known lengths
>
> Coming back to this, I think we've settled on the following three methods:
>
> In MemorySegment:
>
> String getString(long offset, Charset charset, long length); // as in Liam's PR
> void copy(String src, Charset dstEncoding, int srcIndex, MemorySegment dst, int numChars);
>
> And in SegmentAllocator:
>
> MemorySegment allocateFrom(String src, Charset dstEncoding, int srcIndex, int numChars);
>
> For encoding directly into a memory segment without the need to go to
> an intermediate buffer, it looks like we can use the internal
> StringCharBuffer class, in combination with the
> `CharsetEncoder::encode` method. But of course we can skip encoding
> altogether when the internal string encoding matches the target, and
> just do a bulk copy.
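
A rough sketch of the direct-encoding idea, using only public API: `CharBuffer.wrap` over the string range and a `ByteBuffer` view of the destination segment. The helper name `encodeInto`, its `dstOffset` parameter and its return value are assumptions for illustration; the actual implementation would live inside the JDK and could take the bulk-copy shortcut, which public code cannot do.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not the proposed JDK method.
class EncodeIntoSegmentSketch {
    // Encodes src[srcIndex, srcIndex + numChars) straight into dst at dstOffset,
    // without an intermediate byte[]. Returns the number of bytes written.
    static int encodeInto(String src, Charset charset, int srcIndex, int numChars,
                          MemorySegment dst, long dstOffset) {
        CharBuffer in = CharBuffer.wrap(src, srcIndex, srcIndex + numChars);
        // ByteBuffer view over the rest of the segment (limited to < 2 GiB).
        ByteBuffer out = dst.asSlice(dstOffset).asByteBuffer();
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        CoderResult result = encoder.encode(in, out, true);
        if (result.isUnderflow()) {
            result = encoder.flush(out);
        }
        if (result.isOverflow()) {
            throw new IndexOutOfBoundsException("destination segment too small");
        }
        return out.position();
    }

    // Encodes "h\u00e9llo" (5 chars) as UTF-8 at offset 4 and reads it back.
    static String demo() {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment dst = arena.allocate(16);
            int written = encodeInto("xxh\u00e9llo", StandardCharsets.UTF_8, 2, 5, dst, 4);
            byte[] bytes = dst.asSlice(4, written).toArray(ValueLayout.JAVA_BYTE);
            return written + ":" + new String(bytes, StandardCharsets.UTF_8);
        }
    }
}
```

Note that the number of bytes written (6 here) differs from the number of chars (5), which is why the caller cannot size the destination from `numChars` alone.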
>
> For allocateFrom, since we don't yet have a way to determine the
> encoded length of a String, I think we'd still have to go to an
> intermediate byte[], and then allocate the result segment based on its
> length. We can still avoid the intermediate byte[] in most cases where
> the encoding of the String's internal buffer is compatible with the
> target encoding, and again just do a bulk copy from the string's
> internal buffer.
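
The fallback path described above might look roughly like this when written against the public API. The static helper below is a stand-in for the proposed `SegmentAllocator` method, not its implementation, and it always takes the intermediate byte[] route:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.SegmentAllocator;
import java.lang.foreign.ValueLayout;
import java.nio.charset.Charset;

// Hypothetical illustration class; not the proposed JDK method.
class AllocateFromSketch {
    // Encodes src[srcIndex, srcIndex + numChars) into a temporary byte[] and
    // allocates a segment of exactly that size. No terminator is appended.
    static MemorySegment allocateFrom(SegmentAllocator allocator, String src,
                                      Charset dstEncoding, int srcIndex, int numChars) {
        byte[] encoded = src.substring(srcIndex, srcIndex + numChars).getBytes(dstEncoding);
        return allocator.allocateFrom(ValueLayout.JAVA_BYTE, encoded);
    }

    // Returns the size in bytes of the segment allocated for the given range.
    static long encodedSize(String src, Charset dstEncoding, int srcIndex, int numChars) {
        try (Arena arena = Arena.ofConfined()) {
            return allocateFrom(arena, src, dstEncoding, srcIndex, numChars).byteSize();
        }
    }
}
```

The segment size is only known after encoding, which is the reason the intermediate array is needed until there is a way to compute the encoded length of a String up front.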
>
> Note on the length parameter for getString: we thought that it might
> be possible to open this up to any charset, not just the standard ones
> we support now, in which case having the length be specified as a byte
> length would be more flexible, since not every charset might have a
> notion of 'code unit' (and associated unit size). For charsets with a
> code unit size, converting to a byte length would be trivial anyway
> (Sorry for the back-and-forth on that). Right now we can't handle a
> length > Integer.MAX_VALUE because of limitations of ByteBuffer used
> in the decoding (CharsetDecoder::decode takes ByteBuffer as input),
> but we wanted to keep this option open for the future, so that's why
> the length is a `long` above.
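
As an illustration of the byte-length convention and the current `ByteBuffer` limit, a sketch along these lines can be written today. It is an emulation with a made-up static helper, not the code in the PR:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not the code from the PR.
class GetStringSketch {
    // Decodes byteLength bytes starting at offset. The length is a long, but
    // decoding goes through a ByteBuffer, so anything above Integer.MAX_VALUE
    // has to be rejected for now.
    static String getString(MemorySegment segment, long offset, Charset charset, long byteLength) {
        if (byteLength < 0 || byteLength > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("unsupported length: " + byteLength);
        }
        ByteBuffer bytes = segment.asSlice(offset, byteLength).asByteBuffer();
        return charset.decode(bytes).toString();
    }

    // Stores "abc" as UTF-16LE after 2 bytes of padding and reads it back.
    static String demo() {
        try (Arena arena = Arena.ofConfined()) {
            byte[] encoded = "abc".getBytes(StandardCharsets.UTF_16LE);
            MemorySegment segment = arena.allocate(2 + encoded.length);
            MemorySegment.copy(encoded, 0, segment, ValueLayout.JAVA_BYTE, 2, encoded.length);
            // For a charset with a fixed code unit size, the byte length is
            // code units times unit size: 3 units * 2 bytes.
            long byteLength = 3L * 2;
            return getString(segment, 2, StandardCharsets.UTF_16LE, byteLength);
        }
    }
}
```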
>
> Liam, would you be interested in working on these as part of your PR [1]?
>
> Jorn
>
> [1]: https://github.com/openjdk/jdk/pull/28043
> [2]:
>
> On 12-11-2025 15:54, Liam Miller-Cushon wrote:
>
> Thanks. I am convinced :)
>
> On Wed, Nov 12, 2025 at 3:30 PM Maurizio Cimadamore
> <maurizio.cimadamore at oracle.com> wrote:
>
> On 12/11/2025 11:40, Liam Miller-Cushon wrote:
>
> For the non-\0 terminated strings, you have the
> String-based MemorySegment::copy I described - e.g.
>
> void copy(String srcString, Charset srcCharset, int srcIndex, MemorySegment dstSegment, long dstOffset, int length);
>
> With this, we also have two cases:
>
> * if the charset is compatible with the string buffer,
> we just bulk-copy the string buffer (or a portion of
> it) into the dest segment
> * otherwise we can encode the srcString directly into
> the dest segment
>
> Thanks! I think I'm caught up now. I had misunderstood
> MS::ofString as being suggested instead of the bulk copy,
> rather than in addition to it.
>
> Ah, gotcha.
>
> I think MS::ofString is a possible add-on. To be fair, since
> writing the document I think we've grown a little colder on
> it, as such a view would make for a pretty big footgun, as it
> would allow a native function (invoked via critical downcall
> handle) to directly modify the string buffer (at least in some
> cases). There's also some question about how
> `MemorySegment::equals` should work in this case, as `equals`
> for heap segments takes into account the identity of the
> underlying heap object.
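
The identity point about `equals` can be seen with plain byte[] segments today. A small sketch (the class name is made up): two heap segments over arrays with the same contents are not equal, while `mismatch` compares contents.

```java
import java.lang.foreign.MemorySegment;

// Hypothetical illustration class.
class SegmentEqualsSketch {
    // Two segments wrapping the very same array compare equal.
    static boolean sameArrayEqual() {
        byte[] a = {1, 2, 3};
        return MemorySegment.ofArray(a).equals(MemorySegment.ofArray(a));
    }

    // Segments over distinct arrays with identical contents do not.
    static boolean copiedArrayEqual() {
        byte[] a = {1, 2, 3};
        return MemorySegment.ofArray(a).equals(MemorySegment.ofArray(a.clone()));
    }

    // mismatch compares contents and returns -1 when there is no difference.
    static long mismatchOfCopy() {
        byte[] a = {1, 2, 3};
        return MemorySegment.ofArray(a).mismatch(MemorySegment.ofArray(a.clone()));
    }
}
```

A string-backed view would have to pick a behavior here, since equal strings need not share the same internal buffer.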
>
> So, if we could get there with the new `getString`/`copy` +
> maybe some way to determine the length of an encoded string, I
> think it would be preferable/less risky. We could always add
> `ofString` later, if we find a way to address and/or mitigate
> the issues above.
>
> Maurizio
>