[External] : Re: MemorySegment APIs for reading and writing strings with known lengths

Jorn Vernee jorn.vernee at oracle.com
Tue Nov 18 19:55:44 UTC 2025


They're not. All Java arrays in HotSpot have similar layouts: object 
header, some padding (this varies between array types), and then the 
actual array data (no indirection). In other words, the same 
restrictions would apply to any Java array type.

Jorn
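
A minimal sketch of the distinction discussed above, assuming a JDK where the FFM API is final (22 or later). The class name `SegmentKinds` and the helper `sum` are made up for this illustration. A heap segment is a view over a real `byte[]` object, while a native segment is raw off-heap memory with no Java object header, yet both can be read through the same `MemorySegment` accessors:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class SegmentKinds {
    // Hypothetical helper: sums the bytes of any segment.
    // The same accessor serves heap and native memory alike.
    static int sum(MemorySegment segment) {
        int total = 0;
        for (long i = 0; i < segment.byteSize(); i++) {
            total += segment.get(ValueLayout.JAVA_BYTE, i);
        }
        return total;
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 3, 4};

        // Heap segment: backed by the byte[] object on the Java heap.
        MemorySegment heap = MemorySegment.ofArray(data);

        try (Arena arena = Arena.ofConfined()) {
            // Native segment: off-heap bytes, not a Java array of any kind.
            MemorySegment offHeap = arena.allocate(data.length);
            MemorySegment.copy(heap, 0, offHeap, 0, data.length);

            System.out.println("heap.isNative()     = " + heap.isNative());
            System.out.println("offHeap.isNative()  = " + offHeap.isNative());
            // heapBase() exposes the backing array only for the heap segment.
            System.out.println("heap has array base = " + heap.heapBase().isPresent());
            System.out.println("offHeap has base    = " + offHeap.heapBase().isPresent());
            System.out.println("same contents       = " + (sum(heap) == sum(offHeap)));
        }
    }
}
```

The off-heap segment can be copied into a `byte[]`, but it cannot itself be handed to code that expects a `byte[]`, which is the incompatibility described in the thread.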

On 18-11-2025 20:43, Jonathan Rosenne wrote:
>
> Why is an array of bytes any different from any other array?
>
> Best Regards,
>
> Jonathan Rosenne
>
> *From:*Jorn Vernee <jorn.vernee at oracle.com>
> *Sent:* Tuesday, November 18, 2025 8:34 PM
> *To:* Jonathan Rosenne <jr at qsm.co.il>; panama-dev at openjdk.org
> *Subject:* Re: [External] : Re: MemorySegment APIs for reading and 
> writing strings with known lengths
>
> You mean 'arrays of bytes' as in byte[]? The FFM API supports off-heap 
> memory as well, which can't just be treated as a byte[]. For one, it's 
> missing a Java object header. Even if we added a way for a byte[] to 
> just be a pointer to some other memory, which would let us wrap a 
> byte[] object around a native pointer, the entire JVM would need to be 
> updated to handle that alternative format (the current byte[] layout 
> contains no such indirection). In other words: the two in-memory 
> representations are incompatible.
>
> Jorn
>
> On 18-11-2025 19:20, Jonathan Rosenne wrote:
>
>     Excuse my ignorance, but why can't fixed length C strings be
>     treated as arrays of bytes? Java has all the necessary options to
>     convert between byte arrays and strings.
>
>     Best Regards,
>
>     Jonathan Rosenne
>
>     *From:*panama-dev <panama-dev-retn at openjdk.org>
>     <mailto:panama-dev-retn at openjdk.org> *On Behalf Of *Jorn Vernee
>     *Sent:* Tuesday, November 18, 2025 6:52 PM
>     *To:* Liam Miller-Cushon <cushon at google.com>
>     <mailto:cushon at google.com>; Maurizio Cimadamore
>     <maurizio.cimadamore at oracle.com>
>     <mailto:maurizio.cimadamore at oracle.com>
>     *Cc:* panama-dev at openjdk.org
>     *Subject:* Re: [External] : Re: MemorySegment APIs for reading and
>     writing strings with known lengths
>
>     Coming back to this, I think we've settled on the following three
>     methods:
>
>     In MemorySegment:
>
>     String getString(long offset, Charset charset, long length); // as in Liam's PR
>     void copy(String src, Charset dstEncoding, int srcIndex, MemorySegment dst, int numChars);
>
>     And in SegmentAllocator:
>
>     MemorySegment allocateFrom(String src, Charset dstEncoding, int srcIndex, int numChars);
>
>     For encoding directly into a memory segment without the need to go
>     to an intermediate buffer, it looks like we can use the internal
>     StringCharBuffer class, in combination with the
>     `CharsetEncoder::encode` method. But of course we can skip
>     encoding altogether when the internal string encoding matches the
>     target, and just do a bulk copy.
>
>     For allocateFrom, since we don't yet have a way to determine the
>     encoded length of a String, I think we'd still have to go to an
>     intermediate byte[], and then allocate the result segment based on
>     its length. We can still avoid the intermediate byte[] in most
>     cases where the encoding of the String's internal buffer is
>     compatible with the target encoding, and again just do a bulk copy
>     from the string's internal buffer.
>
>     Note on the length parameter for getString: we thought that it
>     might be possible to open this up to any charset, not just the
>     standard ones we support now, in which case having the length be
>     specified as a byte length would be more flexible, since not every
>     charset might have a notion of 'code unit' (and associated unit
>     size). For charsets with a code unit size, converting to a byte
>     length would be trivial anyway (sorry for the back-and-forth on
>     that). Right now we can't handle a length > Integer.MAX_VALUE
>     because of limitations of ByteBuffer used in the decoding
>     (CharsetDecoder::decode takes ByteBuffer as input), but we wanted
>     to keep this option open for the future, so that's why the length
>     is a `long` above.
>
>     Liam, would you be interested in working on these as part of your
>     PR [1]?
>
>     Jorn
>
>     [1]: https://github.com/openjdk/jdk/pull/28043
>     [2]:
>
>     On 12-11-2025 15:54, Liam Miller-Cushon wrote:
>
>         Thanks. I am convinced :)
>
>         On Wed, Nov 12, 2025 at 3:30 PM Maurizio Cimadamore
>         <maurizio.cimadamore at oracle.com> wrote:
>
>             On 12/11/2025 11:40, Liam Miller-Cushon wrote:
>
>                     For the non-\0 terminated strings, you have the
>                     String-based MemorySegment::copy I described - e.g.
>
>                     void copy(String srcString, Charset srcCharset, int srcIndex, MemorySegment dstSegment, long dstOffset, int length);
>
>                     With this, we also have two cases:
>
>                     * if the charset is compatible with the string
>                     buffer, we just bulk-copy the string buffer (or a
>                     portion of it) into the dest segment
>                     * otherwise we can encode the srcString directly
>                     into the dest segment
>
>                 Thanks! I think I'm caught up now. My misunderstanding
>                 was whether MS::ofString was being suggested instead
>                 of and not in addition to the bulk copy.
>
>             Ah, gotcha.
>
>             I think MS::ofString is a possible add-on. To be fair,
>             since writing the document I think we've grown a little
>             colder on it, as such a view would make for a pretty big
>             footgun, as it would allow a native function (invoked via
>             critical downcall handle) to directly modify the string
>             buffer (at least in some cases). There's also some
>             question about how `MemorySegment::equals` should work in
>             this case, as `equals` for heap segments takes into
>             account the identity of the underlying heap object.
>
>             So, if we could get there with the new `getString`/`copy`
>             + maybe some way to determine the length of an encoded
>             string, I think it would be preferable/less risky. We
>             could always add `ofString` later, if we find a way to
>             address and/or mitigate the issues above.
>
>             Maurizio
>
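
For comparison, the effect of the proposed known-length `getString` and `copy` methods can be approximated with the API that exists today (JDK 22 or later). This is only a sketch: `FixedLengthStrings`, `readFixed` and `writeFixed` are hypothetical names, not the methods under discussion, and both helpers go through an intermediate `byte[]`, which is exactly the extra copy the proposal aims to avoid. The `dstOffset` parameter follows the earlier `copy` suggestion quoted above.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FixedLengthStrings {
    // Hypothetical stand-in for a known-length getString: decodes exactly
    // byteLength bytes starting at offset, without looking for a terminator.
    static String readFixed(MemorySegment segment, long offset,
                            Charset charset, long byteLength) {
        // toArray copies the slice into a byte[] (limited to int-sized lengths).
        byte[] bytes = segment.asSlice(offset, byteLength)
                              .toArray(ValueLayout.JAVA_BYTE);
        return new String(bytes, charset);
    }

    // Hypothetical stand-in for a String-based copy: encodes numChars chars
    // of src starting at srcIndex and writes them at dstOffset, with no
    // terminator appended. Returns the number of bytes written.
    static int writeFixed(String src, Charset charset, int srcIndex,
                          MemorySegment dst, long dstOffset, int numChars) {
        byte[] encoded = src.substring(srcIndex, srcIndex + numChars)
                            .getBytes(charset);
        MemorySegment.copy(encoded, 0, dst, ValueLayout.JAVA_BYTE,
                           dstOffset, encoded.length);
        return encoded.length;
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment buffer = arena.allocate(16);
            // Write the five chars starting at index 7 into bytes 4..8.
            int written = writeFixed("hello, world", StandardCharsets.UTF_8,
                                     7, buffer, 4, 5);
            System.out.println(readFixed(buffer, 4, StandardCharsets.UTF_8, written));
        }
    }
}
```

Returning the byte count from `writeFixed` is a choice made for this sketch, since the encoded length is not otherwise known to the caller for variable-width charsets.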