Re: MemorySegment APIs for reading and writing strings with known lengths
Jorn Vernee
jorn.vernee at oracle.com
Tue Nov 4 14:06:37 UTC 2025
> I wouldn't have expected looping with charAt to be competitive with
> the fast paths in StringSupport where the bytes are compatible, or is
> that not right?
The fast paths in StringSupport call an out-of-line stub that does a
vectorized copy. At least in theory, C2's auto-vectorizer should be able
to do the exact same thing for a manual loop using charAt, but inline,
so it might even be faster, especially for small strings. That's why
it would be good to try that approach and see how it compares.
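
A minimal sketch of the manual charAt loop described above, assuming the target encoding is Latin-1 so that each char maps to one byte. The class and method names (CharAtCopy, copyLatin1) are hypothetical and only for illustration. Whether C2 actually vectorizes this loop is exactly what would need to be measured.

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

final class CharAtCopy {
    // Hypothetical helper: copies s into dst starting at dstOffset, one byte per char.
    // Returns false if a char outside Latin-1 is found; a prefix of the string
    // may already have been written to dst in that case.
    static boolean copyLatin1(String s, MemorySegment dst, long dstOffset) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 0xFF) {
                return false; // not representable in a single Latin-1 byte
            }
            dst.set(ValueLayout.JAVA_BYTE, dstOffset + i, (byte) c);
        }
        return true;
    }
}
```

No null terminator is written, which matches the known-length use case discussed in this thread.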
> Do you have a feeling for how that approach might best be exposed in
> the API? Do you think it might look like more variations of
> getString/setString in MemorySegment? Or that there might be a missing
> API primitive that could encapsulate those String sources and
> destinations? Or something else?
I was thinking primarily along the lines of adding a MemorySegment::copy
overload that accepts a String as a source (as opposed to e.g. an array),
for copying from a string to a memory segment only. We should probably
also add an overload to SegmentAllocator::allocateFrom that accepts an
offset and a length (we already have two for full strings). These two
overloads could fully support the substring use case without looking
too out of place.
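
For comparison, here is a rough sketch of how the substring use case can be written today with the existing array-based MemorySegment::copy. The helper name writeSubstring is hypothetical. It goes through substring and getBytes, so it pays for the intermediate String and byte[] that a String-accepting copy overload would presumably avoid.

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.charset.Charset;

final class SubstringWrite {
    // Hypothetical helper: encodes s[offset, offset + length) with cs and writes
    // the bytes to dst at dstOffset, without a null terminator.
    // Returns the number of bytes written.
    static int writeSubstring(String s, int offset, int length,
                              Charset cs, MemorySegment dst, long dstOffset) {
        byte[] bytes = s.substring(offset, offset + length).getBytes(cs);
        // Existing overload: copy(Object srcArray, int srcIndex, MemorySegment dst,
        //                         ValueLayout dstLayout, long dstOffset, int elementCount)
        MemorySegment.copy(bytes, 0, dst, ValueLayout.JAVA_BYTE, dstOffset, bytes.length);
        return bytes.length;
    }
}
```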
For reading a String, I think your proposal to augment
MemorySegment::getString looks good, but I think we should leave
setString alone in favor of adding a MemorySegment::copy overload
(there's the asymmetry I was talking about before).
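
As a point of reference, reading a string of known byte length can be approximated with existing APIs roughly as follows. The helper name readString is hypothetical, and this version copies through an intermediate byte[], which a length-aware getString could potentially avoid.

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.charset.Charset;

final class KnownLengthRead {
    // Hypothetical helper: decodes byteLength bytes starting at offset using cs.
    // No null terminator is expected or searched for.
    static String readString(MemorySegment src, long offset, int byteLength, Charset cs) {
        byte[] bytes = src.asSlice(offset, byteLength).toArray(ValueLayout.JAVA_BYTE);
        return new String(bytes, cs);
    }
}
```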
For completeness, I think we should also just add the
MemorySegment::ofString(String, Charset) overload which tries to return
a read-only view of the string, to match the existing ofArray methods.
This seems generally just a good primitive to have.
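
To illustrate the shape of such a primitive, the closest thing available today is something like the sketch below. The helper name readOnlyCopyOf is hypothetical, and unlike the proposed ofString it always copies the encoded bytes rather than viewing the string's storage.

```java
import java.lang.foreign.MemorySegment;
import java.nio.charset.Charset;

final class StringSegments {
    // Hypothetical helper: returns a read-only heap segment holding s encoded with cs.
    // This copies via getBytes; it is not a view of the String's internal bytes.
    static MemorySegment readOnlyCopyOf(String s, Charset cs) {
        return MemorySegment.ofArray(s.getBytes(cs)).asReadOnly();
    }
}
```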
Jorn
On 4-11-2025 14:13, Liam Miller-Cushon wrote:
> Hi Jorn,
>
> Thanks for the discussion and input!
>
> On Mon, Nov 3, 2025 at 7:47 PM Jorn Vernee <jorn.vernee at oracle.com> wrote:
>
> > About the copy elision, this is something that we have seen being
> > visible in benchmarks [1]. Were there any benchmarks in which you've
> > seen this difference show up as well, or is this more of a
> > theoretical benefit? It would be good to understand how important
> > performance is in all of this, or if it's more about API usability.
> > Also, I should note that in the case of setString, we only avoid
> > the extra byte[] when the source and target encodings are
> > compatible.
>
>
> We had done some earlier benchmarking; I think that was part of the
> discussion that led to JDK-8362893:
> https://mail.openjdk.org/pipermail/core-libs-dev/2025-July/149189.html.
> I also made some draft changes to
> https://github.com/openjdk/jdk/pull/28043
> to add a prototype of setStringWithoutNullTerminator and did some more
> microbenchmarking. I updated the PR description with some results.
>
> For use-cases like the protobuf one, the interest is more in getting
> the best possible performance, rather than API usability.
>
> > It's possible to call `getBytes` on a string and copy the
> > resulting byte[] into the memory segment as well, but I suppose
> > you want to avoid that because of the extra copy of the byte[]
> > (although I think perhaps C2 can elide the extra object in that
> > case)? Have you tried looping over the string and manually copying
> > each character using charAt?
>
>
> I'm not seeing competitive performance with the explicit call to
> getBytes in the microbenchmarks, so I wonder if it is perhaps not
> eliding the copy, although I haven't verified in the assembly.
>
> I wouldn't have expected looping with charAt to be competitive with
> the fast paths in StringSupport where the bytes are compatible, or is
> that not right?
>
> > I think I'm slowly coming to the conclusion that we should just treat
> > Strings as another source and destination format for data, with the
> > caveat that we can not modify a String in place, so any read
> > operations will have to create a new String instance instead. This
> > creates some asymmetry with the existing MemorySegment::copy methods.
> > I think because of that restriction, we might have to accept some
> > asymmetry between the read and write APIs for strings as well.
>
>
> Do you have a feeling for how that approach might best be exposed in
> the API? Do you think it might look like more variations of
> getString/setString in MemorySegment? Or that there might be a missing
> API primitive that could encapsulate those String sources and
> destinations? Or something else?
More information about the panama-dev mailing list