Additional mismatch overloads for memory segments
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Thu Jun 16 14:00:52 UTC 2022
Hi,
I've spent some time classifying the various instance methods in
MemorySegment, to see if some pattern emerged:
(A) streaming ops:
* elements(MemoryLayout)
* spliterator(MemoryLayout)
(B) slicing ops
* asSlice(long)
* asSlice(long, long)
(C) containment ops
* asOverlappingSlice(MemorySegment)
* segmentOffset(MemorySegment)
(D) bulk ops
* fill(byte)
* copyFrom(MemorySegment)
* mismatch(MemorySegment)
(E) mapped segment ops
* isLoaded()
* load()
* unload()
* force()
(F) transform ops
* asByteBuffer()
* toArray(ValueLayout.OfT)
(G) simple dereference ops
* get(ValueLayout.OfT, long)
* set(ValueLayout.OfT, long, T)
(H) indexed dereference ops
getAtIndex(ValueLayout.OfT, long)
setAtIndex(ValueLayout.OfT, long, T)
(I) string dereferene ops
getUtf8String(long)
setUtf8String(long, String)
--------------------------------------
Now, let's see if, case by case, we can see whether some additional
overload would be desirable or not. As Jorn pointed out, it's not just
about having an offset overload, in some cases offet+size might be better.
First, I think (B), (C) and (G) are ok as is. (B) and (G) already
provides offset/size accepting variants, and (C) is more about querying
whether two segments are related, so adding offsets here seems odd.
For (A), (D), (E), (F) you could argue that having an overload that
accepts a size and an offset (or two telescopic overloads, one with only
offset, another with offset _and_ size), might make sense. Zooming in a
bit more, I think some further distinctions are in order.
First, copy already provides many static methods in the style of
System.arrayCopy. So I think it's fair to say that anybody who wants
more control, should use the static method instead. If we add a static
mismatch, I think the same argument might apply there.
Similarly, toArray is a nice convenience method, but it's really just a
shortcut for a bulk copy - again, the static methods are enough for
users who want more control.
Which leaves us with:
* elements(MemoryLayout) // A
* spliterator(MemoryLayout) // A
* fill(byte) // D
* isLoaded() // E
* load() // E
* unload() // E
* force() // E
* asByteBuffer // F
Adding one or two overloads for both is doable, but feels like overkill
API-wise. I also note that MappedByteBuffer does not provide overloads
for isLoaded/load/unload. Only force() has an overload, that was added
in 13 [1], and, I believe, because ByteBuffer::force relies on the
buffer mutable position (see last section in [2]), thus making it
impossible to flush independent mapped regions in parallel using the
same buffer. MemorySegments do not have that issue, given that they are
immutable - so one can create 20 disjoint slices of a mapped segment and
force() them in parallel. And, some of these methods allocate several
objects anyway (esp. elements and spliterator), so I don't see a lot of
reasons as to why slicing shouldn't just work there.
So I believe we could probably just add the following (although I
realize this is an arbitrary cut):
* fill(byte, long, long) // D
As for (H), I think the main problem here is that dereference methods in
(H) are not built on top of those in (G). I think the following form:
getAtIndex(ValueLayout.OfT, long, long)
setAtIndex(ValueLayout.OfT, long, long, T)
Would be more useful, where the first parameter is an offset into the
segment, and the second is a logical index. Seems strictly more general.
Actually note that these new methods are general enough that, in
principle, we could only have these, and be done (e.g. just one set of
accessors instead of two). Or, instead of having two sets, for offset
and offset+index, we could have two sets for index and offset+index -
which seems more consistent with the rest of the instance methods (where
the offset variant is the "advanced" one).
Finally, I think (I) the the most complex of all. Because, if we add
something like:
getUtf8String(long, long)
At least to me it's not entirely clear what the "length" parameter does.
Does it means (like in all the other methods discussed so far) that it's
like doing asSlice(offset, length) and then getUtf8String? Or does it
mean (as Jorn seemed to imply) that the length parameter refers to the
string length? Also, what if user specifies a string length, but the
memory region contains a terminator char _before_ the user-specified
length? Honestly, it is not clear to me that what we do in (I) is
clearly lacking, or could be improved in a straightforward way. If user
just wants to view a slice of memory as a string, and ignore
terminators, they can just use bulk copy, to copy the contents of the
segment into a byte array, and then create the string from there (which
is what our impl does anyway) - it's all bytes at the end of the day -
IMHO the interesting thing about the string dereference method *is* the
terminator handling. W/o that it's just a copy.
So, summarizing, to make the API more offset/length friendly we could:
1. Add a static mismatch method(s) (similar to what we did for copy)
2. add the following overloads:
* fill(byte, long, long) // D
3. Tweak the indexed accessors - e.g. rename them to just "get" and get
both an offset (in bytes) parameter as well as an index (in elements)
parameter; then decide what to do with the offset-only dereference variant:
3a) keep as is
3b) drop
3c) tweak it so that parameter is an index, not an offset
In this plan, the most controversial part is (2) - in the sense that it
involves an arbitrary choice based on the uses we've seen on the
ByteBuffer API, as well as some other considerations (e.g. how much do
the methods end up allocating anyway). As for (3), I think the more
general offset+index dereference methods, give us an opportunity to
think if the set of dereference we provide is the one that makes the
most sense.
Thoughts?
Maurizio
[1] - https://bugs.openjdk.org/browse/JDK-8221696
[2] - https://openjdk.org/jeps/352
On 15/06/2022 14:46, Jorn Vernee wrote:
> Just wanted to mention that it's not just offsets, but also lengths in
> some cases.
>
> For instance, fill might need a length and offset overload. And I've
> also felt in the past that getUtf8String, which currently only takes
> an offset, should be split into 2 overloads, one that takes an offset
> and length, and one that takes no arguments (which would internally
> look for a null terminator).
>
> Jorn
>
> On 14/06/2022 14:40, Maurizio Cimadamore wrote:
>>
>> On 14/06/2022 13:03, Alexander Biryukov wrote:
>>> That makes sense, and I think your description pretty much sums it up.
>>> Maybe it can be transformed in a checklist:
>>>
>>> 1. Instance methods are preferred
>>> 2. Binary operators are implemented as static and instance methods,
>>> both with offsets
>>> 3. Some instance methods have offset-related overloads, where it
>>> makes sense
>>> 4. Pure static methods are not welcome, except for some widespread
>>> patterns, like factory methods (*)
>>
>> I would split (3) as follows:
>>
>> 3a. Instance methods have an additional offset parameter
>> 3b. Where it makes sense, offset-free overloads might be provided
>> (e.g. call method in (3a) with offset=0)
>>
>> This feels more principled, as (3a) is a primitive and (3b) is sugare
>> than can easily be derived from that.
>>
>> Re. your point of using analytics, again, this is a relatively new
>> API, there's not much code using it out there, so the risk of
>> "overfitting" is real.
>>
>> I'm slightly on the fence w.r.t. having both static and instance
>> variants for binary methods - that said, as long as we can clearly
>> explain what's the primitive (the static method), I think the API can
>> still be explain in a relatively straightforward fashion.
>>
>> Maurizio
>>
>>
>>
More information about the panama-dev
mailing list