Additional mismatch overloads for memory segments

Fri Jun 17 13:02:56 UTC 2022

On 17/06/2022 13:49, Alexander Biryukov wrote:
>
>     1. Add a static mismatch method(s) (similar to what we did for copy)
>     2. add the following overloads:
>     * fill(byte, long, long) // D
>     3. Tweak the indexed accessors - e.g. rename them to just "get"
>     and get
>     both an offset (in bytes) parameter as well as an index (in elements)
>     parameter; then decide what to do with the offset-only dereference
>     variant:
>     3a) keep as is
>     3b) drop
>     3c) tweak it so that parameter is an index, not an offset 
>
>
> 1. Agree, basic functionality like copy/mismatch must expose an offset
> 2. Can't see harm here, so I'd add an overload
> 3. I think the current get/set methods with offsets are fine as is (3a).
> I'm not sure if merging them together with setAtIndex is a good idea,
> something like set(JAVA_INT, 0, 0, 123) looks confusing.
> I expect that "index" means that it starts with 0.
> Otherwise I'd just use set/get and do offset + i * 2/4/8, it's much 
> simpler.
Point taken. Note that option 3c though would give you the same e.g.

set(JAVA_INT, 0, 123) // sets 123 at index 0

and, also have an advanced option:

set(JAVA_INT, offset, 0, 123) // sets 123 at index 0, starting at offset

Doing I *2/4/8 is, of course possible, but looks redundant, given that 
you are also passing a layout.

That said, I don't feel particularly strongly about any of the options 
here - and I agree that the API points provided now are expressive 
enough (assuming one wants to do offset computation by hand).

> By the way, about getUtf8String(long, long).There are multiple ways to 
> put UTF-8 strings in byte[],
> like null-termination or write length as the first 
> byte/int/varint/stored somewhere else (schema).
> Current implementation is a null-terminated string, which is fine for 
> C interop.
> Perhaps naming could be better to reflect it.

True. Best we could do is probably dropping a C somewhere in there. 
(e.g. getUtf8CString), as I don't think "getUtf8NullTerminatedString" 
would fly :-)

Maurizio

>
>
>
> чт, 16 июн. 2022 г. в 17:01, Maurizio Cimadamore 
> <maurizio.cimadamore at oracle.com>:
>
>     Hi,
>     I've spent some time classifying the various instance methods in
>     MemorySegment, to see if some pattern emerged:
>
>     (A) streaming ops:
>
>     * elements(MemoryLayout)
>     * spliterator(MemoryLayout)
>
>     (B) slicing ops
>
>     * asSlice(long)
>     * asSlice(long, long)
>
>     (C) containment ops
>
>     * asOverlappingSlice(MemorySegment)
>     * segmentOffset(MemorySegment)
>
>     (D) bulk ops
>
>     * fill(byte)
>     * copyFrom(MemorySegment)
>     * mismatch(MemorySegment)
>
>     (E) mapped segment ops
>
>     * isLoaded()
>     * load()
>     * unload()
>     * force()
>
>     (F) transform ops
>
>     * asByteBuffer()
>     * toArray(ValueLayout.OfT)
>
>     (G) simple dereference ops
>
>     * get(ValueLayout.OfT, long)
>     * set(ValueLayout.OfT, long, T)
>
>     (H) indexed dereference ops
>
>     getAtIndex(ValueLayout.OfT, long)
>     setAtIndex(ValueLayout.OfT, long, T)
>
>     (I) string dereferene ops
>
>     getUtf8String(long)
>     setUtf8String(long, String)
>
>     --------------------------------------
>
>     Now, let's see if, case by case, we can see whether some additional
>     overload would be desirable or not. As Jorn pointed out, it's not
>     just
>     about having an offset overload, in some cases offet+size might be
>     better.
>
>     First, I think (B), (C) and (G) are ok as is. (B) and (G) already
>     provides offset/size accepting variants, and (C) is more about
>     querying
>     whether two segments are related, so adding offsets here seems odd.
>
>     For (A), (D), (E), (F) you could argue that having an overload that
>     accepts a size and an offset (or two telescopic overloads, one
>     with only
>     offset, another with offset _and_ size), might make sense. Zooming
>     in a
>     bit more, I think some further distinctions are in order.
>
>     First, copy already provides many static methods in the style of
>     System.arrayCopy. So I think it's fair to say that anybody who wants
>     more control, should use the static method instead. If we add a
>     static
>     mismatch, I think the same argument might apply there.
>
>     Similarly, toArray is a nice convenience method, but it's really
>     just a
>     shortcut for a bulk copy - again, the static methods are enough for
>     users who want more control.
>
>     Which leaves us with:
>
>     * elements(MemoryLayout) // A
>     * spliterator(MemoryLayout) // A
>     * fill(byte) // D
>     * isLoaded() // E
>     * load() // E
>     * unload() // E
>     * force() // E
>     * asByteBuffer // F
>
>     Adding one or two overloads for both is doable, but feels like
>     overkill
>     API-wise. I also note that MappedByteBuffer does not provide
>     overloads
>     for isLoaded/load/unload. Only force() has an overload, that was
>     added
>     in 13 [1], and, I believe, because ByteBuffer::force relies on the
>     buffer mutable position (see last section in [2]), thus making it
>     impossible to flush independent mapped regions in parallel using the
>     same buffer. MemorySegments do not have that issue, given that
>     they are
>     immutable - so one can create 20 disjoint slices of a mapped
>     segment and
>     force() them in parallel.  And, some of these methods allocate
>     several
>     objects anyway (esp. elements and spliterator), so I don't see a
>     lot of
>     reasons as to why slicing shouldn't just work there.
>
>     So I believe we could probably just add the following (although I
>     realize this is an arbitrary cut):
>
>     * fill(byte, long, long) // D
>
>     As for (H), I think the main problem here is that dereference
>     methods in
>     (H) are not built on top of those in (G). I think the following form:
>
>     getAtIndex(ValueLayout.OfT, long, long)
>     setAtIndex(ValueLayout.OfT, long, long, T)
>
>     Would be more useful, where the first parameter is an offset into the
>     segment, and the second is a logical index. Seems strictly more
>     general.
>     Actually note that these new methods are general enough that, in
>     principle, we could only have these, and be done (e.g. just one
>     set of
>     accessors instead of two). Or, instead of having two sets, for offset
>     and offset+index, we could have two sets for index and offset+index -
>     which seems more consistent with the rest of the instance methods
>     (where
>     the offset variant is the "advanced" one).
>
>     Finally, I think (I) the the most complex of all. Because, if we add
>     something like:
>
>     getUtf8String(long, long)
>
>     At least to me it's not entirely clear what the "length" parameter
>     does.
>     Does it means (like in all the other methods discussed so far)
>     that it's
>     like doing asSlice(offset, length) and then getUtf8String? Or does it
>     mean (as Jorn seemed to imply) that the length parameter refers to
>     the
>     string length? Also, what if user specifies a string length, but the
>     memory region contains a terminator char _before_ the user-specified
>     length? Honestly, it is not clear to me that what we do in (I) is
>     clearly lacking, or could be improved in a straightforward way. If
>     user
>     just wants to view a slice of memory as a string, and ignore
>     terminators, they can just use bulk copy, to copy the contents of the
>     segment into a byte array, and then create the string from there
>     (which
>     is what our impl does anyway) - it's all bytes at the end of the
>     day -
>     IMHO the interesting thing about the string dereference method
>     *is* the
>     terminator handling. W/o that it's just a copy.
>
>     So, summarizing, to make the API more offset/length friendly we could:
>
>     1. Add a static mismatch method(s) (similar to what we did for copy)
>     2. add the following overloads:
>     * fill(byte, long, long) // D
>     3. Tweak the indexed accessors - e.g. rename them to just "get"
>     and get
>     both an offset (in bytes) parameter as well as an index (in elements)
>     parameter; then decide what to do with the offset-only dereference
>     variant:
>     3a) keep as is
>     3b) drop
>     3c) tweak it so that parameter is an index, not an offset
>
>     In this plan, the most controversial part is (2) - in the sense
>     that it
>     involves an arbitrary choice based on the uses we've seen on the
>     ByteBuffer API, as well as some other considerations (e.g. how
>     much do
>     the methods end up allocating anyway). As for (3), I think the more
>     general offset+index dereference methods, give us an opportunity to
>     think if the set of dereference we provide is the one that makes the
>     most sense.
>
>     Thoughts?
>
>     Maurizio
>
>     [1] - https://bugs.openjdk.org/browse/JDK-8221696
>     [2] - https://openjdk.org/jeps/352
>
>     On 15/06/2022 14:46, Jorn Vernee wrote:
>     > Just wanted to mention that it's not just offsets, but also
>     lengths in
>     > some cases.
>     >
>     > For instance, fill might need a length and offset overload. And
>     I've
>     > also felt in the past that getUtf8String, which currently only
>     takes
>     > an offset, should be split into 2 overloads, one that takes an
>     offset
>     > and length, and one that takes no arguments (which would internally
>     > look for a null terminator).
>     >
>     > Jorn
>     >
>     > On 14/06/2022 14:40, Maurizio Cimadamore wrote:
>     >>
>     >> On 14/06/2022 13:03, Alexander Biryukov wrote:
>     >>> That makes sense, and I think your description pretty much
>     sums it up.
>     >>> Maybe it can be transformed in a checklist:
>     >>>
>     >>> 1. Instance methods are preferred
>     >>> 2. Binary operators are implemented as static and instance
>     methods,
>     >>> both with offsets
>     >>> 3. Some instance methods have offset-related overloads, where it
>     >>> makes sense
>     >>> 4. Pure static methods are not welcome, except for some
>     widespread
>     >>> patterns, like factory methods (*)
>     >>
>     >> I would split (3) as follows:
>     >>
>     >> 3a. Instance methods have an additional offset parameter
>     >> 3b. Where it makes sense, offset-free overloads might be provided
>     >> (e.g. call method in (3a) with offset=0)
>     >>
>     >> This feels more principled, as (3a) is a primitive and (3b) is
>     sugare
>     >> than can easily be derived from that.
>     >>
>     >> Re. your point of using analytics, again, this is a relatively new
>     >> API, there's not much code using it out there, so the risk of
>     >> "overfitting" is real.
>     >>
>     >> I'm slightly on the fence w.r.t. having both static and instance
>     >> variants for binary methods - that said, as long as we can clearly
>     >> explain what's the primitive (the static method), I think the
>     API can
>     >> still be explain in a relatively straightforward fashion.
>     >>
>     >> Maurizio
>     >>
>     >>
>     >>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/panama-dev/attachments/20220617/65250eb7/attachment-0001.htm>