a Saturday puzzler: streaming over variable-length data
John Rose
john.r.rose at oracle.com
Sun Dec 12 00:58:01 UTC 2021
Apologies: That was meant for panama-dev. Please disregard.
On 11 Dec 2021, at 16:42, John Rose wrote:
> Here’s a puzzler (actually a family of puzzlers) that occurred to
> me.
>
> Suppose I, as a Java and Panama programmer, need to communicate arrays
> of strings with native code.
>
> To be specific, let’s talk only about null-terminated UTF8 strings
> (on the native side).
>
> (To be very very general, all of the rest of this discusses uses UTF8
> strings as a “for instance” example, and in fact any kind of
> self-delimiting variable length data would be about as interesting and
> informative. For example, var-ints in the classic form of “bit
> seven means read more bytes after this one”. Moreover, if the data
> is self-synchronizing, as with strings and var-ints, you can write a
> spliterator over it for parallel stream processing.)
>
> So, how do I make a stream over a memory segment of type `char*` that
> consists of a series of zero-terminated UTF8 strings, back to back?
>
> The answers differ a little depending on loop termination:
>
> 0. start with a predefined count of the strings to decode, or
> 1. keep going right up to the upper bound of the segment, or
> 2. stop when an empty string (a pair of null bytes) is found.
>
> Bonus points for avoiding double scanning of the strings. This means
> `MS::getUtf8String` is not necessarily the best tool. (But what is,
> then?)
>
> Second puzzler: How to do the whole thing backwards? That is,
> convert a stream of Java strings back into a memory segment containing
> the UTF8 string bodies concatenated with trailing nulls.
>
> The result is disposed of one of these ways:
>
> A. Allocate a fresh native MS in a given session.
> B. Allocate a fresh heap MS in the global session.
> C. Store the data into a given MS at a given offset, returning the new
> offset, and indicating if there is more that didn’t fit. (And
> allowing restart at that offset, for recovery code.)
>
> The number of converted strings is also reported, corresponding to the
> above options:
>
> 0. return the count of strings encoded and do nothing more, or
> 1. return a segment whose upper bound is after the last string’s
> null, or
> 2. encode an extra empty string (a pair of null bytes) at the end.
>
> And two more puzzlers pop into mind, for an argv/envp array, of type
> `char**`. Here, I suppose that the stream that reads the things will
> walk over a pointer to the `char**` array and read each `char*` item,
> decoding as it goes.
>
> Again the number of items to decode can be determined
>
> 0. with a predefined array-length for the strings to decode, or
> 1. read to the upper bound of the segment holding the array, or
> 2. stop when a `NULL` array element is found.
>
> For the reverse encoding there are a bunch of options:
>
> A. Allocate a *single* fresh native MS in a given session.
> B. Allocate a *pair* of MS’s, with the string bodies in a native MS
> in a given session.
> C. Store the data into one or two given MS’s at one or two a given
> offsets, etc.
>
> And the count can similarly be represented:
>
> 0. return the count of strings encoded and do nothing more, or
> 1. return a segment whose upper bound is after the last array element,
> or
> 2. encode an extra empty `NULL` element at the end of the output array
>
> The options C above are tricky but some applications may choose to
> embrace the complexity in order to reduce end-to-end copying. That in
> turn suggests that maybe someone should create a library to manage
> blocks of working storage that accumulate data structures for eventual
> posting to other native code.
>
> A very highly developed framework might also emphasize pointer-free
> and/or position-independent data structures. The various structured
> packet frameworks do this, one way or another. My favorite is
> https://capnproto.org !
More information about the valhalla-dev
mailing list