a Saturday puzzler: streaming over variable-length data

John Rose john.r.rose at oracle.com
Sun Dec 12 00:58:01 UTC 2021


Apologies:  That was meant for panama-dev.  Please disregard.

On 11 Dec 2021, at 16:42, John Rose wrote:

> Here’s a puzzler (actually a family of puzzlers) that occurred to 
> me.
>
> Suppose I, as a Java and Panama programmer, need to communicate arrays 
> of strings with native code.
>
> To be specific, let’s talk only about null-terminated UTF8 strings 
> (on the native side).
>
> (To be very general, the rest of this discussion uses UTF8 
> strings as a “for instance” example, and in fact any kind of 
> self-delimiting variable length data would be about as interesting and 
> informative.  For example, var-ints in the classic form of “bit 
> seven means read more bytes after this one”.  Moreover, if the data 
> is self-synchronizing, as with strings and var-ints, you can write a 
> spliterator over it for parallel stream processing.)
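
As a rough illustration of the var-int case mentioned above, here is a minimal sketch. It assumes the little-endian base-128 layout (low seven bits first, bit seven set meaning "more bytes follow"), which is only one of the classic forms. It uses a plain `ByteBuffer` as a stand-in for a memory segment, and the names `VarIntReader` and `readAll` are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

final class VarIntReader {
    /**
     * Reads back-to-back var-ints from buf[position, limit) without
     * moving the buffer's position. Assumes little-endian base-128:
     * each byte carries 7 payload bits, and bit seven means "read more".
     */
    static List<Long> readAll(ByteBuffer buf) {
        List<Long> out = new ArrayList<>();
        int pos = buf.position();
        int end = buf.limit();
        while (pos < end) {
            long value = 0;
            int shift = 0;
            boolean complete = false;
            while (pos < end && shift < 64) {
                byte b = buf.get(pos++);
                value |= (long) (b & 0x7F) << shift;
                shift += 7;
                if ((b & 0x80) == 0) {   // continuation bit clear: last byte
                    complete = true;
                    break;
                }
            }
            if (!complete) break;        // truncated or overlong tail: stop (a policy choice)
            out.add(value);
        }
        return out;
    }
}
```

The self-delimiting property is what makes the single forward scan possible: each element announces its own end, so no separate length table is needed.
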
>
> So, how do I make a stream over a memory segment of type `char*` that 
> consists of a series of zero-terminated UTF8 strings, back to back?
>
> The answers differ a little depending on loop termination:
>
> 0. start with a predefined count of the strings to decode, or
> 1. keep going right up to the upper bound of the segment, or
> 2. stop when an empty string (a pair of null bytes) is found.
>
> Bonus points for avoiding double scanning of the strings.  This means 
> `MS::getUtf8String` is not necessarily the best tool.  (But what is, 
> then?)
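
One possible shape for the decoding side, as a sketch only. It works on a `ByteBuffer` rather than a memory segment (a segment can usually be viewed as one via `MemorySegment.asByteBuffer()`, but that bridging step is an assumption here and is not shown). The names `CStringStream`, `strings`, `maxCount` and `stopAtEmpty` are hypothetical. The three termination rules map to the parameters: a predefined count is `maxCount`, reading to the upper bound is the buffer's limit, and stopping at an empty string is `stopAtEmpty`. This does not earn the bonus points: it scans once for the terminator and then decodes the slice, so each string body is still touched twice.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

final class CStringStream {
    /**
     * Lazily streams NUL-terminated UTF-8 strings laid out back to back
     * in buf[position, limit). Stops after maxCount strings, at the
     * limit, or (if stopAtEmpty) at the first empty string.
     */
    static Stream<String> strings(ByteBuffer buf, long maxCount, boolean stopAtEmpty) {
        ByteBuffer view = buf.duplicate();   // leave the caller's position alone
        final int end = view.limit();
        Spliterator<String> sp = new Spliterators.AbstractSpliterator<String>(
                Long.MAX_VALUE, Spliterator.ORDERED | Spliterator.NONNULL) {
            int pos = view.position();
            long remaining = maxCount;

            @Override
            public boolean tryAdvance(Consumer<? super String> action) {
                if (remaining <= 0 || pos >= end) return false;
                int nul = pos;
                while (nul < end && view.get(nul) != 0) nul++;   // find the terminator
                if (nul == end) return false;                    // unterminated tail: ignore it
                if (nul == pos && stopAtEmpty) return false;     // empty string ends the list
                ByteBuffer body = view.duplicate();
                body.limit(nul);
                body.position(pos);
                action.accept(StandardCharsets.UTF_8.decode(body).toString());
                pos = nul + 1;                                   // step past the NUL
                remaining--;
                return true;
            }
        };
        return StreamSupport.stream(sp, false);
    }
}
```

For example, `CStringStream.strings(buf, Long.MAX_VALUE, true)` would read until the first empty string, while `CStringStream.strings(buf, 3, false)` would read at most three strings.
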
>
> Second puzzler:  How to do the whole thing backwards?  That is, 
> convert a stream of Java strings back into a memory segment containing 
> the UTF8 string bodies concatenated with trailing nulls.
>
> The result is disposed of one of these ways:
>
> A. Allocate a fresh native MS in a given session.
> B. Allocate a fresh heap MS in the global session.
> C. Store the data into a given MS at a given offset, returning the new 
> offset, and indicating if there is more that didn’t fit.  (And 
> allowing restart at that offset, for recovery code.)
>
> The number of converted strings is also reported, corresponding to the 
> above options:
>
> 0. return the count of strings encoded and do nothing more, or
> 1. return a segment whose upper bound is after the last string’s 
> null, or
> 2. encode an extra empty string (a pair of null bytes) at the end.
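
A sketch of the reverse direction, again over `ByteBuffer` as a stand-in for a memory segment. All names here (`CStringEncoder`, `Result`, `encodeInto`, `encodeFresh`) are hypothetical. `encodeInto` corresponds roughly to option C: it writes at a given offset and reports how many strings fit and where to resume. `encodeFresh` corresponds roughly to option A, with a direct buffer assumed as the "native" allocation, and can append the extra empty string. It assumes the Java strings contain no embedded NUL characters.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class CStringEncoder {
    /** How many strings were written, and the offset just past the last NUL. */
    static final class Result {
        final int count;
        final int nextOffset;
        Result(int count, int nextOffset) { this.count = count; this.nextOffset = nextOffset; }
    }

    /** Writes strings into dst starting at offset; stops at the first one that does not fit. */
    static Result encodeInto(List<String> strings, ByteBuffer dst, int offset) {
        int pos = offset;
        int count = 0;
        for (String s : strings) {
            byte[] body = s.getBytes(StandardCharsets.UTF_8);
            if (body.length + 1 > dst.limit() - pos) break;   // body plus NUL will not fit
            for (byte b : body) dst.put(pos++, b);            // absolute puts: position untouched
            dst.put(pos++, (byte) 0);
            count++;
        }
        return new Result(count, pos);
    }

    /** Allocates a fresh direct buffer sized exactly for the encoded strings. */
    static ByteBuffer encodeFresh(List<String> strings, boolean addTerminator) {
        List<byte[]> bodies = new ArrayList<>();
        int total = addTerminator ? 1 : 0;
        for (String s : strings) {
            byte[] body = s.getBytes(StandardCharsets.UTF_8);  // encode once, keep the bytes
            bodies.add(body);
            total += body.length + 1;
        }
        ByteBuffer dst = ByteBuffer.allocateDirect(total);
        for (byte[] body : bodies) {
            dst.put(body);
            dst.put((byte) 0);
        }
        if (addTerminator) dst.put((byte) 0);                  // the extra empty string
        dst.flip();
        return dst;
    }
}
```

When `encodeInto` reports fewer strings than were supplied, recovery code could retry with `strings.subList(result.count, strings.size())` and a new destination.
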
>
> And two more puzzlers pop into mind, for an argv/envp array, of type 
> `char**`.  Here, I suppose that the stream that reads the things will 
> walk over a pointer to the `char**` array and read each `char*` item, 
> decoding as it goes.
>
> Again the number of items to decode can be determined
>
> 0. with a predefined array-length for the strings to decode, or
> 1. read to the upper bound of the segment holding the array, or
> 2. stop when a `NULL` array element is found.
>
> For the reverse encoding there are a bunch of options:
>
> A. Allocate a *single* fresh native MS in a given session.
> B. Allocate a *pair* of MS’s, with the string bodies in a native MS 
> in a given session.
> C. Store the data into one or two given MS’s at one or two given 
> offsets, etc.
>
> And the count can similarly be represented:
>
> 0. return the count of strings encoded and do nothing more, or
> 1. return a segment whose upper bound is after the last array element, 
> or
> 2. encode an extra empty `NULL` element at the end of the output array.
>
> The options C above are tricky but some applications may choose to 
> embrace the complexity in order to reduce end-to-end copying.  That in 
> turn suggests that maybe someone should create a library to manage 
> blocks of working storage that accumulate data structures for eventual 
> posting to other native code.
>
> A very highly developed framework might also emphasize pointer-free 
> and/or position-independent data structures.  The various structured 
> packet frameworks do this, one way or another.  My favorite is 
> https://capnproto.org !


More information about the valhalla-dev mailing list