a Saturday puzzler: streaming over variable-length data

Sun Dec 12 01:00:30 UTC 2021

Here’s a puzzler (actually a family of puzzlers) that occurred to me.

Suppose I, as a Java and Panama programmer, need to communicate arrays 
of strings with native code.

To be specific, let’s talk only about null-terminated UTF8 strings (on 
the native side).

How do I make streams that encode and decode these?  They should be 
concise and efficient.  The streams should be parallel-capable, if 
possible.  Maybe there’s a library of utilities I’m missing…?

(To be very very general, all of the rest of this discussion uses UTF8 
strings as a “for instance” example, and in fact any kind of 
self-delimiting variable length data would be about as interesting and 
informative.  For example, var-ints in the classic form of “bit seven 
means read more bytes after this one”.  Moreover, if the data is 
self-synchronizing, as with strings and var-ints, you can write a 
spliterator over it for parallel stream processing.)

So, how do I make a stream over a memory segment of type `char*` that 
consists of a series of zero-terminated UTF8 strings, back to back?  The 
stream should produce a series of `String` objects.

The answers differ a little depending on loop termination:

0. start with a predefined count of the strings to decode, or
1. keep going right up to the upper bound of the segment, or
2. stop when an empty string (a pair of null bytes) is found.

Bonus points for avoiding double scanning of the strings.  This means 
`MS::getUtf8String` is not necessarily the best tool.  (But what is, 
then?)

Second puzzler:  How to do the whole thing backwards?  That is, convert 
a stream of Java strings back into a memory segment containing the UTF8 
string bodies concatenated with trailing nulls.

The result is disposed of one of these ways:

A. Allocate a fresh native MS in a given session.
B. Allocate a fresh heap MS in the global session.
C. Store the data into a given MS at a given offset, returning the new 
offset, and indicating if there is more that didn’t fit.  (And 
allowing restart at that offset, for recovery code.)

The number of converted strings is also reported, corresponding to the 
above options:

0. return the count of strings encoded and do nothing more, or
1. return a segment whose upper bound is after the last string’s null, 
or
2. encode an extra empty string (a pair of null bytes) at the end.

And two more puzzlers pop into mind, for an argv/envp array, of type 
`char**`.  Here, I suppose that the stream that reads the things will 
walk over a pointer to the `char**` array and read each `char*` item, 
decoding as it goes.

Again the number of items to decode can be determined

0. with a predefined array-length for the strings to decode, or
1. read to the upper bound of the segment holding the array, or
2. stop when a `NULL` array element is found.

For the reverse encoding there are a bunch of options:

A. Allocate a *single* fresh native MS in a given session.
B. Allocate a *pair* of MS’s, with the string bodies in a native MS in 
a given session.
C. Store the data into one or two given MS’s at one or two a given 
offsets, etc.

And the count can similarly be represented:

0. return the count of strings encoded and do nothing more, or
1. return a segment whose upper bound is after the last array element, 
or
2. encode an extra empty `NULL` element at the end of the output array

The options C above are tricky but some applications may choose to 
embrace the complexity in order to reduce end-to-end copying.  That in 
turn suggests that maybe someone should create a library to manage 
blocks of working storage that accumulate data structures for eventual 
posting to other native code.

A very highly developed framework might also emphasize pointer-free 
and/or position-independent data structures.  The various structured 
packet frameworks do this, one way or another.  My favorite is 
https://capnproto.org !