a Saturday puzzler: streaming over variable-length data
John Rose
john.r.rose at oracle.com
Sun Dec 12 01:00:30 UTC 2021
Here’s a puzzler (actually a family of puzzlers) that occurred to me.
Suppose I, as a Java and Panama programmer, need to communicate arrays
of strings with native code.
To be specific, let’s talk only about null-terminated UTF8 strings (on
the native side).
How do I make streams that encode and decode these? They should be
concise and efficient. The streams should be parallel-capable, if
possible. Maybe there’s a library of utilities I’m missing…?
(To be very very general, all of the rest of this discussion uses UTF8
strings as a “for instance” example, and in fact any kind of
self-delimiting variable length data would be about as interesting and
informative. For example, var-ints in the classic form of “bit seven
means read more bytes after this one”. Moreover, if the data is
self-synchronizing, as with strings and var-ints, you can write a
spliterator over it for parallel stream processing.)
So, how do I make a stream over a memory segment of type `char*` that
consists of a series of zero-terminated UTF8 strings, back to back? The
stream should produce a series of `String` objects.
The answers differ a little depending on loop termination:
0. start with a predefined count of the strings to decode, or
1. keep going right up to the upper bound of the segment, or
2. stop when an empty string (a pair of null bytes) is found.
Bonus points for avoiding double scanning of the strings. This means
`MS::getUtf8String` is not necessarily the best tool. (But what is,
then?)
Second puzzler: How to do the whole thing backwards? That is, convert
a stream of Java strings back into a memory segment containing the UTF8
string bodies concatenated with trailing nulls.
The result is disposed of one of these ways:
A. Allocate a fresh native MS in a given session.
B. Allocate a fresh heap MS in the global session.
C. Store the data into a given MS at a given offset, returning the new
offset, and indicating if there is more that didn’t fit. (And
allowing restart at that offset, for recovery code.)
The number of converted strings is also reported, corresponding to the
above options:
0. return the count of strings encoded and do nothing more, or
1. return a segment whose upper bound is after the last string’s null,
or
2. encode an extra empty string (a pair of null bytes) at the end.
And two more puzzlers pop into mind, for an argv/envp array, of type
`char**`. Here, I suppose that the stream that reads the things will
walk over a pointer to the `char**` array and read each `char*` item,
decoding as it goes.
Again the number of items to decode can be determined
0. with a predefined array-length for the strings to decode, or
1. read to the upper bound of the segment holding the array, or
2. stop when a `NULL` array element is found.
For the reverse encoding there are a bunch of options:
A. Allocate a *single* fresh native MS in a given session.
B. Allocate a *pair* of MS’s, with the string bodies in a native MS in
a given session.
C. Store the data into one or two given MS’s at one or two a given
offsets, etc.
And the count can similarly be represented:
0. return the count of strings encoded and do nothing more, or
1. return a segment whose upper bound is after the last array element,
or
2. encode an extra empty `NULL` element at the end of the output array
The options C above are tricky but some applications may choose to
embrace the complexity in order to reduce end-to-end copying. That in
turn suggests that maybe someone should create a library to manage
blocks of working storage that accumulate data structures for eventual
posting to other native code.
A very highly developed framework might also emphasize pointer-free
and/or position-independent data structures. The various structured
packet frameworks do this, one way or another. My favorite is
https://capnproto.org !
More information about the panama-dev
mailing list