Enumerated streams

Olexandr Rotan rotanolexandr842 at gmail.com
Tue May 7 21:29:38 UTC 2024


I have created some drafts of the EnumeratedStream API. But firstly, I have
noticed that we have not formalized problems we are solving with them.
Moreover, during the implementation process, I have discovered that there
are possibly a much wider range of applications for those streams. Instead
of just enumerating streams, those streams could be adapted to supply any
kind of metadata along with the object itself, which would provide much
more opportunities and EnumeratedStreams could be just one of possible
implementations of this API. So let's point out problems solved by the new
API.

For enumerated streams specifically:
1. Data-oriented design. Data oriented  design  is a programming paradigm
that is applied when a system operates on large amounts of data. In this
paradigm objects are basically being split into "dimensions", which are
basically 1d collections of values List of collections of values together
represent a list of objects. You can think of this as a table where columns
are 1d collections (dimensions), and rows are objects. For this approach
stream enumeration is crucial to be able to access value from another
dimension "on the same row". It could seem that data-oriented programming
is fairly rare to be applied, but, actually, this is the main approach to
work with large amounts of data. For example, pandas in python, to the best
of my knowledge, implement a data-oriented approach.
2. Index lookup operations. While I still stand my ground that
List.findIndex is a preferred approach, enumerated streams could also be
used for things like "find all matching indexes in list".
3. Working with distances and hotspots. I am not really familiar
with this, but David Alayachew in a related thread has explained it like so
(quoting below):

2. Building jumps/skips -- Fetching the indexes at hotspots, and then using
them to create a skip-list-esque structure between the hotspots. Very
useful for realigning search strategies.
3. Calculating distance using index -- This is a sister concept to bullet
1. Let's say I built my skip list above, and have determined that my
desired target is not in the hotspots. I can make a better decision to go
parallel or sequential by seeing the distance between hotspots. That
distance would be taken by subtracting hotspot index by hotspot index.\

End of quote.

I am sure there is more to it, but for now, let's get over this section.

As for metadata-supplying streams in general:
1. Replacing visitor pattern, Not necessarily streams, but fluent API in
general are widely used to provide a better alternative for Visitor
pattern. Such metadata could include current path, node position etc.
2. Any complex custom data processing pipeline. Metadata could represent
some data that is evaluated based on a processed object. Such processing
pipelines could extend the hypothetical "MetadataPipeline", implement a
method that provides metadata based on an object that is processed and
enjoy all benefits of fluent data processing zipped with metadata for it.
3. Virtually any kind of metadata-aware operations. Timestamps, geospatial
data, tagging/classifying, possible stateful operations on data, source or
origin info, quality?confidence scores etc. I think there is application
for this in virtually every field.

Now that we described problems being solved. Let's move on to
implementation details. There were discussions on a gatherer-based vs
new-pipeline-type-based approach. I have implemented 6 operations (map,
flatMap, filter, peek, takeWhile, dropWhile) both ways and here are some of
my thoughts.

1. Complexity of implementation. Gatherers, without a doubt, were much
easier to implement. For now class-based streams don't even implement
parallel evaluation and it will be really hard to implement parallel
evaluation in a way that enumerates elements in order of their occurence,
2. Performance. Both approaches demonstrate fairly similar performance.
Gatherers outperform for 10-20% with findAny() terminal operation, while
for toList() terminal operation class-based approach is usually 10-20%
faster. Im sure my code could use A LOT of optimization, but this could
give initial understanding of performance of both approaches. As for
memory usage, the gatherer-based approach usually consumes up to 2 times
more memory. This is just initial lightweight benchmark results, more
comprehensive results with visualization will come in a few days after more
complex benchmarking completes.
3. Preserving indexes. Expectedly, gatherers do not preserve indexes for
further operations. Imo, this is what kills any benefits of this approach.
This just removes a major part of potential use cases. Things like "find
all matching indexes are still possible, but in an undesired (as I think)
way. Consider following:

With stream-based approach:
list,stream().filter((i, x) -> ...).map((i, _) -> i)

With gatherer-based:
list.stream().gather(Gathrers.mapMultiEnumerated((sink, i, x) -> { if
(predicate.test(i, x) sink.accept(i))

As you see, this first is clean and readable, while the second is merging a
few steps of processing into one which is clearly undesirable.

This is omittable using something like zipWithIndex() gather operation and
then using a stream of zipped objects, but unless there is a value classes,
this will introduce performance issues and (as i think) not the most
friendly and discoverable API.

4. Terminal operations. While gatherers could be "effectively terminal",
there are few issues with this. Not even considering the fact that
gatherers initially were designed for intermediate operations, and terminal
operations would be much suitable for collectors if they were able to short
circuit, for "effectively terminal" gatherers to be executed, terminal
operation still has to be called, and will require some resources anyway,
There is no terminal operation that just closes the stream as far as I
know. Therefore, operations like forEachEnumerated or reduceEnumerated
using gatherers will have some performance issues and not the most obvious
API for people that aren't familiar with how deferred evaluation works.

Would appreciate any thoughts on what I have written above.If someone is
willing to help me in development, here is a link to github repo`s branch:
https://github.com/Evemose/jdk/tree/enumerated-streams

Best regards.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/core-libs-dev/attachments/20240508/652e1975/attachment.htm>


More information about the core-libs-dev mailing list