<div dir="ltr">I have created some drafts of the EnumeratedStream API. But firstly, I have noticed that we have not formalized problems we are solving with them. Moreover, during the implementation process, I have discovered that there are possibly a much wider range of applications for those streams. Instead of just enumerating streams, those streams could be adapted to supply any kind of metadata along with the object itself, which would provide much more opportunities and EnumeratedStreams could be just one of possible implementations of this API. So let's point out problems solved by the new API.<div><br></div><div>For enumerated streams specifically:</div><div>1. Data-oriented design. Data oriented
design is a programming paradigm that is applied when a system operates on large amounts of data. In this paradigm objects are basically being split into "dimensions", which are basically 1d collections of values List of collections of values together represent a list of objects. You can think of this as a table where columns are 1d collections (dimensions), and rows are objects. For this approach stream enumeration is crucial to be able to access value from another dimension "on the same row". It could seem that data-oriented programming is fairly rare to be applied, but, actually, this is the main approach to work with large amounts of data. For example, pandas in python, to the best of my knowledge, implement a data-oriented approach.</div><div>2. Index lookup operations. While I still stand my ground that List.findIndex is a preferred approach, enumerated streams could also be used for things like "find all matching indexes in list".</div><div>3. Working with distances and hotspots. I am not really familiar with this, but David Alayachew in a related thread has explained it like so (quoting below):</div><div><br><span style="font-family:monospace">2. Building jumps/skips -- Fetching the indexes at hotspots, and then using them to create a skip-list-esque structure between the hotspots. Very useful for realigning search strategies.</span><br style="font-family:monospace"><span style="font-family:monospace">3. Calculating distance using index -- This is a sister concept to bullet 1. Let's say I built my skip list above, and have determined that my desired target is not in the hotspots. I can make a better decision to go parallel or sequential by seeing the distance between hotspots. That distance would be taken by subtracting hotspot index by hotspot index.\</span><br></div><div><br></div><div>End of quote.<br></div><div><br></div><div>I am sure there is more to it, but for now, let's get over this section.</div><div><br></div><div>As for metadata-supplying streams in general:</div><div>1. Replacing visitor pattern, Not necessarily streams, but fluent API in general are widely used to provide a better alternative for Visitor pattern. Such metadata could include current path, node position etc.</div><div>2. Any complex custom data processing pipeline. Metadata could represent some data that is evaluated based on a processed object. Such processing pipelines could extend the hypothetical "MetadataPipeline", implement a method that provides metadata based on an object that is processed and enjoy all benefits of fluent data processing zipped with metadata for it.</div><div>3. Virtually any kind of metadata-aware operations. Timestamps, geospatial data, tagging/classifying, possible stateful operations on data, source or origin info, quality?confidence scores etc. I think there is application for this in virtually every field.</div><div><br></div><div>Now that we described problems being solved. Let's move on to implementation details. There were discussions on a gatherer-based vs new-pipeline-type-based approach. I have implemented 6 operations (map, flatMap, filter, peek, takeWhile, dropWhile) both ways and here are some of my thoughts.</div><div><br></div><div>1. Complexity of implementation. Gatherers, without a doubt, were much easier to implement. For now class-based streams don't even implement parallel evaluation and it will be really hard to implement parallel evaluation in a way that enumerates elements in order of their occurence,</div><div>2. Performance. Both approaches demonstrate fairly similar performance. Gatherers outperform for 10-20% with findAny() terminal operation, while for toList() terminal operation class-based approach is usually 10-20% faster. Im sure my code could use A LOT of optimization, but this could give initial understanding of performance of both approaches. As for memory usage, the gatherer-based approach usually consumes up to 2 times more memory. This is just initial lightweight benchmark results, more comprehensive results with visualization will come in a few days after more complex benchmarking completes.</div><div>3. Preserving indexes. Expectedly, gatherers do not preserve indexes for further operations. Imo, this is what kills any benefits of this approach. This just removes a major part of potential use cases. Things like "find all matching indexes are still possible, but in an undesired (as I think) way. Consider following:</div><div><br></div><div>With stream-based approach:</div><div>list,stream().filter((i, x) -> ...).map((i, _) -> i)</div><div><br></div><div>With gatherer-based:</div><div>list.stream().gather(Gathrers.mapMultiEnumerated((sink, i, x) -> { if (predicate.test(i, x) sink.accept(i))</div><div><br></div><div>As you see, this first is clean and readable, while the second is merging a few steps of processing into one which is clearly undesirable.</div><div><br></div><div>This is omittable using something like zipWithIndex() gather operation and then using a stream of zipped objects, but unless there is a value classes, this will introduce performance issues and (as i think) not the most friendly and discoverable API.</div><div><br></div><div>4. Terminal operations. While gatherers could be "effectively terminal", there are few issues with this. Not even considering the fact that gatherers initially were designed for intermediate operations, and terminal operations would be much suitable for collectors if they were able to short circuit, for "effectively terminal" gatherers to be executed, terminal operation still has to be called, and will require some resources anyway, There is no terminal operation that just closes the stream as far as I know. Therefore, operations like forEachEnumerated or reduceEnumerated using gatherers will have some performance issues and not the most obvious API for people that aren't familiar with how deferred evaluation works.</div><div><br></div><div>Would appreciate any thoughts on what I have written above.If someone is willing to help me in development, here is a link to github repo`s branch: <a href="https://github.com/Evemose/jdk/tree/enumerated-streams">https://github.com/Evemose/jdk/tree/enumerated-streams</a></div><div><br></div><div>Best regards.</div><div><br></div><div><br></div><div><span style="font-family:monospace"><br></span></div><table cellpadding="0" class="gmail-cf gmail-gJ" style="border-collapse:collapse;margin-top:0px;width:auto;font-family:"Google Sans",Roboto,RobotoDraft,Helvetica,Arial,sans-serif;font-size:14px;display:block"><tbody style="display:block"></tbody></table></div>