Design for collections upgrades (was: Re: lambda-dev Digest, Vol 15, Issue 20 [reduce method result type])
Pavel Minaev
int19h at gmail.com
Sat Mar 12 17:30:58 PST 2011
I've been re-reading the "Design for collections upgrades" thread, and had
some thoughts on the nature of the Stream abstraction (for the lazy/serial
scenario) as originally outlined in Brian's post that kicked off the thread,
and on users' expectations of what comes with that name.
Per the proposal, it sounds like Stream is more or less a marker interface
to indicate lazy operations, and is otherwise no different from Iterable.
However, this isn't what "stream" has historically meant in imperative PLs
and their frameworks. For some examples of what I mean, consider
java.io.InputStream, or C++ std::basic_istream. In both cases, the
fundamental property of the streams - the one that is guaranteed to be
supported by _any_ stream - is that you can read elements. InputStream also
offers mark()/reset(), but those are conditionally supported, and so are not
part of the basic contract of the abstraction; they could (one could even
argue that, from a pure OOD standpoint, they should) just as well be moved
to a more concrete interface that clients would query for, instead of
calling markSupported().
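To make the "conditional contract" point concrete, here is the dance that
clients of today's java.io have to do (a small sketch using only java.io as
it exists):

    static int peekFirstByte(InputStream in) throws IOException {
        if (!in.markSupported()) {
            in = new BufferedInputStream(in);  // decorate to gain mark/reset
        }
        in.mark(1);              // the conditionally-supported part of the contract
        int first = in.read();
        in.reset();              // only legal because we checked markSupported()
        return first;
    }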
The important thing with that basic contract is that once a value is read
from a stream, it stays read: a stream by itself is a read-once sequence.
There may be ways to obtain another stream that would re-read the same exact
elements, but that would still be a new stream object. For streams that wrap
some data store (e.g. FileInputStream or ByteArrayInputStream), the stream
object is essentially a cursor into the store - it has a current position,
which every read advances towards the end of the store. Furthermore, the
stream is not _the_ store - you can have several FileInputStreams over the
same file, or several ByteArrayInputStreams sharing the same byte array.
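The cursor-vs-store distinction is trivial to demonstrate with the existing
classes:

    byte[] store = { 1, 2, 3 };
    InputStream a = new ByteArrayInputStream(store);
    InputStream b = new ByteArrayInputStream(store);
    a.read();   // returns 1 and advances a's position...
    a.read();   // ...so this returns 2
    b.read();   // still returns 1 - b is an independent cursor over the same store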
Now, if we take the above and ask how it applies to collections, it turns
out to be a very familiar concept: something that is not a collection
itself, but is a forward-only cursor into a collection, where each
collection may have more than one such cursor - why, that is Iterator. Its
next() and hasNext() methods match the basic input-stream contract exactly;
remove() does not, but it is conditionally supported anyway. An analogy to
highlight this point: Iterator is to Iterable/Collection what
ByteArrayInputStream is to a byte array.
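And the same experiment, replayed on the collections side of the analogy:

    List<String> store = Arrays.asList("x", "y");
    Iterator<String> a = store.iterator();
    Iterator<String> b = store.iterator();
    a.next();   // "x" - advances a only
    a.next();   // "y"
    b.next();   // still "x" - b is an independent cursor over the same list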
Iterators, being cursors, are also naturally expected to be lazy by API
clients - if I provide a method that takes an iterator, applies a filter to
it, and returns another iterator as a result, no one would expect the
filtering to occur over the entire collection right there and then. It's
clear that the result of such an operation is a "filtered iterator", which
applies the filter as it is being iterated.
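For the record, such a filtered iterator is only a screenful of today's
Java. A rough sketch (Predicate here is a stand-in for whatever SAM type the
libraries end up with; note the one-element look-ahead, which is the usual
price of filtering a cursor):

    interface Predicate<T> { boolean eval(T t); }   // placeholder SAM type

    static <T> Iterator<T> filtered(final Iterator<T> source,
                                    final Predicate<? super T> p) {
        return new Iterator<T>() {
            private T lookahead;
            private boolean more = advance();

            private boolean advance() {
                while (source.hasNext()) {
                    T t = source.next();
                    if (p.eval(t)) { lookahead = t; return true; }
                }
                return false;
            }
            public boolean hasNext() { return more; }
            public T next() {
                if (!more) throw new NoSuchElementException();
                T result = lookahead;
                more = advance();
                return result;
            }
            public void remove() { throw new UnsupportedOperationException(); }
        };
    }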
So, then, why not put the lazy map/filter/reduce/... operations on Iterator?
Brian's original serial/lazy example would then become:
    maxFooWeight = things.iterator()
                         .filter(#Thing.isFoo)
                         .map(#Thing.getWeight)
                         .max();
or maybe with some less concise but more descriptive (and - purely
subjectively - "Java-like") names:
    maxFooWeight = things.iterator()
                         .withFilter(#Thing.isFoo)
                         .withTransform(#Thing.getWeight)
                         .getLargestValue();
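Spelled out in today's Java, without the new function syntax, the shape of
that API might be something like the following sketch (Mapper, RichIterator
and all the names are placeholders; filtered() is the helper sketched
earlier):

    interface Mapper<T, R> { R map(T t); }          // placeholder SAM type

    final class RichIterator<T> implements Iterator<T> {
        private final Iterator<T> it;
        private RichIterator(Iterator<T> it) { this.it = it; }

        static <T> RichIterator<T> from(Iterator<T> it) {
            return new RichIterator<T>(it);
        }

        public boolean hasNext() { return it.hasNext(); }
        public T next()          { return it.next(); }
        public void remove()     { it.remove(); }

        // lazy: wraps, does not iterate
        RichIterator<T> withFilter(Predicate<? super T> p) {
            return from(filtered(it, p));
        }

        // lazy: wraps, does not iterate
        <R> RichIterator<R> withTransform(final Mapper<? super T, ? extends R> f) {
            final Iterator<T> src = it;
            return from(new Iterator<R>() {
                public boolean hasNext() { return src.hasNext(); }
                public R next()          { return f.map(src.next()); }
                public void remove()     { src.remove(); }
            });
        }

        // eager, terminal: drains the cursor
        T getLargestValue(Comparator<? super T> cmp) {
            T best = it.next();   // NoSuchElementException on an empty cursor
            while (it.hasNext()) {
                T t = it.next();
                if (cmp.compare(t, best) > 0) best = t;
            }
            return best;
        }
    }

(I gave getLargestValue() an explicit Comparator; the version in the example
above would presumably lean on Comparable instead.)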
This has the nice property of working with a more fundamental abstraction
than collections - I can write an Iterator that wraps a non-rewindable I/O
InputStream, but I cannot write such an Iterable (well, I can - if it throws
after the first call to iterator() - but it will only be usable with APIs
which specify, as part of their public contract, that they call iterator()
on the provided Iterable exactly once; otherwise, I'm relying on an
implementation detail).
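Here is what such a wrapper might look like, as a sketch over a raw
InputStream (error handling cut down to the bare minimum; note that there is
no honest Iterable to be had here, because the bytes are gone once read):

    static Iterator<Byte> bytes(final InputStream in) {
        return new Iterator<Byte>() {
            private int lookahead = readOne();   // one-byte look-ahead

            private int readOne() {
                try { return in.read(); }        // -1 signals end of stream
                catch (IOException e) { throw new RuntimeException(e); }
            }
            public boolean hasNext() { return lookahead != -1; }
            public Byte next() {
                if (lookahead == -1) throw new NoSuchElementException();
                byte b = (byte) lookahead;
                lookahead = readOne();
                return b;
            }
            public void remove() { throw new UnsupportedOperationException(); }
        };
    }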
The only major annoyance I can see with this approach is that the enhanced
for loop only supports Iterables (and arrays), but not Iterators, and so
you'd have to write a manual loop with next()/hasNext() to iterate over the
result. But is there any reason why enhanced-for cannot be made to support
Iterators directly? The only thing it does with the provided Iterable is
call iterator() on it, and it does so exactly once; it would just need to
use the provided Iterator directly instead. It sounds like it would be a
trivial addition.
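In the meantime, the usual workaround is a one-line adapter, which is
exactly the kind of single-use Iterable I complained about above:

    // valid for one loop only: iterator() hands out the same cursor every time
    static <T> Iterable<T> in(final Iterator<T> it) {
        return new Iterable<T>() {
            public Iterator<T> iterator() { return it; }
        };
    }

    // usage: for (String s : in(someIterator)) { ... }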
The above didn't touch on what the design would look like for eager or
lazy/parallel operations. For parallel, the original design can be trivially
adapted by moving asParallel() onto Iterator directly, and having it produce
some sort of ParallelIterator - simply a marker interface (probably just
extending Iterator) that enables parallelization for all applied operations,
but is otherwise the same as Iterator.
For eager operations, I would prefer in-place mutating methods returning the
receiver (to permit chaining), with distinct but obviously mapped names. For
example, if Iterator has withFilter(), then Collection would have filter().
I don't think there is much utility in having eager ops that make a copy,
and even less in chaining such ops. I think simply adding something like
clone() to Collection (a default implementation could call newInstance(),
assuming a no-arg constructor is available, and then addAll(this) on the new
instance) would cover the vast majority of cases where you want a copy; the
usual pattern would then be something like:
    Collection<Integer> getWeights() {
        return things.clone().filter(#Thing.isFoo)
                             .transform(#Thing.getWeight);
    }
where both filter() and transform() operate in place on the cloned
collection. This also skirts the whole question of what type of collection a
copying operation should produce - it's clear and unambiguous what it'll be
for clone(), and no other operation makes a copy.
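A sketch of what that default could look like, using plain reflection
(copyOf() standing in for the proposed Collection.clone(); it assumes a
public no-arg constructor, as stated, and punts everything else to
UnsupportedOperationException):

    static <E> Collection<E> copyOf(Collection<E> source) {
        try {
            @SuppressWarnings("unchecked")
            Collection<E> copy = (Collection<E>) source.getClass().newInstance();
            copy.addAll(source);
            return copy;
        } catch (InstantiationException e) {
            throw new UnsupportedOperationException("no accessible no-arg constructor", e);
        } catch (IllegalAccessException e) {
            throw new UnsupportedOperationException(e);
        }
    }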
On Tue, Mar 8, 2011 at 9:23 AM, Brian Goetz <brian.goetz at oracle.com> wrote:
> Since people are already discussing this based on an experimental
> checkin, let me outline the big picture plan here.
>
> The general idea is to add functional-like operations to collections --
> filter, map, reduce, apply.
>
> I see three sensible modes, with explicit choices of which you get.
>
> 1. Serial / Eager. This is the straight
> collections-with-functional-style mode, and some samples have already
> been checked in as proof of concept. Operations on collections yield
> new collections, and you can chain the calls. It values ease of use
> over performance (no new concepts like laziness), but the performance
> model is still highly predictable. You get things like
>
>     Collection fooAbles = things.filter( #{ t -> t.isFoo() });
>
> or, with method references:
>
>     Collection fooAbles = things.filter(#Thing.isFoo); // ooh, pretty
>
> You can also chain calls together, though you pay a (predictable)
> performance cost of intermediate collections, which for small
> collections is unlikely to matter:
>
>     maxFooWeight = things.filter(#Thing.isFoo)
>                          .map(#Thing.getWeight)
>                          .max();
>
> The benefit here is concision and clarity. The cost is some
> performance, but maybe not so much that people freak out. If people
> care, they move to the next model, which is:
>
> 2. Serial / Lazy. Here, the primary abstraction is Stream (name to be
> chosen later, Remi used "lazy" in his example.) To transfer between
> "eager world" and "lazy world", you use conversion methods (toStream /
> toCollection). A typical call chain probably looks like:
>     collection.toStream / op / op / op / {toCollection,reduce,apply}
>
> so the above example becomes
>
>     maxFooWeight = things.asStream()
>                          .filter(#Thing.isFoo)
>                          .map(#Thing.getWeight)
>                          .max();
>
> The return type of Collection.filter is different from the return type
> of Stream.filter, so the choice and performance costs are reflected in
> the static type system. This avoids the cost of the intermediate
> collections, but is still serial. If you care about that, you move up
> to the next model, which is:
>
> 3. Parallel / Lazy. Here, the primary abstraction is something like
> ParallelStream or ParallelIterable. Let's call it ParallelFoo to avoid
> bikeshedding for the moment. Now, the code looks like:
>
>     maxFooWeight = things.asParallelFoo()
>                          .filter(#Thing.isFoo)
>                          .map(#Thing.getWeight)
>                          .max();
>
> Again, the return type of ParallelFoo.filter is different from
> Stream.filter or Collection.filter, so again the choice is reflected in
> the static type system. But you don't have to rewrite your code.
>
> The beauty here is twofold:
>
> - The base model (serial/eager) is easy to understand and natural to
> use as a way of expressing what the programmer wants to do, and
> attractive enough to stand on its own -- just a little slow with big
> collections.
> - Switching between execution models is mostly a matter of adding an
> explicit conversion or two in the call chain, as the models are similar
> enough that the rest of the code should still work (and even mean the
> same thing.)
>
>
> On 3/8/2011 8:43 AM, Rémi Forax wrote:
> > Le 08/03/2011 14:31, Jim Mayer a écrit :
> >> // I can tolerate this one
> >> long product(List<Integer> list) {
> >>     return list.map(#{x -> (long) x}).reduce(1L, #{sum, x -> sum * x});
> >> }
> >
> > I prefer this one:
> >
> > long product(List<Integer> list) {
> >     return list.lazy().map(#{x -> (long) x}).reduce(1L, #{sum, x -> sum * x});
> > }
> >
> > lazy() means, don't do map directly, but wait and do map and reduce in
> > one iteration.
> >
> > Rémi
> >
> >
>
>