Design for collections upgrades (was: Re: lambda-dev Digest, Vol 15, Issue 20 [reduce method result type])

Colin Decker cgdecker at gmail.com
Sat Mar 12 19:53:57 PST 2011


I don't feel like a comparison to InputStream etc. makes much sense
considering "Stream" is an arbitrary name that can easily be changed to
something more appropriate ("Streamable", say). I also think it's a weakness
of the IO streams that they don't have an Iterable equivalent (like
InputSupplier from Guava). Regardless, limiting lazy filter/map to Iterator
would be... undesirable. I do think Iterator should have such methods as
well, though.

-- 
Colin


On Sat, Mar 12, 2011 at 8:30 PM, Pavel Minaev <int19h at gmail.com> wrote:

> I've been re-reading the "Design for collection upgrades" thread, and had
> some thoughts on the nature of the Stream abstraction (for the lazy/serial
> scenario) as originally outlined in Brian's post that kicked off the
> thread, and users' expectations of what comes with that name.
>
> Per the proposal, it sounds like Stream is more or less a marker interface
> to indicate lazy operations, and not otherwise any different from Iterable.
> However, this isn't what "stream" has historically meant in imperative PLs
> and their frameworks. For some examples of what I mean, let me point at
> java.io.InputStream, or at C++ std::basic_istream. In both cases, the
> fundamental property of the streams - the one that is guaranteed to be
> supported by _any_ stream - is that you can read elements. InputStream also
> offers mark()/reset(), but those are conditionally supported, and so are
> not part of the basic contract of the abstraction; they could (one could
> even argue that, from a pure OOD approach, they should) just as well be
> moved to a more concrete interface that could be queried by clients
> instead of using markSupported().
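A client honouring that conditional contract has to probe for it first. A minimal sketch in today's Java (the class name and the peek() helper are illustrative, not existing API):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class MarkResetDemo {
    // Re-reads the first n bytes only if the stream supports mark/reset;
    // returns null when the optional part of the contract is unavailable.
    public static byte[] peek(InputStream in, int n) throws IOException {
        if (!in.markSupported()) {
            return null; // conditional operation, not part of the basic contract
        }
        in.mark(n);
        byte[] buf = new byte[n];
        int read = in.read(buf, 0, n);
        in.reset(); // rewind so the caller still sees every byte
        return read < 0 ? new byte[0] : Arrays.copyOf(buf, read);
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[] {1, 2, 3, 4});
        byte[] head = peek(in, 2);
        int first = in.read(); // still 1: reset() rewound the cursor
        System.out.println(head.length + " " + first);
    }
}
```

Only markSupported() tells the caller whether mark()/reset() will work; nothing in the basic read contract does.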
>
> The important thing with that basic contract is that once a value is read
> from a stream, it stays read: a stream by itself is a read-once sequence.
> There may be ways to obtain another stream that would re-read the same
> exact elements, but that would still be a new stream object. For streams
> that wrap some data store (e.g. FileInputStream or ByteArrayInputStream),
> the stream
> object is essentially a cursor into the store - it has a current position,
> which every read advances towards the end of the store. Furthermore, the
> stream is not _the_ store - you can have several FileInputStreams over the
> same file, or several ByteArrayInputStreams sharing the same byte array.
>
> Now if we take the above and see how it applies to collections, it actually
> is a very familiar concept: something that is not a collection itself, but
> is a forward-only cursor for a collection, and each collection may have
> more than one such cursor - why, that is Iterator. Its next() and
> hasNext()
> methods match exactly the basic input stream contract; remove() does not,
> but it is conditionally-supported anyway. An analogy to highlight this
> point: Iterator is to Iterable/Collection what ByteArrayInputStream is to
> byte array.
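The cursor analogy is easy to demonstrate with today's API: each call to iterator() yields an independent cursor over the same backing store, just as each ByteArrayInputStream keeps its own position over a shared array. A small sketch (the class name is illustrative):

```java
import java.util.Iterator;
import java.util.List;

public class CursorAnalogyDemo {
    public static void main(String[] args) {
        List<String> store = List.of("a", "b", "c");

        // Two cursors over the same backing store, each with its own
        // position - just like two ByteArrayInputStreams sharing one array.
        Iterator<String> first = store.iterator();
        Iterator<String> second = store.iterator();

        first.next(); // advances only the first cursor
        first.next();

        System.out.println(first.next());   // c
        System.out.println(second.next());  // a - untouched by reads on first
    }
}
```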
>
> Iterators, being cursors, are also naturally expected to be lazy by API
> clients - if I provide a method that takes an iterator, applies a filter
> to it, and returns another iterator as a result, no one would expect the
> filtering to occur over the entire collection right there and then. It's
> clear that the result of such an operation is a "filtered iterator",
> which applies the filter as it is being iterated.
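Such a filtered iterator is a few lines to sketch in today's Java (FilteredIterator is an illustrative name, not a proposed API). The only subtlety is a one-element lookahead, needed so hasNext() can answer without discarding a value:

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

public class FilteredIterator<T> implements Iterator<T> {
    private final Iterator<T> source;
    private final Predicate<T> filter;
    private T next;
    private boolean hasNext;

    public FilteredIterator(Iterator<T> source, Predicate<T> filter) {
        this.source = source;
        this.filter = filter;
        advance();
    }

    // Pull from the source only until the next match - nothing is
    // computed ahead of what the caller has asked for.
    private void advance() {
        hasNext = false;
        while (source.hasNext()) {
            T candidate = source.next();
            if (filter.test(candidate)) {
                next = candidate;
                hasNext = true;
                return;
            }
        }
    }

    @Override public boolean hasNext() { return hasNext; }

    @Override public T next() {
        if (!hasNext) throw new NoSuchElementException();
        T result = next;
        advance();
        return result;
    }

    public static void main(String[] args) {
        Iterator<Integer> evens = new FilteredIterator<>(
            List.of(1, 2, 3, 4, 5).iterator(), n -> n % 2 == 0);
        while (evens.hasNext()) System.out.println(evens.next());
    }
}
```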
>
> So, then, why not put lazy map/filter/reduce/... on Iterator? Thus, Brian's
> original serial/lazy example would become:
>
>    maxFooWeight = things.iterator()
>                         .filter(#Thing.isFoo)
>                         .map(#Thing.getWeight)
>                         .max();
>
> or maybe with some less concise but more descriptive (and - purely
> subjectively - "Java-like") names:
>
>    maxFooWeight = things.iterator()
>                         .withFilter(#Thing.isFoo)
>                         .withTransform(#Thing.getWeight)
>                         .getLargestValue();
>
> This has the nice property of working with a more fundamental abstraction
> than collections - I can write an Iterator that wraps a non-rewindable
> I/O InputStream, but I cannot write such an Iterable (well, I can - if it
> throws after the first call to iterator() - but it will only be usable
> with APIs which specify, as part of their public contract, that they call
> iterator() on the provided Iterable only once; otherwise, I'm relying on
> an implementation detail).
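A sketch of such a wrapper, again in today's Java (InputStreamIterator is an illustrative name). Since the underlying stream is read-once, the iterator is too, which is exactly why it cannot honestly back an Iterable:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

// A read-once Iterator over the bytes of a non-rewindable InputStream.
// A second iterator() over the same stream would have nothing left to read.
public class InputStreamIterator implements Iterator<Integer> {
    private final InputStream in;
    private int next; // one-byte lookahead; -1 means end of stream

    public InputStreamIterator(InputStream in) {
        this.in = in;
        next = read();
    }

    private int read() {
        try {
            return in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override public boolean hasNext() { return next != -1; }

    @Override public Integer next() {
        if (next == -1) throw new NoSuchElementException();
        int result = next;
        next = read();
        return result;
    }

    public static void main(String[] args) {
        Iterator<Integer> it = new InputStreamIterator(
            new ByteArrayInputStream(new byte[] {10, 20}));
        while (it.hasNext()) System.out.println(it.next());
    }
}
```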
>
> The only major annoyance I can see with this approach is that
> enhanced-for-loop only supports Iterables (and arrays) but not Iterators,
> and so you'd have to write a manual loop with next()/hasNext() to iterate
> over the result. But is there any reason why enhanced-for cannot be made to
> support Iterators directly? The only thing it does to the provided Iterable
> is to call iterator() on it, and it does so exactly once; it would just
> need to use the provided Iterator directly instead. It sounds like it
> would be trivial to add.
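For completeness, the workaround available without a language change is to wrap the Iterator in a throwaway Iterable; this sketch uses post-Java-8 syntax, and once() is a made-up helper. It is only safe because the loop calls iterator() exactly once:

```java
import java.util.Iterator;
import java.util.List;

public class ForEachOverIterator {
    // Wrap the Iterator in a one-shot Iterable so the enhanced-for loop
    // can consume it. Safe only because the loop calls iterator() once.
    public static <T> Iterable<T> once(Iterator<T> it) {
        return () -> it;
    }

    public static void main(String[] args) {
        Iterator<String> it = List.of("a", "b", "c").iterator();
        for (String s : once(it)) { // desugars to exactly one iterator() call
            System.out.println(s);
        }
    }
}
```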
>
>
> The above didn't touch on what the design would look like for eager or
> lazy/parallel operations. For parallel, the original design can be
> trivially adapted by moving asParallel() to Iterator directly, and
> producing some sort of ParallelIterator, which is simply a marker
> interface to enable parallelization for all applied operations, but
> otherwise is the same as Iterator (probably just extending it).
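The marker-interface idea can be sketched in a few lines (both names below are illustrative placeholders, not proposed API):

```java
import java.util.Iterator;
import java.util.List;

// ParallelIterator adds no methods of its own; its only job is to let
// filter/map/... implementations detect, via instanceof, that parallel
// execution was requested for the whole chain.
interface ParallelIterator<T> extends Iterator<T> { }

public class ParallelMarkerDemo {
    public static boolean isParallel(Iterator<?> it) {
        return it instanceof ParallelIterator;
    }

    public static void main(String[] args) {
        Iterator<String> serial = List.of("a").iterator();
        System.out.println(isParallel(serial)); // false
    }
}
```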
>
> For eager operations, I would prefer in-place mutating methods returning
> the receiver (to permit chaining), with distinct but obviously mapped
> names. For example, if Iterator has withFilter(), then Collection would
> have filter(). I don't think there is much utility in having eager ops
> that make a copy, and even less in chaining them. I think a simple
> addition of something like clone() to Collection (a default
> implementation could do newInstance(), assuming a no-arg constructor is
> available, and then addAll(this) on that new instance) would cover the
> vast majority of cases where you want a copy; the usual pattern would
> then be to do something like:
>
>   Collection<int> getWeights() {
>       return things.clone().filter(#Thing.isFoo).transform(#Thing.getWeight);
>   }
>
> where both filter() and transform() operate in-place on the cloned
> collection.
> This also skirts the whole question of the type of the resulting collection
> produced by a copying operation - it's clear and unambiguous what it'll be
> for clone(), and no other operation makes a copy.
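The clone-then-mutate pattern can be approximated with today's collections; in this sketch, copyOf() stands in for the proposed clone() default (new instance, then addAll), and removeIf() plays the role of the in-place filter():

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

public class CloneThenMutateDemo {
    // Stand-in for the proposed Collection.clone() default:
    // a fresh instance, then addAll(this).
    public static <T> Collection<T> copyOf(Collection<T> source) {
        Collection<T> copy = new ArrayList<>();
        copy.addAll(source);
        return copy;
    }

    // The original collection is never touched; only the copy mutates.
    public static Collection<Integer> evens(Collection<Integer> things) {
        Collection<Integer> copy = copyOf(things);
        copy.removeIf(n -> n % 2 != 0); // eager, in-place filter
        return copy;
    }

    public static void main(String[] args) {
        List<Integer> things = List.of(1, 2, 3, 4);
        System.out.println(evens(things)); // [2, 4]
        System.out.println(things);        // [1, 2, 3, 4] - unchanged
    }
}
```

Because the copy is made exactly once, by clone()/copyOf(), there is no ambiguity about the type of the result: it is whatever the clone produced.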
>
>
> On Tue, Mar 8, 2011 at 9:23 AM, Brian Goetz <brian.goetz at oracle.com>
> wrote:
>
> > Since people are already discussing this based on an experimental
> > checkin, let me outline the big picture plan here.
> >
> > The general idea is to add functional-like operations to collections --
> > filter, map, reduce, apply.
> >
> > I see three sensible modes, with explicit choices of which you get.
> >
> > 1.  Serial / Eager.  This is the straight
> > collections-with-functional-style mode, and some samples have already
> > been checked in as proof of concept.  Operations on collections yield
> > new collections, and you can chain the calls.  It values ease of use
> > over performance (no new concepts like laziness), but the performance
> > model is still highly predictable.  You get things like
> >
> >      Collection fooAbles = things.filter( #{ t -> t.isFoo() });
> >
> > or, with method references:
> >
> >      Collection fooAbles = things.filter(#Thing.isFoo); // ooh, pretty
> >
> > You can also chain calls together, though you pay a (predictable)
> > performance cost of intermediate collections, which for small
> > collections is unlikely to matter:
> >
> >      maxFooWeight = things.filter(#Thing.isFoo)
> >                           .map(#Thing.getWeight)
> >                           .max();
> >
> > The benefit here is concision and clarity.  The cost is some
> > performance, but maybe not so much that people freak out.  If people
> > care, they move to the next model, which is:
> >
> > 2.  Serial / Lazy.  Here, the primary abstraction is Stream (name to be
> > chosen later; Rémi used "lazy" in his example.)  To transfer between
> > "eager world" and "lazy world", you use conversion methods (toStream /
> > toCollection).  A typical call chain probably looks like:
> >   collection.toStream / op / op / op / {toCollection,reduce,apply}
> >
> > so the above example becomes
> >
> >      maxFooWeight = things.asStream()
> >                           .filter(#Thing.isFoo)
> >                           .map(#Thing.getWeight)
> >                           .max();
> >
> > The return type of Collection.filter is different from the return type
> > of Stream.filter, so the choice and performance costs are reflected in
> > the static type system.  This avoids the cost of the intermediate
> > collections, but is still serial.  If you care about that, you move up
> > to the next model, which is:
> >
> > 3.  Parallel / Lazy.  Here, the primary abstraction is something like
> > ParallelStream or ParallelIterable.  Let's call it ParallelFoo to avoid
> > bikeshedding for the moment.  Now, the code looks like:
> >
> >      maxFooWeight = things.asParallelFoo()
> >                           .filter(#Thing.isFoo)
> >                           .map(#Thing.getWeight)
> >                           .max();
> >
> > Again, the return type of ParallelFoo.filter is different from
> > Stream.filter or Collection.filter, so again the choice is reflected in
> > the static type system.  But you don't have to rewrite your code.
> >
> > The beauty here is twofold:
> >
> >  - The base model (serial/eager) is easy to understand and natural to
> > use as a way of expressing what the programmer wants to do, and
> > attractive enough to stand on its own -- just a little slow with big
> > collections.
> >  - Switching between execution models is mostly a matter of adding an
> > explicit conversion or two in the call chain, as the models are similar
> > enough that the rest of the code should still work (and even mean the
> > same thing.)
> >
> >
> > On 3/8/2011 8:43 AM, Rémi Forax wrote:
> > >    On 08/03/2011 14:31, Jim Mayer wrote:
> > >> // I can tolerate this one
> > >>       long product(List<Integer> list) {
> > >>         return list.map(#{x -> (long) x}).reduce(0L, #{sum, x -> sum * x});
> > >>       }
> > >
> > > I prefer this one:
> > >
> > >     long product(List<Integer> list) {
> > >         return list.lazy().map(#{x -> (long) x}).reduce(0L, #{sum, x -> sum * x});
> > >     }
> > >
> > > lazy() means, don't do map directly, but wait and do map and reduce in
> > > one iteration.
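The fusion Rémi describes, doing map and reduce in a single pass with no intermediate collection, can be sketched as one loop (modern Java; the mapReduce() helper is illustrative):

```java
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.Function;

public class FusedMapReduce {
    // One pass over the list: each element is mapped and immediately folded
    // into the accumulator - no intermediate mapped list is ever built.
    public static <T, R> R mapReduce(List<T> list, Function<T, R> map,
                                     R base, BinaryOperator<R> reduce) {
        R acc = base;
        for (T t : list) {
            acc = reduce.apply(acc, map.apply(t));
        }
        return acc;
    }

    public static void main(String[] args) {
        long product = mapReduce(List.of(1, 2, 3, 4),
                                 x -> (long) x, 1L, (a, b) -> a * b);
        System.out.println(product); // 24
    }
}
```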
> > >
> > > Rémi
> > >
> > >
> >
> >
>
>

