Design for collections upgrades (was: Re: lambda-dev Digest, Vol 15, Issue 20 [reduce method result type])

Sun Mar 13 10:28:39 PDT 2011

I absolutely agree that you shouldn't make an Iterable that can only be used
once. Why do you think you can't have a reusable Iterable that represents an
infinite series or a large data set though? If you can create an Iterator
representing an infinite series, why couldn't you create that Iterator more
than once (in which case you can have an Iterable)?

-- 
Colin

On Sun, Mar 13, 2011 at 1:24 PM, Jim Mayer <jim at pentastich.org> wrote:

> Yes, you can "do" that with an iterable, but I don't like the idea of an
> iterable that can only be used once.  Most, if not all, classes that
> implement Iterable today have the property that you can call 'iterator'
> multiple times to get independent iterators.
>
> -- Jim
>
>
> On Sun, Mar 13, 2011 at 1:09 PM, Colin Decker <cgdecker at gmail.com> wrote:
>
>> Iterable can represent an infinite series or a very large data set just as
>> well as Iterator, given that it is just a factory for Iterators... it isn't
>> tied to Collections. Obviously there are some cases where you just want an
>> Iterator, and there's no reason that lazy filtering, mapping, etc. can't be
>> done on an Iterator as well.
>>
>> --
>> Colin
>>
>>
>>
>> On Sun, Mar 13, 2011 at 12:38 PM, Jim Mayer <jim at pentastich.org> wrote:
>>
>>> I have a different take on this.  Non-reusable streams (e.g., Iterator,
>>> not Iterable) can be used to represent data structures that cannot be easily
>>> regenerated.  Infinite series come to mind, as well as data sets that are
>>> very large or come from non-determinate sources.  Since lazily evaluated
>>> streams can represent these gracefully and collections cannot (what do you
>>> return for the size of a infinte series?), why cripple the stream
>>> abstraction by tying it to collections? On the other hand, if a stream is
>>> backed by a collection, one can simply call "asStream()" again to get a
>>> "rewound" series.
>>>
>>> So, instead of using Iterable, I'd prefer something like this:
>>>
>>> (1) Collections sport an "asStream" (or even the existing "iterator()"
>>> method, if Iterator turns out to be flexible enough).
>>> (2) That whatever class/interface is the result of "asStream" support an
>>> API for creating instances outside of the Collections framework (note that
>>> Iterator already does this).
>>> (3) That we consider making one of the operators something that can
>>> "truncate" an infinite series.
>>>
>>> With this kind of framework one could have the same kind of thing we've
>>> been talking about:
>>>
>>>     int sum(Collection<Integer> s) {
>>>         return s.asStream().fold(#{sum, x -> sum + x}, 0);
>>>     }
>>>
>>> but could also have:
>>>
>>>     int sum(BufferedReader r) {
>>>         return r.asStream(r#readLine).map(Integer#valueOf).fold(#{sum,
>>> x-> sum + x}, 0);
>>>     }
>>>
>>> or even:
>>>
>>>     int sum(StreamTypeThing s, int count) {
>>>         return s.prefix(count).fold(#{sum, x -> sum + x}, 0);
>>>     }
>>>
>>> As a side note, I agree with one of the earlier posters that if we did
>>> something like this we'd want to allow stuff like:
>>>
>>>     for (String x : c.asStream().map(...)) {}
>>>
>>> -- Jim
>>>
>>> On Sat, Mar 12, 2011 at 11:37 PM, Colin Decker <cgdecker at gmail.com>wrote:
>>>
>>>> Because there's no need to and because Iterables (being reusable) are
>>>> far
>>>> more flexible than Iterators. The Iterable interface is just as lazy as
>>>> Iterator. The Stream interface should be Iterable.
>>>>
>>>> --
>>>> Colin
>>>>
>>>>
>>>> On Sat, Mar 12, 2011 at 11:08 PM, Pavel Minaev <int19h at gmail.com>
>>>> wrote:
>>>>
>>>> > I'm curious as to why you think that limiting lazy operations to
>>>> iterators
>>>> > (or streams - the name is immaterial) is undesirable? In terms of
>>>> code, it
>>>> > is exactly the same as the proposal which started the thread, with a
>>>> need
>>>> > for one extra method call -  iterator() rather than asStream() - on
>>>> the
>>>> > collection, before you are in the lazy land. In terms of semantics, it
>>>> > avoids introduction of new abstractions where existing ones are
>>>> sufficient
>>>> > to capture the intent (iterators are inherently lazy).
>>>> >
>>>> > On Sat, Mar 12, 2011 at 7:53 PM, Colin Decker <cgdecker at gmail.com>
>>>> wrote:
>>>> >
>>>> >> I don't feel like a comparison to InputStream etc. makes much sense
>>>> >> considering "Stream" is an arbitrary name that can easily be changed
>>>> to
>>>> >> something more appropriate ("Streamable", say). I also think it's a
>>>> weakness
>>>> >> of the IO streams that they don't have an Iterable equivalent (like
>>>> >> InputSupplier from Guava). Regardless, limiting lazy filter/map to
>>>> Iterator
>>>> >> would be... undesirable. I do think Iterator should have such methods
>>>> as
>>>> >> well, though.
>>>> >>
>>>> >> --
>>>> >> Colin
>>>> >>
>>>> >>
>>>> >>
>>>> >> On Sat, Mar 12, 2011 at 8:30 PM, Pavel Minaev <int19h at gmail.com>
>>>> wrote:
>>>> >>
>>>> >>> I've been re-reading the "Design for collection upgrades" thread,
>>>> and had
>>>> >>> some thoughts on the nature of the Stream abstraction (for
>>>> lazy/serial
>>>> >>> scenario) as originally outlined in Brian's post that kicked off the
>>>> >>> thread,
>>>> >>> and users expectations of what comes with that name.
>>>> >>>
>>>> >>> Per the proposal, it sounds like Stream is more or less a marker
>>>> >>> interface
>>>> >>> to indicate lazy operations, and not otherwise any different from
>>>> >>> Iterable.
>>>> >>> However, this isn't what "stream" has historically meant in
>>>> imperative
>>>> >>> PLs
>>>> >>> and their frameworks. For some examples of what I mean, let me point
>>>> at
>>>> >>> java.io.InputStream, or at C++ std::basic_istream. In both cases,
>>>> the
>>>> >>> fundamental property of the streams - the one that is guaranteed to
>>>> be
>>>> >>> supported by _any_ stream - is that you can read elements.
>>>> InputStream
>>>> >>> also
>>>> >>> offers mark()/reset(), but those are conditionally supported, and so
>>>> are
>>>> >>> not
>>>> >>> part of the basic contract of the abstraction; they could (one could
>>>> even
>>>> >>> argue that, from a pure OOD approach, they should) just as well be
>>>> moved
>>>> >>> to
>>>> >>> a more concrete interface that could be queried by clients instead
>>>> of
>>>> >>> using
>>>> >>> markSupported().
>>>> >>>
>>>> >>> The important thing with that basic contract is that once a value is
>>>> read
>>>> >>> from a stream, it stays read: a stream by itself is a read-once
>>>> sequence.
>>>> >>> There may be ways to obtain another stream that would re-read the
>>>> same
>>>> >>> exact
>>>> >>> elements, but that would still be a new stream object. For streams
>>>> that
>>>> >>> wrap
>>>> >>> some data store (e.g. FileInputStream or ByteArrayInputStream), the
>>>> >>> stream
>>>> >>> object is essentially a cursor into the store - it has a current
>>>> >>> position,
>>>> >>> which every read advances towards the end of the store. Furthermore,
>>>> the
>>>> >>> stream is not _the_ store - you can have several FileInputStreams
>>>> over
>>>> >>> the
>>>> >>> same file, or several ByteArrayInputStreams sharing the same byte
>>>> array.
>>>> >>>
>>>> >>> Now if we take the above and see how it applies to collections, it
>>>> >>> actually
>>>> >>> is a very familiar concept: something that is not a collection
>>>> itself,
>>>> >>> but
>>>> >>> is a forward-only cursor for a collection, and each collection may
>>>> have
>>>> >>> more
>>>> >>> than one such cursor - why, that is Iterator. Its next() and
>>>> hasNext()
>>>> >>> methods match exactly the basic input stream contract; remove() does
>>>> not,
>>>> >>> but it is conditionally-supported anyway. An analogy to highlight
>>>> this
>>>> >>> point: Iterator is to Iterable/Collection what ByteArrayInputStream
>>>> is to
>>>> >>> byte array.
>>>> >>>
>>>> >>> Iterators, being cursors, are also naturally expected to be lazy by
>>>> API
>>>> >>> clients - if I provide a method that takes an iterator, applies a
>>>> filter
>>>> >>> to
>>>> >>> it, and returns another iterator as a result, no-one would expect
>>>> >>> filtering
>>>> >>> to occur over the entire collection right there and then. It's clear
>>>> that
>>>> >>> the result of such operation is a "filtered iterator", that would
>>>> apply
>>>> >>> the
>>>> >>> filter as it is being iterated.
>>>> >>>
>>>> >>> So, then, why not put lazy map/filter/reduce/... on Iterator? Thus,
>>>> >>> Brian's
>>>> >>> original serial/lazy example would become:
>>>> >>>
>>>> >>>    maxFooWeight = things.iterator()
>>>> >>>                               .filter(#Thing.isFoo)
>>>> >>>                              .map(#Thing.getWeight)
>>>> >>>                              .max();
>>>> >>>
>>>> >>> or maybe with some less concise but more descriptive (and - purely
>>>> >>> subjectively - "Java-like") names:
>>>> >>>
>>>> >>>    maxFooWeight = things.iterator()
>>>> >>>                              .withFilter(#Thing.isFoo)
>>>> >>>                              .withTransform(#Thing.getWeight)
>>>> >>>                              .getLargestValue();
>>>> >>>
>>>> >>> This has a nice property of working with a more fundamental
>>>> abstraction
>>>> >>> than
>>>> >>> collections - I can write an Iterator that wraps a non-rewindable
>>>> I/O
>>>> >>> InputStream, but I cannot write such an Iterable (well, I can - if
>>>> it
>>>> >>> throws
>>>> >>> after the first call to iterator() - but it will only be usable with
>>>> APIs
>>>> >>> which specify that they only call iterator() on the provided
>>>> Iterable
>>>> >>> once
>>>> >>> as part of their public contract; otherwise, I'm relying on an
>>>> >>> implementation detail).
>>>> >>>
>>>> >>> The only major annoyance I can see with this approach is that
>>>> >>> enhanced-for-loop only supports Iterables (and arrays) but not
>>>> Iterators,
>>>> >>> and so you'd have to write a manual loop with next()/hasNext() to
>>>> iterate
>>>> >>> over the result. But is there any reason why enhanced-for cannot be
>>>> made
>>>> >>> to
>>>> >>> support Iterators directly? The only thing it does to the provided
>>>> >>> Iterable
>>>> >>> is to call iterator() on it, and it does it exactly once; it would
>>>> just
>>>> >>> need
>>>> >>> to use the provided Iterator directly instead. It sounds like it
>>>> would be
>>>> >>> trivial to add.
>>>> >>>
>>>> >>>
>>>> >>> The above didn't touch on what the design would look like for eager
>>>> or
>>>> >>> lazy/parallel operations. For parallel, the original design can be
>>>> >>> trivially
>>>> >>> adapted by moving asParallel() to Iterator directly, and producing
>>>> some
>>>> >>> sort
>>>> >>> of ParallelIterator, which is simply a marker interface to enable
>>>> >>> parallelization for all applied operations, but otherwise is the
>>>> same as
>>>> >>> Iterator (probably just extending it).
>>>> >>>
>>>> >>> For eager operations, I would prefer in-place mutating methods
>>>> returning
>>>> >>> the
>>>> >>> receiver (to permit chaining), with distinct but obviously mapped
>>>> names.
>>>> >>> For
>>>> >>> example, if Iterator has withFilter(), then Collection would have
>>>> >>> filter().
>>>> >>> I don't think there is much utility in having eager ops that do a
>>>> copy,
>>>> >>> and
>>>> >>> even less so for chaining such. I think a simple addition of
>>>> something
>>>> >>> like
>>>> >>> clone() to Collection (default implementation could do newInstance()
>>>> >>> assuming a no-arg constructor available, and then addAll(this) on
>>>> that
>>>> >>> new
>>>> >>> instance) would cover vast majority of all cases where you want to
>>>> get a
>>>> >>> copy: the usual pattern would then be to do something like:
>>>> >>>
>>>> >>>   Collection<int> getWeights() {
>>>> >>>       return things.clone().filter(#Thing.isFoo).
>>>> >>> transform(#Thing.getWeight);
>>>> >>>   }
>>>> >>>
>>>> >>> where both filter() and map() operate in-place on the cloned
>>>> collection.
>>>> >>> This also skirts the whole question of the type of the resulting
>>>> >>> collection
>>>> >>> produced by a copying operation - it's clear and unambiguous what
>>>> it'll
>>>> >>> be
>>>> >>> for clone(), and no other operation makes a copy.
>>>> >>>
>>>> >>>
>>>> >>> On Tue, Mar 8, 2011 at 9:23 AM, Brian Goetz <brian.goetz at oracle.com
>>>> >
>>>> >>> wrote:
>>>> >>>
>>>> >>> > Since people are already discussing this based on an experimental
>>>> >>> > checkin, let me outline the big picture plan here.
>>>> >>> >
>>>> >>> > The general idea is to add functional-like operations to
>>>> collections --
>>>> >>> > filter, map, reduce, apply.
>>>> >>> >
>>>> >>> > I see three sensible modes, with explicit choices of which you
>>>> get.
>>>> >>> >
>>>> >>> > 1.  Serial / Eager.  This is the straight
>>>> >>> > collections-with-functional-style mode, and some samples have
>>>> already
>>>> >>> > been checked in as proof of concept.  Operations on collections
>>>> yield
>>>> >>> > new collections, and you can chain the calls.  It values ease of
>>>> use
>>>> >>> > over performance (no new concepts like laziness), but the
>>>> performance
>>>> >>> > model is still highly predictable.  You get things like
>>>> >>> >
>>>> >>> >      Collection fooAbles = things.filter( #{ t -> t.isFoo() });
>>>> >>> >
>>>> >>> > or, with method references:
>>>> >>> >
>>>> >>> >      Collection fooAbles = things.filter(#Thing.isFoo); // ooh,
>>>> pretty
>>>> >>> >
>>>> >>> > You can also chain calls together, though you pay a (predictable)
>>>> >>> > performance cost of intermediate collections, which for small
>>>> >>> > collections is unlikely to matter:
>>>> >>> >
>>>> >>> >      maxFooWeight = things.filter(#Thing.isFoo)
>>>> >>> >                           .map(#Thing.getWeight)
>>>> >>> >                           .max();
>>>> >>> >
>>>> >>> > The benefit here is concision and clarity.  The cost is some
>>>> >>> > performance, but maybe not so much that people freak out.  If
>>>> people
>>>> >>> > care, they move to the next model, which is:
>>>> >>> >
>>>> >>> > 2.  Serial / Lazy.  Here, the primary abstraction is Stream (name
>>>> to be
>>>> >>> > chosen later, Remi used "lazy" in his example.)  To transfer
>>>> between
>>>> >>> > "eager world" and "lazy world", you use conversion methods
>>>> (toStream /
>>>> >>> > toCollection).  A typical call chain probably looks like:
>>>> >>> >   collection.toStream / op / op / op / {toCollection,reduce,apply}
>>>> >>> >
>>>> >>> > so the above example becomes
>>>> >>> >
>>>> >>> >      maxFooWeight = things.asStream()
>>>> >>> >                           .filter(#Thing.isFoo)
>>>> >>> >                           .map(#Thing.getWeight)
>>>> >>> >                           .max();
>>>> >>> >
>>>> >>> > The return type of Collection.filter is different from the return
>>>> type
>>>> >>> > of Stream.filter, so the choice and performance costs are
>>>> reflected in
>>>> >>> > the static type system.  This avoids the cost of the intermediate
>>>> >>> > collections, but is still serial.  If you care about that, you
>>>> move up
>>>> >>> > to the next model, which is:
>>>> >>> >
>>>> >>> > 3.  Parallel / Lazy.  Here, the primary abstraction is something
>>>> like
>>>> >>> > ParallelStream or ParallelIterable.  Let's call it ParallelFoo to
>>>> avoid
>>>> >>> > bikeshedding for the moment.  Now, the code looks like:
>>>> >>> >
>>>> >>> >      maxFooWeight = things.asParallelFoo()
>>>> >>> >                           .filter(#Thing.isFoo)
>>>> >>> >                           .map(#Thing.getWeight)
>>>> >>> >                           .max();
>>>> >>> >
>>>> >>> > Again, the return type of ParallelFoo.filter is different from
>>>> >>> > Stream.filter or Collection.filter, so again the choice is
>>>> reflected in
>>>> >>> > the static type system.  But you don't have to rewrite your code.
>>>> >>> >
>>>> >>> > The beauty here is twofold:
>>>> >>> >
>>>> >>> >  - The base model (serial/eager) is easy to understand and natural
>>>> to
>>>> >>> > use as a way of expressing what the programmer wants to do, and
>>>> >>> > attractive enough to stand on its own -- just a little slow with
>>>> big
>>>> >>> > collections.
>>>> >>> >  - Switching between execution models is mostly a matter of adding
>>>> an
>>>> >>> > explicit conversion or two in the call chain, as the models are
>>>> similar
>>>> >>> > enough that the rest of the code should still work (and even mean
>>>> the
>>>> >>> > same thing.)
>>>> >>> >
>>>> >>> >
>>>> >>> > On 3/8/2011 8:43 AM, Rémi Forax wrote:
>>>> >>> > >    Le 08/03/2011 14:31, Jim Mayer a écrit :
>>>> >>> > >> // I can tolerate this one
>>>> >>> > >>       long product(List<Integer>   list) {
>>>> >>> > >>         return list.map(#{x ->   (long) x}).reduce(0L, #{sum, x
>>>> ->
>>>> >>> sum
>>>> >>> > * x});
>>>> >>> > >>       }
>>>> >>> > >
>>>> >>> > > I prefer this one:
>>>> >>> > >
>>>> >>> > >     long product(List<Integer>   list) {
>>>> >>> > >         return list.lazy().map(#{x ->   (long) x}).reduce(0L,
>>>> #{sum,
>>>> >>> x ->
>>>> >>> >   sum * x});
>>>> >>> > >     }
>>>> >>> > >
>>>> >>> > > lazy() means, don't do map directly, but wait and do map and
>>>> reduce
>>>> >>> in
>>>> >>> > > one iteration.
>>>> >>> > >
>>>> >>> > > Rémi
>>>> >>> > >
>>>> >>> > >
>>>> >>> >
>>>> >>> >
>>>> >>>
>>>> >>>
>>>> >>
>>>> >
>>>>
>>>>
>>>
>>
>