Design for collections upgrades (was: Re: lambda-dev Digest, Vol 15, Issue 20 [reduce method result type])

Sun Mar 13 10:42:49 PDT 2011

You can for things like the Fibonacci series that can be generated from
scratch.  You can't for things like data being read off of a stream or data
being read from a sensor.

In the case of the Fibonacci series, by all means let them implement
Iteratable.  It's a cool data structure.

-- Jim

On Sun, Mar 13, 2011 at 1:28 PM, Colin Decker <cgdecker at gmail.com> wrote:

> I absolutely agree that you shouldn't make an Iterable that can only be
> used once. Why do you think you can't have a reusable Iterable that
> represents an infinite series or a large data set though? If you can create
> an Iterator representing an infinite series, why couldn't you create that
> Iterator more than once (in which case you can have an Iterable)?
>
> --
> Colin
>
>
>
> On Sun, Mar 13, 2011 at 1:24 PM, Jim Mayer <jim at pentastich.org> wrote:
>
>> Yes, you can "do" that with an iterable, but I don't like the idea of an
>> iterable that can only be used once.  Most, if not all, classes that
>> implement Iterable today have the property that you can call 'iterator'
>> multiple times to get independent iterators.
>>
>> -- Jim
>>
>>
>> On Sun, Mar 13, 2011 at 1:09 PM, Colin Decker <cgdecker at gmail.com> wrote:
>>
>>> Iterable can represent an infinite series or a very large data set just
>>> as well as Iterator, given that it is just a factory for Iterators... it
>>> isn't tied to Collections. Obviously there are some cases where you just
>>> want an Iterator, and there's no reason that lazy filtering, mapping, etc.
>>> can't be done on an Iterator as well.
>>>
>>> --
>>> Colin
>>>
>>>
>>>
>>> On Sun, Mar 13, 2011 at 12:38 PM, Jim Mayer <jim at pentastich.org> wrote:
>>>
>>>> I have a different take on this.  Non-reusable streams (e.g., Iterator,
>>>> not Iterable) can be used to represent data structures that cannot be easily
>>>> regenerated.  Infinite series come to mind, as well as data sets that are
>>>> very large or come from non-determinate sources.  Since lazily evaluated
>>>> streams can represent these gracefully and collections cannot (what do you
>>>> return for the size of a infinte series?), why cripple the stream
>>>> abstraction by tying it to collections? On the other hand, if a stream is
>>>> backed by a collection, one can simply call "asStream()" again to get a
>>>> "rewound" series.
>>>>
>>>> So, instead of using Iterable, I'd prefer something like this:
>>>>
>>>> (1) Collections sport an "asStream" (or even the existing "iterator()"
>>>> method, if Iterator turns out to be flexible enough).
>>>> (2) That whatever class/interface is the result of "asStream" support an
>>>> API for creating instances outside of the Collections framework (note that
>>>> Iterator already does this).
>>>> (3) That we consider making one of the operators something that can
>>>> "truncate" an infinite series.
>>>>
>>>> With this kind of framework one could have the same kind of thing we've
>>>> been talking about:
>>>>
>>>>     int sum(Collection<Integer> s) {
>>>>         return s.asStream().fold(#{sum, x -> sum + x}, 0);
>>>>     }
>>>>
>>>> but could also have:
>>>>
>>>>     int sum(BufferedReader r) {
>>>>         return r.asStream(r#readLine).map(Integer#valueOf).fold(#{sum,
>>>> x-> sum + x}, 0);
>>>>     }
>>>>
>>>> or even:
>>>>
>>>>     int sum(StreamTypeThing s, int count) {
>>>>         return s.prefix(count).fold(#{sum, x -> sum + x}, 0);
>>>>     }
>>>>
>>>> As a side note, I agree with one of the earlier posters that if we did
>>>> something like this we'd want to allow stuff like:
>>>>
>>>>     for (String x : c.asStream().map(...)) {}
>>>>
>>>> -- Jim
>>>>
>>>> On Sat, Mar 12, 2011 at 11:37 PM, Colin Decker <cgdecker at gmail.com>wrote:
>>>>
>>>>> Because there's no need to and because Iterables (being reusable) are
>>>>> far
>>>>> more flexible than Iterators. The Iterable interface is just as lazy as
>>>>> Iterator. The Stream interface should be Iterable.
>>>>>
>>>>> --
>>>>> Colin
>>>>>
>>>>>
>>>>> On Sat, Mar 12, 2011 at 11:08 PM, Pavel Minaev <int19h at gmail.com>
>>>>> wrote:
>>>>>
>>>>> > I'm curious as to why you think that limiting lazy operations to
>>>>> iterators
>>>>> > (or streams - the name is immaterial) is undesirable? In terms of
>>>>> code, it
>>>>> > is exactly the same as the proposal which started the thread, with a
>>>>> need
>>>>> > for one extra method call -  iterator() rather than asStream() - on
>>>>> the
>>>>> > collection, before you are in the lazy land. In terms of semantics,
>>>>> it
>>>>> > avoids introduction of new abstractions where existing ones are
>>>>> sufficient
>>>>> > to capture the intent (iterators are inherently lazy).
>>>>> >
>>>>> > On Sat, Mar 12, 2011 at 7:53 PM, Colin Decker <cgdecker at gmail.com>
>>>>> wrote:
>>>>> >
>>>>> >> I don't feel like a comparison to InputStream etc. makes much sense
>>>>> >> considering "Stream" is an arbitrary name that can easily be changed
>>>>> to
>>>>> >> something more appropriate ("Streamable", say). I also think it's a
>>>>> weakness
>>>>> >> of the IO streams that they don't have an Iterable equivalent (like
>>>>> >> InputSupplier from Guava). Regardless, limiting lazy filter/map to
>>>>> Iterator
>>>>> >> would be... undesirable. I do think Iterator should have such
>>>>> methods as
>>>>> >> well, though.
>>>>> >>
>>>>> >> --
>>>>> >> Colin
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> On Sat, Mar 12, 2011 at 8:30 PM, Pavel Minaev <int19h at gmail.com>
>>>>> wrote:
>>>>> >>
>>>>> >>> I've been re-reading the "Design for collection upgrades" thread,
>>>>> and had
>>>>> >>> some thoughts on the nature of the Stream abstraction (for
>>>>> lazy/serial
>>>>> >>> scenario) as originally outlined in Brian's post that kicked off
>>>>> the
>>>>> >>> thread,
>>>>> >>> and users expectations of what comes with that name.
>>>>> >>>
>>>>> >>> Per the proposal, it sounds like Stream is more or less a marker
>>>>> >>> interface
>>>>> >>> to indicate lazy operations, and not otherwise any different from
>>>>> >>> Iterable.
>>>>> >>> However, this isn't what "stream" has historically meant in
>>>>> imperative
>>>>> >>> PLs
>>>>> >>> and their frameworks. For some examples of what I mean, let me
>>>>> point at
>>>>> >>> java.io.InputStream, or at C++ std::basic_istream. In both cases,
>>>>> the
>>>>> >>> fundamental property of the streams - the one that is guaranteed to
>>>>> be
>>>>> >>> supported by _any_ stream - is that you can read elements.
>>>>> InputStream
>>>>> >>> also
>>>>> >>> offers mark()/reset(), but those are conditionally supported, and
>>>>> so are
>>>>> >>> not
>>>>> >>> part of the basic contract of the abstraction; they could (one
>>>>> could even
>>>>> >>> argue that, from a pure OOD approach, they should) just as well be
>>>>> moved
>>>>> >>> to
>>>>> >>> a more concrete interface that could be queried by clients instead
>>>>> of
>>>>> >>> using
>>>>> >>> markSupported().
>>>>> >>>
>>>>> >>> The important thing with that basic contract is that once a value
>>>>> is read
>>>>> >>> from a stream, it stays read: a stream by itself is a read-once
>>>>> sequence.
>>>>> >>> There may be ways to obtain another stream that would re-read the
>>>>> same
>>>>> >>> exact
>>>>> >>> elements, but that would still be a new stream object. For streams
>>>>> that
>>>>> >>> wrap
>>>>> >>> some data store (e.g. FileInputStream or ByteArrayInputStream), the
>>>>> >>> stream
>>>>> >>> object is essentially a cursor into the store - it has a current
>>>>> >>> position,
>>>>> >>> which every read advances towards the end of the store.
>>>>> Furthermore, the
>>>>> >>> stream is not _the_ store - you can have several FileInputStreams
>>>>> over
>>>>> >>> the
>>>>> >>> same file, or several ByteArrayInputStreams sharing the same byte
>>>>> array.
>>>>> >>>
>>>>> >>> Now if we take the above and see how it applies to collections, it
>>>>> >>> actually
>>>>> >>> is a very familiar concept: something that is not a collection
>>>>> itself,
>>>>> >>> but
>>>>> >>> is a forward-only cursor for a collection, and each collection may
>>>>> have
>>>>> >>> more
>>>>> >>> than one such cursor - why, that is Iterator. Its next() and
>>>>> hasNext()
>>>>> >>> methods match exactly the basic input stream contract; remove()
>>>>> does not,
>>>>> >>> but it is conditionally-supported anyway. An analogy to highlight
>>>>> this
>>>>> >>> point: Iterator is to Iterable/Collection what ByteArrayInputStream
>>>>> is to
>>>>> >>> byte array.
>>>>> >>>
>>>>> >>> Iterators, being cursors, are also naturally expected to be lazy by
>>>>> API
>>>>> >>> clients - if I provide a method that takes an iterator, applies a
>>>>> filter
>>>>> >>> to
>>>>> >>> it, and returns another iterator as a result, no-one would expect
>>>>> >>> filtering
>>>>> >>> to occur over the entire collection right there and then. It's
>>>>> clear that
>>>>> >>> the result of such operation is a "filtered iterator", that would
>>>>> apply
>>>>> >>> the
>>>>> >>> filter as it is being iterated.
>>>>> >>>
>>>>> >>> So, then, why not put lazy map/filter/reduce/... on Iterator? Thus,
>>>>> >>> Brian's
>>>>> >>> original serial/lazy example would become:
>>>>> >>>
>>>>> >>>    maxFooWeight = things.iterator()
>>>>> >>>                               .filter(#Thing.isFoo)
>>>>> >>>                              .map(#Thing.getWeight)
>>>>> >>>                              .max();
>>>>> >>>
>>>>> >>> or maybe with some less concise but more descriptive (and - purely
>>>>> >>> subjectively - "Java-like") names:
>>>>> >>>
>>>>> >>>    maxFooWeight = things.iterator()
>>>>> >>>                              .withFilter(#Thing.isFoo)
>>>>> >>>                              .withTransform(#Thing.getWeight)
>>>>> >>>                              .getLargestValue();
>>>>> >>>
>>>>> >>> This has a nice property of working with a more fundamental
>>>>> abstraction
>>>>> >>> than
>>>>> >>> collections - I can write an Iterator that wraps a non-rewindable
>>>>> I/O
>>>>> >>> InputStream, but I cannot write such an Iterable (well, I can - if
>>>>> it
>>>>> >>> throws
>>>>> >>> after the first call to iterator() - but it will only be usable
>>>>> with APIs
>>>>> >>> which specify that they only call iterator() on the provided
>>>>> Iterable
>>>>> >>> once
>>>>> >>> as part of their public contract; otherwise, I'm relying on an
>>>>> >>> implementation detail).
>>>>> >>>
>>>>> >>> The only major annoyance I can see with this approach is that
>>>>> >>> enhanced-for-loop only supports Iterables (and arrays) but not
>>>>> Iterators,
>>>>> >>> and so you'd have to write a manual loop with next()/hasNext() to
>>>>> iterate
>>>>> >>> over the result. But is there any reason why enhanced-for cannot be
>>>>> made
>>>>> >>> to
>>>>> >>> support Iterators directly? The only thing it does to the provided
>>>>> >>> Iterable
>>>>> >>> is to call iterator() on it, and it does it exactly once; it would
>>>>> just
>>>>> >>> need
>>>>> >>> to use the provided Iterator directly instead. It sounds like it
>>>>> would be
>>>>> >>> trivial to add.
>>>>> >>>
>>>>> >>>
>>>>> >>> The above didn't touch on what the design would look like for eager
>>>>> or
>>>>> >>> lazy/parallel operations. For parallel, the original design can be
>>>>> >>> trivially
>>>>> >>> adapted by moving asParallel() to Iterator directly, and producing
>>>>> some
>>>>> >>> sort
>>>>> >>> of ParallelIterator, which is simply a marker interface to enable
>>>>> >>> parallelization for all applied operations, but otherwise is the
>>>>> same as
>>>>> >>> Iterator (probably just extending it).
>>>>> >>>
>>>>> >>> For eager operations, I would prefer in-place mutating methods
>>>>> returning
>>>>> >>> the
>>>>> >>> receiver (to permit chaining), with distinct but obviously mapped
>>>>> names.
>>>>> >>> For
>>>>> >>> example, if Iterator has withFilter(), then Collection would have
>>>>> >>> filter().
>>>>> >>> I don't think there is much utility in having eager ops that do a
>>>>> copy,
>>>>> >>> and
>>>>> >>> even less so for chaining such. I think a simple addition of
>>>>> something
>>>>> >>> like
>>>>> >>> clone() to Collection (default implementation could do
>>>>> newInstance()
>>>>> >>> assuming a no-arg constructor available, and then addAll(this) on
>>>>> that
>>>>> >>> new
>>>>> >>> instance) would cover vast majority of all cases where you want to
>>>>> get a
>>>>> >>> copy: the usual pattern would then be to do something like:
>>>>> >>>
>>>>> >>>   Collection<int> getWeights() {
>>>>> >>>       return things.clone().filter(#Thing.isFoo).
>>>>> >>> transform(#Thing.getWeight);
>>>>> >>>   }
>>>>> >>>
>>>>> >>> where both filter() and map() operate in-place on the cloned
>>>>> collection.
>>>>> >>> This also skirts the whole question of the type of the resulting
>>>>> >>> collection
>>>>> >>> produced by a copying operation - it's clear and unambiguous what
>>>>> it'll
>>>>> >>> be
>>>>> >>> for clone(), and no other operation makes a copy.
>>>>> >>>
>>>>> >>>
>>>>> >>> On Tue, Mar 8, 2011 at 9:23 AM, Brian Goetz <
>>>>> brian.goetz at oracle.com>
>>>>> >>> wrote:
>>>>> >>>
>>>>> >>> > Since people are already discussing this based on an experimental
>>>>> >>> > checkin, let me outline the big picture plan here.
>>>>> >>> >
>>>>> >>> > The general idea is to add functional-like operations to
>>>>> collections --
>>>>> >>> > filter, map, reduce, apply.
>>>>> >>> >
>>>>> >>> > I see three sensible modes, with explicit choices of which you
>>>>> get.
>>>>> >>> >
>>>>> >>> > 1.  Serial / Eager.  This is the straight
>>>>> >>> > collections-with-functional-style mode, and some samples have
>>>>> already
>>>>> >>> > been checked in as proof of concept.  Operations on collections
>>>>> yield
>>>>> >>> > new collections, and you can chain the calls.  It values ease of
>>>>> use
>>>>> >>> > over performance (no new concepts like laziness), but the
>>>>> performance
>>>>> >>> > model is still highly predictable.  You get things like
>>>>> >>> >
>>>>> >>> >      Collection fooAbles = things.filter( #{ t -> t.isFoo() });
>>>>> >>> >
>>>>> >>> > or, with method references:
>>>>> >>> >
>>>>> >>> >      Collection fooAbles = things.filter(#Thing.isFoo); // ooh,
>>>>> pretty
>>>>> >>> >
>>>>> >>> > You can also chain calls together, though you pay a (predictable)
>>>>> >>> > performance cost of intermediate collections, which for small
>>>>> >>> > collections is unlikely to matter:
>>>>> >>> >
>>>>> >>> >      maxFooWeight = things.filter(#Thing.isFoo)
>>>>> >>> >                           .map(#Thing.getWeight)
>>>>> >>> >                           .max();
>>>>> >>> >
>>>>> >>> > The benefit here is concision and clarity.  The cost is some
>>>>> >>> > performance, but maybe not so much that people freak out.  If
>>>>> people
>>>>> >>> > care, they move to the next model, which is:
>>>>> >>> >
>>>>> >>> > 2.  Serial / Lazy.  Here, the primary abstraction is Stream (name
>>>>> to be
>>>>> >>> > chosen later, Remi used "lazy" in his example.)  To transfer
>>>>> between
>>>>> >>> > "eager world" and "lazy world", you use conversion methods
>>>>> (toStream /
>>>>> >>> > toCollection).  A typical call chain probably looks like:
>>>>> >>> >   collection.toStream / op / op / op /
>>>>> {toCollection,reduce,apply}
>>>>> >>> >
>>>>> >>> > so the above example becomes
>>>>> >>> >
>>>>> >>> >      maxFooWeight = things.asStream()
>>>>> >>> >                           .filter(#Thing.isFoo)
>>>>> >>> >                           .map(#Thing.getWeight)
>>>>> >>> >                           .max();
>>>>> >>> >
>>>>> >>> > The return type of Collection.filter is different from the return
>>>>> type
>>>>> >>> > of Stream.filter, so the choice and performance costs are
>>>>> reflected in
>>>>> >>> > the static type system.  This avoids the cost of the intermediate
>>>>> >>> > collections, but is still serial.  If you care about that, you
>>>>> move up
>>>>> >>> > to the next model, which is:
>>>>> >>> >
>>>>> >>> > 3.  Parallel / Lazy.  Here, the primary abstraction is something
>>>>> like
>>>>> >>> > ParallelStream or ParallelIterable.  Let's call it ParallelFoo to
>>>>> avoid
>>>>> >>> > bikeshedding for the moment.  Now, the code looks like:
>>>>> >>> >
>>>>> >>> >      maxFooWeight = things.asParallelFoo()
>>>>> >>> >                           .filter(#Thing.isFoo)
>>>>> >>> >                           .map(#Thing.getWeight)
>>>>> >>> >                           .max();
>>>>> >>> >
>>>>> >>> > Again, the return type of ParallelFoo.filter is different from
>>>>> >>> > Stream.filter or Collection.filter, so again the choice is
>>>>> reflected in
>>>>> >>> > the static type system.  But you don't have to rewrite your code.
>>>>> >>> >
>>>>> >>> > The beauty here is twofold:
>>>>> >>> >
>>>>> >>> >  - The base model (serial/eager) is easy to understand and
>>>>> natural to
>>>>> >>> > use as a way of expressing what the programmer wants to do, and
>>>>> >>> > attractive enough to stand on its own -- just a little slow with
>>>>> big
>>>>> >>> > collections.
>>>>> >>> >  - Switching between execution models is mostly a matter of
>>>>> adding an
>>>>> >>> > explicit conversion or two in the call chain, as the models are
>>>>> similar
>>>>> >>> > enough that the rest of the code should still work (and even mean
>>>>> the
>>>>> >>> > same thing.)
>>>>> >>> >
>>>>> >>> >
>>>>> >>> > On 3/8/2011 8:43 AM, Rémi Forax wrote:
>>>>> >>> > >    Le 08/03/2011 14:31, Jim Mayer a écrit :
>>>>> >>> > >> // I can tolerate this one
>>>>> >>> > >>       long product(List<Integer>   list) {
>>>>> >>> > >>         return list.map(#{x ->   (long) x}).reduce(0L, #{sum,
>>>>> x ->
>>>>> >>> sum
>>>>> >>> > * x});
>>>>> >>> > >>       }
>>>>> >>> > >
>>>>> >>> > > I prefer this one:
>>>>> >>> > >
>>>>> >>> > >     long product(List<Integer>   list) {
>>>>> >>> > >         return list.lazy().map(#{x ->   (long) x}).reduce(0L,
>>>>> #{sum,
>>>>> >>> x ->
>>>>> >>> >   sum * x});
>>>>> >>> > >     }
>>>>> >>> > >
>>>>> >>> > > lazy() means, don't do map directly, but wait and do map and
>>>>> reduce
>>>>> >>> in
>>>>> >>> > > one iteration.
>>>>> >>> > >
>>>>> >>> > > Rémi
>>>>> >>> > >
>>>>> >>> > >
>>>>> >>> >
>>>>> >>> >
>>>>> >>>
>>>>> >>>
>>>>> >>
>>>>> >
>>>>>
>>>>>
>>>>
>>>
>>
>