Into
Remi Forax
forax at univ-mlv.fr
Sat Dec 22 09:55:56 PST 2012
On 12/22/2012 06:46 PM, Brian Goetz wrote:
> I was not proposing "let's replace into with toList". I already
> proposed in a previous message "let's replace into() with something
> whose semantics are more like reduce/forEach rather than the current
> weird semantics". The recent developments in reduce give us a
> consistent framework for doing that.
You want to replace into() with a kind of reducer, and toList() is a way
to hide the implementation of that kind of reducer, so ...
>
> Into is bad. It seemed clever at the time we first thought of it
> (turning the problem around on the target collection), but if you look
> at its semantics and its implementation, it's complicated, it's
> neither-here-nor-there, it's unlike any other op, and the obvious
> usages will likely parallelize terribly. Something that is more
> reduce-like (or forEach-like with a concurrent collection, if we had
> one) will be much easier to understand AND will perform better.
I understand why into() is bad when you call parallel() on the stream,
but it's also very useful if you don't call parallel() on a stream,
because at the boundary of methods people can use collections and are
able to choose the implementation.
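
For example (just a sketch, assuming into() keeps its current shape of
filling the target collection and returning it), the caller of a
sequential pipeline gets to pick the destination implementation:

    // sequential pipeline; the destination implementation is an explicit choice
    List<String> shortNames(Collection<String> names) {
        return names.stream()
                    .filter(s -> s.length() < 8)
                    .into(new LinkedList<String>());  // or an ArrayList, a TreeSet, ...
    }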
>
> So the question is not "should we replace into with toList." We have
> to have a discussion about what we replace into() with. This query
> was about whether sequential() still has a use other than as a (bad)
> implementation crutch for into(). And it seems that there is none,
> which is nice -- one less weird operation, replaced with one more
> version of the reduce swiss-army-knife.
>
> If there is a toList() it will (obviously) be sugar on top of
> something more general.
and here you agree with me that toList() on a parallel stream is a way
to abstract the implementation of some magic concurrent reducer.
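Concretely (only a sketch of the shape; I spell the mutable reduce here as
a three-argument collect(supplier, accumulator, combiner), the name is
illustrative), toList() could just be sugar for:

    // each leaf builds its own ArrayList, the combiner merges partial
    // lists going up the tree; toList() hides exactly this triple
    static <T> List<T> toList(Stream<T> stream) {
        return stream.collect(ArrayList::new,      // new leaf container
                              ArrayList::add,      // accumulate one element
                              ArrayList::addAll);  // merge two partial results
    }
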
Rémi
>
>
> On 12/22/2012 12:29 PM, Remi Forax wrote:
>> On 12/22/2012 06:16 PM, Brian Goetz wrote:
>>> Right. You want to do the upstream stuff in parallel, then you want to
>>> do the downstream stuff (a) serially, (b) in the current thread, and
>>> probably (c) in encounter order.
>>>
>>> So, assume for sake of discussion that we have some form of .toList(),
>>> whether as a "native" operation or some sort of
>>> reduce/combine/tabulate. Then you can say:
>>>
>>> parallelStream()...toList().forEach(...)
>>>
>>> and the list-building will happen in parallel and then forEach can
>>> happen sequentially.
>>>
>>> Given that, is there any reason left for sequential()?
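
Spelled out a bit (a sketch; people, transform() and render() are
placeholders, and toList() is the hypothetical list-building reduction
from above):

    // map in parallel, gather into a List in parallel, then consume
    // sequentially in the calling thread, in encounter order
    List<Result> results = people.parallelStream()
                                 .map(p -> transform(p))  // parallel part
                                 .toList();               // parallel list build
    results.forEach(r -> render(r));                      // sequential part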
>>
>> You also need a toSet() and yes, in that case you don't need
>> sequential(), and it's better because it's more explicit.
>> But there are still a lot of cases where you are sequential and you want
>> to control the destination collection,
>> that's why we need into().
>>
>> I don't think it's a good idea to oppose toList() and into(); both have
>> their purposes.
>>
>> Rémi
>>
>>>
>>>
>>>
>>> On 12/22/2012 11:55 AM, Joe Bowbeer wrote:
>>>> The main use case is sequential().forEach(), which inserts any ol'
>>>> for-loop into a computation.
>>>>
>>>> On Dec 21, 2012 1:48 PM, "Brian Goetz" <brian.goetz at oracle.com> wrote:
>>>>
>>>> If we get rid of into(), and replace it with an explicit reduction
>>>> [1], then we may be able to get rid of sequential() too.
>>>>
>>>> The primary use case for sequential() was for using non-thread-safe
>>>> Collections as an into() target. The convoluted into() turns the
>>>> problem around on the target, calling target.addAll(stream) on it,
>>>> and the typical implementation of into() is like the one in
>>>> Collection:
>>>>
>>>>     default void addAll(Stream<? extends E> stream) {
>>>>         if (stream.isParallel())
>>>>             stream = stream.sequential();
>>>>         stream.forEach(this::add);
>>>>     }
>>>>
>>>> or more compactly
>>>>
>>>>     default void addAll(Stream<? extends E> stream) {
>>>>         stream.sequential().forEach(this::add);
>>>>     }
>>>>
>>>> since sequential() is now a no-op on sequential streams.
>>>>
>>>> The code looks pretty, but the implementation is not; sequential()
>>>> is a barrier, meaning you have to stop and collect all the elements
>>>> into a temporary tree, and then dump them into the target. But it
>>>> is not obvious that it is a barrier, so people will be surprised.
>>>> (And on infinite streams, it's bad.)
>>>>
>>>> What used to be implicit in sequential() can now be made explicit
>>>> with:
>>>>
>>>>     if (stream.isParallel())
>>>>         stream = stream...whateverWeCallOrderPreservingInto().stream()
>>>>
>>>> That offers similar semantics and at least as good performance,
>>>> while also being more transparent and requiring one fewer weird
>>>> stream operation.
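
For addAll() itself, the explicit form could look something like this (a
sketch; the three-argument mutable reduce stands in for the not-yet-named
order-preserving gathering op):

    static <T> void addAllExplicit(Collection<T> target, Stream<T> stream) {
        if (stream.isParallel()) {
            // gather in parallel with an order-preserving mutable reduce ...
            List<T> gathered =
                stream.collect(ArrayList::new, ArrayList::add, ArrayList::addAll);
            target.addAll(gathered);      // ... then one serial bulk add
        } else {
            stream.forEach(target::add);  // sequential: no barrier needed
        }
    }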
>>>>
>>>> (I don't yet see the way to getting rid of .parallel(), but we can
>>>> possibly move it out of Stream and into a static method
>>>> Streams.parallel(stream), at some loss of discoverability. But we
>>>> can discuss.)
>>>>
>>>>
>>>> [1] Actually, we have to replace it with two explicit reductions, or
>>>> more precisely, a reduction and a for-eaching. One is the pure
>>>> reduction case that involves merging, and is suitable for
>>>> non-thread-safe collections (and required if order preservation is
>>>> desired); the other is the concurrent case, where we bombard a
>>>> concurrent collection with puts and hope it manages to sort them
>>>> out. The two are semantically very different; one is a reduce and
>>>> the other is a forEach, and so they should have different
>>>> manifestations in the code. Though there are really no concurrent
>>>> Collections right now anyway (though you could fake a concurrent Set
>>>> with a concurrent Map.)
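
For concreteness, the two flavors side by side (a sketch; the names are
illustrative, and the concurrent target is a queue since we have no
concurrent List):

    import java.util.*;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.stream.Stream;

    class TwoFlavors {
        // 1) merging reduction: each leaf fills its own non-thread-safe list,
        //    partial lists are combined going up the tree; can preserve
        //    encounter order
        static List<String> byMergingReduce(Stream<String> s) {
            return s.collect(ArrayList::new, ArrayList::add, ArrayList::addAll);
        }

        // 2) concurrent forEach: one shared thread-safe target bombarded with
        //    adds from many threads; no merge step, no encounter order
        static Collection<String> byConcurrentForEach(Stream<String> s) {
            Collection<String> bag = new ConcurrentLinkedQueue<>();
            s.forEach(bag::add);
            return bag;
        }
    }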
>>>>
>>>>
>>>>
>>>> On 12/21/2012 12:50 PM, Brian Goetz wrote:
>>>>
>>>> I'm starting to dislike "into".
>>>>
>>>> First, it's the only stream method which retains mutable state from
>>>> the user. That's not great.
>>>>
>>>> Second, the parallel story is bad. People are going to write
>>>>
>>>>     list.parallel().map(e -> e+1).into(new ArrayList<>());
>>>>
>>>> which will do a whole lot of trivial computation in parallel, wait on
>>>> the barrier implicit in sequential(), and then do an O(n) serial
>>>> thing.
>>>>
>>>> Third, the semantics are weird; we do this clever trick where
>>>> collections have to decide whether to do insertion in serial or
>>>> parallel. But as we all learned from Spinal Tap, there's a fine
>>>> line between clever and stupid.
>>>>
>>>> Instead, we could treat this like a mutable reduce, where leaves are
>>>> reduced to a List, and lists are merged as we go up the tree. Even
>>>> with dumb merging it is still going to be much faster than what we've
>>>> got now; no barrier, no buffer-the-whole-thing-and-copy, and the worst
>>>> serial step is O(n/2) instead of O(n). So probably 3x better just by
>>>> improving the serial fractions. But with a smarter combination step,
>>>> we can do better still. If we have a "concatenated list view"
>>>> operation (List concat(List a, List b), which returns a read-only,
>>>> conc-tree representation), then the big serial stage goes away.
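
A toy version of such a concat view, just to show that the merge step
itself can be O(1) (a real conc-tree representation would also need
balancing, efficient indexing, and so on):

    // read-only "concatenated list view": no copying; get and size delegate
    static <T> List<T> concat(List<T> a, List<T> b) {
        return new AbstractList<T>() {
            public int size() { return a.size() + b.size(); }
            public T get(int i) {
                return i < a.size() ? a.get(i) : b.get(i - a.size());
            }
        };
    }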
>>>>
>>>> And, of course, building atop reduce makes the whole thing simpler;
>>>> there are fewer ops that have their own distinct semantics, and the
>>>> semantics of into() are about as weird as you get.
>>>>
>>>>
>>>> Now that the tabulators framework gets users comfortable with the
>>>> explicit choice between functional and concurrent aggregation for
>>>> tabulation, it is a much shorter hop to get there. So let's build on
>>>> that and find some sort of way to surface mutable and concurrent
>>>> versions of "into". (Currently we have no good concurrent list-shaped
>>>> collections, but hopefully that changes.)
>>>>
>>>> Something like:
>>>>
>>>>     Stream.tabulate(collector(ArrayList::new))
>>>>     Stream.tabulate(concurrentCollector(ConcurrentFooList::new))
>>>>
>>>> Maybe with some rename of tabulate.
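
Whatever the final names, the collector(...) factory presumably just
bundles the mutable-reduce triple; a guess at the shape (purely
illustrative, not a proposal for the exact API):

    // hypothetical: what collector(ArrayList::new) might bundle
    interface MutableCollector<T, R> {
        R makeResult();                  // e.g. ArrayList::new
        void accumulate(R result, T t);  // add one element to a leaf result
        void combine(R left, R right);   // merge the right leaf into the left
    }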
>>>>
>>>> I think there's a small reorganization of naming lurking here
>>>> (involving tabulate, Tabulator, ConcurrentTabulator, MutableReducer,
>>>> reduce) that recasts into() either as an explicit functional or
>>>> concurrent tabulation. And one more tricky+slow special-purpose op
>>>> bites the dust, in favor of something that builds on our two favorite
>>>> primitives, fold (order-preserving) and forEach (not order-preserving.)
>>>>
>>>>
>>