Allowing wrapping/transforming of the final result of Stream.collect()
Brian Goetz
brian.goetz at oracle.com
Tue Apr 16 06:55:42 PDT 2013
> AFAICT Stream's into() has been completely superceded by collect(), and
> Stream.Destination does not exist in b83.
Correct.
> Does an equivalent now exist with collect()? I could not find an obvious
> way to do this, as both accumulator and combiner are expected to operate on
> elements or partial results while collecting is in progress, but are not
> given the chance to transform the final result.
This is correct.
> If there is no equivalent, are there plans to add one? It would be a shame
> to miss this out, when I imagine it would be well used, not just for
> immutable collections, but also their cousins, the unmodifiable wrappers,
> as well as synchronized wrappers, and creating sets from maps
> (Collections.newSetFromMap, a common way to obtain a concurrent hash
> set[1]).
Agreed that it is a shame. We spent a lot of time investigating this,
and it added more complexity than it first appeared, so we retreated to
the current state.
The easy case is easy: you want to apply a transform to the final
result. As in:
StringBuilder sb = stuff.map(Object::toString)
.collect(toStringBuilder());
String result = sb.toString();
Here, the Collector does not return what you actually want. Of course,
in this case, its easy to do it unobstrusively:
String result = stuff.map(Object::toString)
.collect(toStringBuilder())
.toString();
But there are other cases where you want to do slightly more, and would
like to apply a function to the result. For example, its easy to
compute average with an array of two longs:
long[] raw
= stuff.collect(() -> new long[2],
(la, x) -> { la[0] += x; la[1]++; },
(la, lb) -> { la[0] += lb[0]; la[1] += lb[2]; });
double avg = (raw[1] > 0) ? (double) raw[0] / raw[1] : 0.0;
Here, its not quite so easy to just say
.toDouble()
on the result, so you have to do an extra step. Really, what you want
to be able to to do is roll both into a Collector, where the internal
state (array of longs) is hidden and only the result is exposed. We get
that.
Where this falls apart is when you want to do this as the downstream
reduction in a composed reduction, such as "group transactions by
salesman and compute average sale". You want to get a Map<Salesman,
Double> rather than a Map<Salesman, long[]>. Right?
But, this starts to get pretty messy. At the top level, you know when
you're done -- because you're out of input. So there's an obvious and
efficient time to apply the post-transform and just return that. But at
the next level down, you don't know whether there are more values
associated with a key coming. So you have to instantiate something like
a Map<Salesman, long[]>, and then transform it into a Map<Salesman,
Double>. And if you have a three-level groupBy going on, it gets worse.
This is messy. You have two choices: create a new Map (potentially
hugely expensive) or create a view map (potentally hugely memory
wasteful, potentially CPU-wasteful if an expensive transform has to be
recomputed for multiple gets of the same key, and potentially moves a
significant fraction of the computation until after the user thinks it
is over.) And the library doesn't have the data with which to choose
sensibly between these approaches (you need to know how expensive the
"before" in-memory representation is, and how expensive the transform
is, and make tradeoffs between them.)
Further, the "single post-transform" is also likely to be a sequential
bottleneck. Since you don't apply the post transform to the leaves of
the computation tree, but only the root, the obvious approach leads you
to doing it sequentially. Which likely kills any parallelism you would
have gotten.
None of these are insurmountable engineering problems, but extending the
Collector API to support all these cases takes what is a mostly simple
API and turns it into something much uglier. We spent a lot of time
exploring this and did not come up with something that was acceptable.
Given that the user has far more information on what the right choice is
than the library does, it made sense to just let the user handle this,
despite the regrettable loss of fluency. But making the Collector API
far more complicated seemed worse.
> Another pattern that I use is to take an immutable Collection<T> and wrap
> it with my own type to provide methods in the domain language, e.g.:
>
> Trades trades = new Trades(Arrays.asList(trade1, trade2, ...));
> ...
> Date dateOfEarliest = trades.earliest().getTradedAtDate();
>
> where the implementation of earliest() internally uses collection-like
> operations (e.g. min operation with a comparator) but hides them behind a
> method expressed in the domain language. I would like, but am currently
> unable, to produce a new Trades() instance at the end of the collect()
> method. The point of this example is that while we can talk of unmodifiable
> and synchronized wrappers, etc, which could be provided in the JDK, this is
> a use case that would be impossible for the JDK to provide.
Right. You collect() to a Collection and then wrap with a Trades. You
could do it like this:
Collection c = stream...collect();
Trades t = new Trades(c);
or
Trades t = new Trades(stream...collect());
or
Stream s = stream...stuff...
Trades t = new Trades(s.collect(...));
> - there's already several forms of collect(), another version adding a
> method with another parameter to provide the final conversion would spend
> some of the complexity budget
It's more than that. Having a form
<R, Z> Z collect(Collector<T, R> c, Function<R,Z> f)
is not so bad -- if it had a lot of value, I'd surely consider one more
form. And the implementation is obviously trivial. But where's the
value? It just saves the caller from having to do:
Z z = f.apply(r);
after the collect. And it will be sequential, even on a parallel
pipeline, which may be surprising to some users.
Basically:
- In the cases where this works, its trivial for the caller to do it
themselves;
- In the cases where it's not trivial for the caller to do it
themselves, it requires a great deal of API complexity to capture all
the possibilities.
More information about the lambda-dev
mailing list