Allowing wrapping/transforming of the final result of Stream.collect()

Brian Goetz brian.goetz at oracle.com
Tue Apr 16 06:55:42 PDT 2013


> AFAICT Stream's into() has been completely superseded by collect(), and
> Stream.Destination does not exist in b83.

Correct.

> Does an equivalent now exist with collect()? I could not find an obvious
> way to do this, as both accumulator and combiner are expected to operate on
> elements or partial results while collecting is in progress, but are not
> given the chance to transform the final result.

This is correct.

> If there is no equivalent, are there plans to add one? It would be a shame
> to miss this out, when I imagine it would be well used, not just for
> immutable collections, but also their cousins, the unmodifiable wrappers,
> as well as synchronized wrappers, and creating sets from maps
> (Collections.newSetFromMap, a common way to obtain a concurrent hash
> set[1]).

Agreed that it is a shame.  We spent a lot of time investigating this, 
and it added more complexity than it first appeared, so we retreated to 
the current state.

The easy case is easy: you want to apply a transform to the final 
result.  As in:

   StringBuilder sb = stuff.map(Object::toString)
                           .collect(toStringBuilder());
   String result = sb.toString();

Here, the Collector does not return what you actually want.  Of course, 
in this case, it's easy to do it unobtrusively:

   String result = stuff.map(Object::toString)
                        .collect(toStringBuilder())
                        .toString();

But there are other cases where you want to do slightly more, and would 
like to apply a function to the result.  For example, it's easy to 
compute average with an array of two longs:

   long[] raw
       = stuff.collect(() -> new long[2],
                       (la, x) -> { la[0] += x; la[1]++; },
                       (la, lb) -> { la[0] += lb[0]; la[1] += lb[1]; });
   double avg = (raw[1] > 0) ? (double) raw[0] / raw[1] : 0.0;

Here, it's not quite so easy to just say
   .toDouble()
on the result, so you have to do an extra step.  Really, what you want 
to be able to do is roll both into a Collector, where the internal 
state (array of longs) is hidden and only the result is exposed.  We get 
that.
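
Until a Collector can do that, the closest approximation is probably a 
small helper method that keeps the long[] to itself.  A minimal sketch, 
assuming a Stream<Long> as input; the class and method names are made up 
for illustration:

```java
import java.util.stream.Stream;

class AverageSketch {
    // Hypothetical helper: the long[] {sum, count} accumulator never
    // escapes this method; callers only see the finished double.
    static double average(Stream<Long> values) {
        long[] raw = values.collect(
                () -> new long[2],
                (la, x) -> { la[0] += x; la[1]++; },
                (la, lb) -> { la[0] += lb[0]; la[1] += lb[1]; });
        // The final transform is the "extra step", applied once at the end.
        return (raw[1] > 0) ? (double) raw[0] / raw[1] : 0.0;
    }
}
```

This hides the intermediate state from the caller, but it is a method 
rather than a Collector, so it cannot be passed along as a downstream 
reduction.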

Where this falls apart is when you want to do this as the downstream 
reduction in a composed reduction, such as "group transactions by 
salesman and compute average sale".  You want to get a Map<Salesman, 
Double> rather than a Map<Salesman, long[]>.  Right?

But, this starts to get pretty messy.  At the top level, you know when 
you're done -- because you're out of input.  So there's an obvious and 
efficient time to apply the post-transform and just return that.  But at 
the next level down, you don't know whether there are more values 
associated with a key coming.  So you have to instantiate something like 
a Map<Salesman, long[]>, and then transform it into a Map<Salesman, 
Double>.  And if you have a three-level groupBy going on, it gets worse.
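
To make the two-step shape concrete, here is a rough sketch of the 
salesman example.  It assumes the java.util.stream API as it was later 
released in Java 8 (Collector.of and Collectors.groupingBy), which may 
differ from the build under discussion.  The Sale class is hypothetical, 
with a String standing in for Salesman:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Collector;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical record of one sale; a String stands in for the Salesman type.
class Sale {
    final String salesman;
    final long amount;

    Sale(String salesman, long amount) {
        this.salesman = salesman;
        this.amount = amount;
    }
}

class AverageBySalesman {
    static Map<String, Double> averageBySalesman(Stream<Sale> sales) {
        // Downstream reduction: {sum, count} per key, with no final transform.
        Collector<Sale, long[], long[]> sumAndCount = Collector.of(
                () -> new long[2],
                (la, s) -> { la[0] += s.amount; la[1]++; },
                (la, lb) -> { la[0] += lb[0]; la[1] += lb[1]; return la; });

        // Step 1: the intermediate Map<Salesman, long[]> has to exist in full,
        // because more values for any key may still arrive.
        Map<String, long[]> raw =
                sales.collect(Collectors.groupingBy(s -> s.salesman, sumAndCount));

        // Step 2: copy into a second map, transforming every value.
        Map<String, Double> averages = new HashMap<>();
        raw.forEach((name, a) -> averages.put(name, (double) a[0] / a[1]));
        return averages;
    }
}
```

Step 2 is a sequential pass over every key, made only after the whole 
reduction has finished.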

This is messy.  You have two choices: create a new Map (potentially 
hugely expensive) or create a view map (potentially hugely memory 
wasteful, potentially CPU-wasteful if an expensive transform has to be 
recomputed for multiple gets of the same key, and potentially moves a 
significant fraction of the computation until after the user thinks it 
is over.)  And the library doesn't have the data with which to choose 
sensibly between these approaches (you need to know how expensive the 
"before" in-memory representation is, and how expensive the transform 
is, and make tradeoffs between them.)
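
For illustration, the view-map option might look roughly like the 
following.  This is only a sketch of the idea, not something in the 
library; TransformedMapView is a hypothetical name, and the view is 
read-only:

```java
import java.util.AbstractMap;
import java.util.AbstractSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical read-only view: values are transformed lazily on access.
class TransformedMapView<K, R, Z> extends AbstractMap<K, Z> {
    private final Map<K, R> backing;   // the "before" representation stays alive
    private final Function<? super R, ? extends Z> transform;

    TransformedMapView(Map<K, R> backing, Function<? super R, ? extends Z> transform) {
        this.backing = backing;
        this.transform = transform;
    }

    @Override
    public Z get(Object key) {
        R raw = backing.get(key);
        // The transform runs again on every get() of the same key.
        return (raw == null) ? null : transform.apply(raw);
    }

    @Override
    public boolean containsKey(Object key) {
        return backing.containsKey(key);
    }

    @Override
    public Set<Map.Entry<K, Z>> entrySet() {
        return new AbstractSet<Map.Entry<K, Z>>() {
            @Override
            public int size() {
                return backing.size();
            }

            @Override
            public Iterator<Map.Entry<K, Z>> iterator() {
                Iterator<Map.Entry<K, R>> it = backing.entrySet().iterator();
                return new Iterator<Map.Entry<K, Z>>() {
                    @Override
                    public boolean hasNext() {
                        return it.hasNext();
                    }

                    @Override
                    public Map.Entry<K, Z> next() {
                        Map.Entry<K, R> e = it.next();
                        return new AbstractMap.SimpleImmutableEntry<>(
                                e.getKey(), transform.apply(e.getValue()));
                    }
                };
            }
        };
    }
}
```

The backing map is retained for as long as the view is, and nothing is 
cached, so an expensive transform is paid for on each lookup, after the 
user thinks the computation is over.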

Further, the "single post-transform" is also likely to be a sequential 
bottleneck.  Since you don't apply the post transform to the leaves of 
the computation tree, but only the root, the obvious approach leads you 
to doing it sequentially.  Which likely kills any parallelism you would 
have gotten.

None of these are insurmountable engineering problems, but extending the 
Collector API to support all these cases takes what is a mostly simple 
API and turns it into something much uglier.  We spent a lot of time 
exploring this and did not come up with something that was acceptable. 
Given that the user has far more information on what the right choice is 
than the library does, it made sense to just let the user handle this, 
despite the regrettable loss of fluency.  But making the Collector API 
far more complicated seemed worse.

> Another pattern that I use is to take an immutable Collection<T> and wrap
> it with my own type to provide methods in the domain language, e.g.:
>
>      Trades trades = new Trades(Arrays.asList(trade1, trade2, ...));
>      ...
>      Date dateOfEarliest = trades.earliest().getTradedAtDate();
>
> where the implementation of earliest() internally uses collection-like
> operations (e.g. min operation with a comparator) but hides them behind a
> method expressed in the domain language. I would like, but am currently
> unable, to produce a new Trades() instance at the end of the collect()
> method. The point of this example is that while we can talk of unmodifiable
> and synchronized wrappers, etc, which could be provided in the JDK, this is
> a use case that would be impossible for the JDK to provide.

Right.  You collect() to a Collection and then wrap with a Trades.  You 
could do it like this:

   Collection c = stream...collect();
   Trades t = new Trades(c);

or

   Trades t = new Trades(stream...collect());

or

   Stream s = stream...stuff...
   Trades t = new Trades(s.collect(...));
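
Spelled out a little more, and assuming the API as later released in 
Java 8 (Collectors.toList), the wrapping could be sketched as below.  
The Trade fields and the from() factory are hypothetical, and a long 
timestamp stands in for the Date in the original example:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical trade with an id and a traded-at timestamp in milliseconds.
class Trade {
    final String id;
    final long tradedAtMillis;

    Trade(String id, long tradedAtMillis) {
        this.id = id;
        this.tradedAtMillis = tradedAtMillis;
    }
}

// Domain wrapper around an unmodifiable copy of the collected trades.
class Trades {
    private final List<Trade> trades;

    Trades(Collection<Trade> trades) {
        this.trades = Collections.unmodifiableList(new ArrayList<>(trades));
    }

    // Domain-language method hiding a min-with-comparator operation.
    // Throws NoSuchElementException if there are no trades.
    Trade earliest() {
        return Collections.min(trades,
                Comparator.comparingLong((Trade t) -> t.tradedAtMillis));
    }

    // Hypothetical factory: collect first, then wrap, as described above.
    static Trades from(Stream<Trade> stream) {
        return new Trades(stream.collect(Collectors.toList()));
    }
}
```

A static factory like from() keeps the wrapping step in one place, at 
the cost of reading inside-out rather than as part of the pipeline.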

>   - there's already several forms of collect(), another version adding a
> method with another parameter to provide the final conversion would spend
> some of the complexity budget

It's more than that.  Having a form

   <R, Z> Z collect(Collector<T, R> c, Function<R,Z> f)

is not so bad -- if it had a lot of value, I'd surely consider one more 
form.  And the implementation is obviously trivial.  But where's the 
value?  It just saves the caller from having to do:

   Z z = f.apply(r);

after the collect.  And it will be sequential, even on a parallel 
pipeline, which may be surprising to some users.
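
To show how little such a form would add, here is a sketch of it as a 
static helper the caller could write themselves.  It assumes the 
three-parameter Collector type of the released Java 8 API rather than 
the Collector<T, R> shown above; CollectThen and its methods are 
hypothetical names:

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collector;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class CollectThen {
    // Hypothetical helper: run the collector, then apply f to its result.
    static <T, R, Z> Z collectThen(Stream<T> stream,
                                   Collector<? super T, ?, R> collector,
                                   Function<? super R, ? extends Z> f) {
        R r = stream.collect(collector);   // may run in parallel
        return f.apply(r);                 // always a single sequential call
    }

    // Usage: collect to a list, then wrap it as unmodifiable.
    static List<String> unmodifiableNames(Stream<String> names) {
        Function<List<String>, List<String>> wrap = Collections::unmodifiableList;
        return collectThen(names, Collectors.<String>toList(), wrap);
    }
}
```

The helper is two lines long, and f still runs once on the calling 
thread no matter how the pipeline was executed.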

Basically:
  - In the cases where this works, it's trivial for the caller to do it 
themselves;
  - In the cases where it's not trivial for the caller to do it 
themselves, it requires a great deal of API complexity to capture all 
the possibilities.



