Allowing wrapping/transforming of the final result of Stream.collect()

Tue Apr 16 08:34:43 PDT 2013

Hi Brian,

The value lost is only in the fluency of the statements. Everything I want
to do is already trivially achievable, just with slightly reduced
readability.

Entirely satisfied with your reasoning, and appreciate the response.

Cheers,
Graham

On 16 April 2013 14:55, Brian Goetz <brian.goetz at oracle.com> wrote:

>> AFAICT Stream's into() has been completely superseded by collect(), and
>> Stream.Destination does not exist in b83.
>>
>
> Correct.
>
>
>> Does an equivalent now exist with collect()? I could not find an obvious
>> way to do this, as both accumulator and combiner are expected to operate
>> on elements or partial results while collecting is in progress, but are
>> not given the chance to transform the final result.
>>
>
> This is correct.
>
>
>> If there is no equivalent, are there plans to add one? It would be a shame
>> to miss this out, when I imagine it would be well used, not just for
>> immutable collections, but also their cousins, the unmodifiable wrappers,
>> as well as synchronized wrappers, and creating sets from maps
>> (Collections.newSetFromMap, a common way to obtain a concurrent hash
>> set[1]).
>>
>
> Agreed that it is a shame.  We spent a lot of time investigating this, and
> it added more complexity than first appeared, so we retreated to the
> current state.
>
> The easy case is easy: you want to apply a transform to the final result.
>  As in:
>
>   StringBuilder sb = stuff.map(Object::toString)
>                           .collect(toStringBuilder());
>   String result = sb.toString();
>
> Here, the Collector does not return what you actually want.  Of course, in
> this case, it's easy to do it unobtrusively:
>
>   String result = stuff.map(Object::toString)
>                        .collect(toStringBuilder())
>                        .toString();
>
> But there are other cases where you want to do slightly more, and would
> like to apply a function to the result.  For example, it's easy to compute
> an average with an array of two longs:
>
>   long[] raw
>       = stuff.collect(() -> new long[2],
>                       (la, x) -> { la[0] += x; la[1]++; },
>                       (la, lb) -> { la[0] += lb[0]; la[1] += lb[1]; });
>   double avg = (raw[1] > 0) ? (double) raw[0] / raw[1] : 0.0;
>
> Here, it's not quite so easy to just say
>   .toDouble()
> on the result, so you have to do an extra step.  Really, what you want to
> be able to do is roll both into a Collector, where the internal state
> (array of longs) is hidden and only the result is exposed.  We get that.
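
[Editor's sketch] The two-step average above can be written out as a small runnable program. This is only an illustration, assuming the three-argument Stream.collect(supplier, accumulator, combiner) form as it shipped in Java 8; the class name AverageSketch and the helper average() are hypothetical and not part of any JDK API.

```java
import java.util.stream.Stream;

class AverageSketch {
    // Hypothetical helper: the long[] {sum, count} state stays inside this
    // method and only the final double is exposed to the caller.
    static double average(Stream<Long> stuff) {
        long[] raw = stuff.collect(
                () -> new long[2],                                  // fresh {sum, count}
                (la, x) -> { la[0] += x; la[1]++; },                // fold one element in
                (la, lb) -> { la[0] += lb[0]; la[1] += lb[1]; });   // merge partial results
        // The "extra step": transform the raw state into the value we want.
        return (raw[1] > 0) ? (double) raw[0] / raw[1] : 0.0;
    }

    public static void main(String[] args) {
        System.out.println(average(Stream.of(10L, 20L, 60L)));
    }
}
```

Wrapping the collect() call in a method hides the intermediate array from callers, but it does not make the reduction composable as a downstream collector, which is the harder case discussed next.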
>
> Where this falls apart is when you want to do this as the downstream
> reduction in a composed reduction, such as "group transactions by salesman
> and compute average sale".  You want to get a Map<Salesman, Double> rather
> than a Map<Salesman, long[]>.  Right?
>
> But, this starts to get pretty messy.  At the top level, you know when
> you're done -- because you're out of input.  So there's an obvious and
> efficient time to apply the post-transform and just return that.  But at
> the next level down, you don't know whether there are more values
> associated with a key coming.  So you have to instantiate something like a
> Map<Salesman, long[]>, and then transform it into a Map<Salesman, Double>.
>  And if you have a three-level groupBy going on, it gets worse.
>
> This is messy.  You have two choices: create a new Map (potentially hugely
> expensive) or create a view map (potentially hugely memory wasteful,
> potentially CPU-wasteful if an expensive transform has to be recomputed for
> multiple gets of the same key, and potentially moves a significant fraction
> of the computation until after the user thinks it is over.)  And the
> library doesn't have the data with which to choose sensibly between these
> approaches (you need to know how expensive the "before" in-memory
> representation is, and how expensive the transform is, and make tradeoffs
> between them.)
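
[Editor's sketch] The "create a new Map" choice described above might look roughly like the following when the caller does it by hand. It assumes the java.util.stream API as released in Java 8 (Collector.of and Collectors.groupingBy), which may differ from the b83 build under discussion. The Sale type, its fields, and the use of a String key in place of Salesman are hypothetical stand-ins.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collector;
import java.util.stream.Collectors;

class GroupedAverageSketch {
    // Hypothetical domain type; a String stands in for Salesman.
    static final class Sale {
        final String salesman;
        final long amount;
        Sale(String salesman, long amount) {
            this.salesman = salesman;
            this.amount = amount;
        }
    }

    static Map<String, Double> averageBySalesman(List<Sale> sales) {
        // Downstream reduction that keeps {sum, count} per key.
        Collector<Sale, long[], long[]> sumAndCount = Collector.of(
                () -> new long[2],
                (la, s) -> { la[0] += s.amount; la[1]++; },
                (la, lb) -> { la[0] += lb[0]; la[1] += lb[1]; return la; });

        // Step 1: the intermediate Map<Salesman, long[]>.
        Map<String, long[]> raw = sales.stream()
                .collect(Collectors.groupingBy(s -> s.salesman, sumAndCount));

        // Step 2: copy into a new map, applying the post-transform per entry.
        // Every group holds at least one sale, so the count is never zero.
        Map<String, Double> result = new HashMap<>();
        raw.forEach((k, v) -> result.put(k, (double) v[0] / v[1]));
        return result;
    }
}
```

The copy in step 2 runs sequentially after the reduction has finished, and its cost grows with the number of keys. That is exactly the trade-off the caller is better placed to judge than the library.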
>
> Further, the "single post-transform" is also likely to be a sequential
> bottleneck.  Since you don't apply the post transform to the leaves of the
> computation tree, but only the root, the obvious approach leads you to
> doing it sequentially.  Which likely kills any parallelism you would have
> gotten.
>
> None of these are insurmountable engineering problems, but extending the
> Collector API to support all these cases takes what is a mostly simple API
> and turns it into something much uglier.  We spent a lot of time exploring
> this and did not come up with something that was acceptable. Given that the
> user has far more information on what the right choice is than the library
> does, it made sense to just let the user handle this, despite the
> regrettable loss of fluency.  But making the Collector API far more
> complicated seemed worse.
>
>
>> Another pattern that I use is to take an immutable Collection<T> and wrap
>> it with my own type to provide methods in the domain language, e.g.:
>>
>>      Trades trades = new Trades(Arrays.asList(trade1, trade2, ...));
>>      ...
>>      Date dateOfEarliest = trades.earliest().getTradedAtDate();
>>
>> where the implementation of earliest() internally uses collection-like
>> operations (e.g. min operation with a comparator) but hides them behind a
>> method expressed in the domain language. I would like, but am currently
>> unable, to produce a new Trades() instance at the end of the collect()
>> method. The point of this example is that while we can talk of
>> unmodifiable and synchronized wrappers, etc, which could be provided in
>> the JDK, this is a use case that would be impossible for the JDK to
>> provide.
>>
>
> Right.  You collect() to a Collection and then wrap with a Trades.  You
> could do it like this:
>
>   Collection c = stream...collect();
>   Trades t = new Trades(c);
>
> or
>
>   Trades t = new Trades(stream...collect());
>
> or
>
>   Stream s = stream...stuff...
>   Trades t = new Trades(s.collect(...));
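
[Editor's sketch] A self-contained version of the collect-then-wrap pattern, assuming Collectors.toList() from the released Java 8 API. The Trade and Trades classes here are hypothetical reconstructions for illustration only; the original Trades class is not shown in the thread, and a plain long field replaces the traded-at Date to keep the example short.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class TradesSketch {
    // Hypothetical element type.
    static final class Trade {
        final String id;
        final long tradedAt;
        Trade(String id, long tradedAt) {
            this.id = id;
            this.tradedAt = tradedAt;
        }
    }

    // Hypothetical domain wrapper over an unmodifiable copy of the trades.
    static final class Trades {
        private final Collection<Trade> trades;
        Trades(Collection<Trade> trades) {
            this.trades = Collections.unmodifiableCollection(new ArrayList<>(trades));
        }
        // Domain-language method hiding a min-with-comparator operation.
        // Throws NoSuchElementException if there are no trades.
        Trade earliest() {
            return Collections.min(trades,
                    Comparator.comparingLong((Trade t) -> t.tradedAt));
        }
    }

    // Collect to a Collection, then wrap: one extra expression at the call site.
    static Trades wrap(Stream<Trade> stream) {
        return new Trades(stream.collect(Collectors.toList()));
    }
}
```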
>
>
>>   - there are already several forms of collect(); another version adding
>> a method with another parameter to provide the final conversion would
>> spend some of the complexity budget
>>
>
> It's more than that.  Having a form
>
>   <R, Z> Z collect(Collector<T, R> c, Function<R,Z> f)
>
> is not so bad -- if it had a lot of value, I'd surely consider one more
> form.  And the implementation is obviously trivial.  But where's the value?
>  It just saves the caller from having to do:
>
>   Z z = f.apply(r);
>
> after the collect.  And it will be sequential, even on a parallel
> pipeline, which may be surprising to some users.
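
[Editor's sketch] To show how little the two-argument form would add, here is a caller-side helper that does the same thing. It assumes the three-parameter Collector<T, A, R> type from the released Java 8 API rather than the two-parameter Collector<T, R> quoted above; the class and method names are hypothetical.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collector;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class CollectThenSketch {
    // Hypothetical helper equivalent to a collect(Collector, Function) overload.
    static <T, R, Z> Z collectThen(Stream<T> stream,
                                   Collector<? super T, ?, R> collector,
                                   Function<? super R, ? extends Z> f) {
        R r = stream.collect(collector); // may execute in parallel
        return f.apply(r);               // applied once, after the reduction is done
    }

    public static void main(String[] args) {
        List<String> names = collectThen(Stream.of("b", "a"),
                                         Collectors.toList(),
                                         Collections::unmodifiableList);
        System.out.println(names);
    }
}
```

The function f is applied exactly once to the final result, regardless of how the stream itself was evaluated, which matches the sequential behaviour noted above.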
>
> Basically:
>  - In the cases where this works, it's trivial for the caller to do it
> themselves;
>  - In the cases where it's not trivial for the caller to do it themselves,
> it requires a great deal of API complexity to capture all the possibilities.