Allowing wrapping/transforming of the final result of Stream.collect()

Mon Sep 2 15:48:49 PDT 2013

Hi,

Mostly for posterity: as of b105, for the unmodifiableXXX use case, it is
now possible to do:

Set<Thing> unmodifiableResult = myThings.stream()
    ...
    .collect(Collectors.collectingAndThen(Collectors.toSet(),
                                          Collections::unmodifiableSet));
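
For anyone who wants to try it, here is a minimal self-contained sketch of the same idea. The class and method names are made up for illustration, and String stands in for Thing:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical demo class; only Collectors.collectingAndThen, Collectors.toSet
// and Collections.unmodifiableSet are real JDK API.
class UnmodifiableCollectDemo {

    // Collects the distinct elements into a Set, then applies the
    // unmodifiable wrapper as the final step of the collect().
    static Set<String> distinctUnmodifiable(List<String> things) {
        return things.stream()
                .collect(Collectors.collectingAndThen(
                        Collectors.toSet(),
                        Collections::unmodifiableSet));
    }

    public static void main(String[] args) {
        Set<String> result = distinctUnmodifiable(Arrays.asList("a", "b", "a"));
        System.out.println("distinct elements: " + result.size());
        try {
            result.add("c"); // the wrapper rejects mutation
        } catch (UnsupportedOperationException e) {
            System.out.println("result is read-only");
        }
    }
}
```

The wrapping function runs once, on the fully collected Set.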

Similarly for my Trades domain object example:

    .collect(Collectors.collectingAndThen(Collectors.toList(),
                                          Trades::new));

(looks nicer with static imports)
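
A rough sketch of the Trades case with static imports follows. The Trade and Trades classes here are hypothetical stand-ins (a trade is reduced to an id and an epoch day to keep it short), and tradesFrom is an illustrative helper, not anything from the JDK:

```java
import static java.util.stream.Collectors.collectingAndThen;
import static java.util.stream.Collectors.toList;

import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class TradesCollectDemo {

    // Hypothetical trade: an id plus the day it was traded.
    static final class Trade {
        final String id;
        final long tradedAtDay;

        Trade(String id, long tradedAtDay) {
            this.id = id;
            this.tradedAtDay = tradedAtDay;
        }
    }

    // Hypothetical domain wrapper that hides collection operations
    // behind methods named in the domain language.
    static final class Trades {
        private final List<Trade> trades;

        Trades(List<Trade> trades) {
            this.trades = trades;
        }

        Trade earliest() {
            return trades.stream()
                    .min(Comparator.comparingLong((Trade t) -> t.tradedAtDay))
                    .orElseThrow(() -> new IllegalStateException("no trades"));
        }
    }

    // Keeps trades on or after minDay and wraps the collected list in Trades.
    static Trades tradesFrom(List<Trade> input, long minDay) {
        return input.stream()
                .filter(t -> t.tradedAtDay >= minDay)
                .collect(collectingAndThen(toList(), Trades::new));
    }

    public static void main(String[] args) {
        Trades trades = tradesFrom(Arrays.asList(
                new Trade("t1", 30), new Trade("t2", 10), new Trade("t3", 20)), 15);
        System.out.println("earliest: " + trades.earliest().id);
    }
}
```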

Glad to see this included, hope it sticks around :-)
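
The composed case raised in the quoted thread below (group by salesman and compute the average sale) can be sketched in the same style. This assumes a hypothetical Sale class with a salesman name and a long amount. It relies on Collectors.averagingLong from the released Java 8 API, which hides its intermediate state and yields a Double per key. The outer collectingAndThen is optional and only there to show the wrapper applied at the top level:

```java
import static java.util.stream.Collectors.averagingLong;
import static java.util.stream.Collectors.collectingAndThen;
import static java.util.stream.Collectors.groupingBy;

import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class AverageBySalesmanDemo {

    // Hypothetical sale record: who sold it and for how much.
    static final class Sale {
        final String salesman;
        final long amount;

        Sale(String salesman, long amount) {
            this.salesman = salesman;
            this.amount = amount;
        }
    }

    // Groups sales by salesman, averages the amounts per group, and
    // wraps the resulting map so callers cannot modify it.
    static Map<String, Double> averageSaleBySalesman(List<Sale> sales) {
        return sales.stream()
                .collect(collectingAndThen(
                        groupingBy((Sale s) -> s.salesman,
                                   averagingLong((Sale s) -> s.amount)),
                        Collections::unmodifiableMap));
    }

    public static void main(String[] args) {
        Map<String, Double> averages = averageSaleBySalesman(Arrays.asList(
                new Sale("alice", 100), new Sale("alice", 300), new Sale("bob", 50)));
        System.out.println(averages);
    }
}
```

The result is a Map<String, Double> rather than a map of raw accumulators.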

Thanks,
Graham

On 16 April 2013 16:34, Graham Allan <grundlefleck at gmail.com> wrote:

> Hi Brian,
>
> The value lost is only in the fluency of the statements. Everything I want
> to do is already trivially achievable, just at a slightly reduced
> readability.
>
> Entirely satisfied with your reasoning, and appreciate the response.
>
> Cheers,
> Graham
>
>
> On 16 April 2013 14:55, Brian Goetz <brian.goetz at oracle.com> wrote:
>
>>> AFAICT Stream's into() has been completely superseded by collect(), and
>>> Stream.Destination does not exist in b83.
>>>
>>
>> Correct.
>>
>>
>>> Does an equivalent now exist with collect()? I could not find an obvious
>>> way to do this, as both accumulator and combiner are expected to operate
>>> on elements or partial results while collecting is in progress, but are
>>> not given the chance to transform the final result.
>>>
>>
>> This is correct.
>>
>>
>>> If there is no equivalent, are there plans to add one? It would be a
>>> shame to miss this out, when I imagine it would be well used, not just
>>> for immutable collections, but also their cousins, the unmodifiable
>>> wrappers, as well as synchronized wrappers, and creating sets from maps
>>> (Collections.newSetFromMap, a common way to obtain a concurrent hash
>>> set[1]).
>>>
>>
>> Agreed that it is a shame.  We spent a lot of time investigating this,
>> and it added more complexity than it first appeared, so we retreated to the
>> current state.
>>
>> The easy case is easy: you want to apply a transform to the final result.
>>  As in:
>>
>>   StringBuilder sb = stuff.map(Object::toString)
>>                           .collect(toStringBuilder());
>>   String result = sb.toString();
>>
>> Here, the Collector does not return what you actually want.  Of course,
>> in this case, it's easy to do it unobtrusively:
>>
>>   String result = stuff.map(Object::toString)
>>                        .collect(toStringBuilder())
>>                        .toString();
>>
>> But there are other cases where you want to do slightly more, and would
>> like to apply a function to the result.  For example, it's easy to compute
>> an average with an array of two longs:
>>
>>   long[] raw
>>       = stuff.collect(() -> new long[2],
>>                       (la, x) -> { la[0] += x; la[1]++; },
>>                       (la, lb) -> { la[0] += lb[0]; la[1] += lb[1]; });
>>   double avg = (raw[1] > 0) ? (double) raw[0] / raw[1] : 0.0;
>>
>> Here, it's not quite so easy to just say
>>   .toDouble()
>> on the result, so you have to do an extra step.  Really, what you want to
>> be able to do is roll both into a Collector, where the internal state
>> (array of longs) is hidden and only the result is exposed.  We get that.
>>
>> Where this falls apart is when you want to do this as the downstream
>> reduction in a composed reduction, such as "group transactions by salesman
>> and compute average sale".  You want to get a Map<Salesman, Double> rather
>> than a Map<Salesman, long[]>.  Right?
>>
>> But, this starts to get pretty messy.  At the top level, you know when
>> you're done -- because you're out of input.  So there's an obvious and
>> efficient time to apply the post-transform and just return that.  But at
>> the next level down, you don't know whether there are more values
>> associated with a key coming.  So you have to instantiate something like a
>> Map<Salesman, long[]>, and then transform it into a Map<Salesman, Double>.
>>  And if you have a three-level groupBy going on, it gets worse.
>>
>> This is messy.  You have two choices: create a new Map (potentially
>> hugely expensive) or create a view map (potentially hugely memory wasteful,
>> potentially CPU-wasteful if an expensive transform has to be recomputed for
>> multiple gets of the same key, and potentially moves a significant fraction
>> of the computation until after the user thinks it is over.)  And the
>> library doesn't have the data with which to choose sensibly between these
>> approaches (you need to know how expensive the "before" in-memory
>> representation is, and how expensive the transform is, and make tradeoffs
>> between them.)
>>
>> Further, the "single post-transform" is also likely to be a sequential
>> bottleneck.  Since you don't apply the post transform to the leaves of the
>> computation tree, but only the root, the obvious approach leads you to
>> doing it sequentially.  Which likely kills any parallelism you would have
>> gotten.
>>
>> None of these are insurmountable engineering problems, but extending the
>> Collector API to support all these cases takes what is a mostly simple API
>> and turns it into something much uglier.  We spent a lot of time exploring
>> this and did not come up with something that was acceptable. Given that the
>> user has far more information on what the right choice is than the library
>> does, it made sense to just let the user handle this, despite the
>> regrettable loss of fluency.  But making the Collector API far more
>> complicated seemed worse.
>>
>>
>>> Another pattern that I use is to take an immutable Collection<T> and wrap
>>> it with my own type to provide methods in the domain language, e.g.:
>>>
>>>      Trades trades = new Trades(Arrays.asList(trade1, trade2, ...));
>>>      ...
>>>      Date dateOfEarliest = trades.earliest().getTradedAtDate();
>>>
>>> where the implementation of earliest() internally uses collection-like
>>> operations (e.g. min operation with a comparator) but hides them behind a
>>> method expressed in the domain language. I would like, but am currently
>>> unable, to produce a new Trades() instance at the end of the collect()
>>> method. The point of this example is that while we can talk of
>>> unmodifiable and synchronized wrappers, etc, which could be provided in
>>> the JDK, this is a use case that would be impossible for the JDK to
>>> provide.
>>>
>>
>> Right.  You collect() to a Collection and then wrap with a Trades.  You
>> could do it like this:
>>
>>   Collection c = stream...collect();
>>   Trades t = new Trades(c);
>>
>> or
>>
>>   Trades t = new Trades(stream...collect());
>>
>> or
>>
>>   Stream s = stream...stuff...
>>   Trades t = new Trades(s.collect(...));
>>
>>
>>>    - there are already several forms of collect(); another version adding
>>> a method with another parameter to provide the final conversion would
>>> spend some of the complexity budget
>>>
>>
>> It's more than that.  Having a form
>>
>>   <R, Z> Z collect(Collector<T, R> c, Function<R,Z> f)
>>
>> is not so bad -- if it had a lot of value, I'd surely consider one more
>> form.  And the implementation is obviously trivial.  But where's the value?
>>  It just saves the caller from having to do:
>>
>>   Z z = f.apply(r);
>>
>> after the collect.  And it will be sequential, even on a parallel
>> pipeline, which may be surprising to some users.
>>
>> Basically:
>>  - In the cases where this works, it's trivial for the caller to do it
>> themselves;
>>  - In the cases where it's not trivial for the caller to do it
>> themselves, it requires a great deal of API complexity to capture all the
>> possibilities.
>>
>>
>>
>