Collectors update redux

Brian Goetz brian.goetz at oracle.com
Thu Feb 7 17:11:44 PST 2013


>     I can think of uses for all of it, but I worry about someone faced
>     with picking the right static factory method of Collectors. Maybe
>     with the right class comment, users can be guided to the right
>     combinator without having to know much.

It's worth noting that the only method that is really needed is:

<R> R reduce(Supplier<R> factory,
              BiFunction<R, T, R> reducer,
              BinaryOperator<R> combiner);

All the other forms of reduce/collect can be written in terms of this 
one -- though some are more awkward than others.  Similarly, all the 
Collectors are just "macros" for specific combinations of inputs to this 
form of reduce.
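
To make that concrete, here is a minimal sketch of the "macro" claim
for the simplest case (collecting to a List), written against the
three-argument Stream.collect(Supplier, BiConsumer, BiConsumer) form,
the mutable-reduction sibling of the signature above; class and
variable names are just for illustration:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.stream.Stream;

    public class RawReduceSketch {
        public static void main(String[] args) {
            // "toList" is just one particular choice of factory (create
            // the result container), reducer (fold one element in), and
            // combiner (merge two partial containers when the stream is
            // split for parallel execution).
            List<String> letters = Stream.of("a", "b", "c")
                .collect(ArrayList::new,      // factory
                         ArrayList::add,      // reducer
                         ArrayList::addAll);  // combiner
            System.out.println(letters);      // prints [a, b, c]
        }
    }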

And, as to the Collectors, groupBy can be written in terms of 
groupingReduce; partitioning is just grouping with a boolean-valued 
function; joiningWith is a form of groupingReduce too.  We don't *need* 
any of them.  They're all just reductions that can be expressed with the 
above form.
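
As a small illustration of "partitioning is just grouping with a
boolean-valued function", here is a sketch using the groupingBy and
partitioningBy spellings from java.util.stream.Collectors (which
differ from the draft names above); both pipelines produce the same
mapping:

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    public class PartitionAsGrouping {
        public static void main(String[] args) {
            // Grouping by a boolean-valued classifier...
            Map<Boolean, List<Integer>> grouped = Stream.of(1, 2, 3, 4, 5)
                .collect(Collectors.groupingBy(n -> n % 2 == 0));
            // ...matches the dedicated partitioning collector (which also
            // guarantees that both keys are present in the result).
            Map<Boolean, List<Integer>> partitioned = Stream.of(1, 2, 3, 4, 5)
                .collect(Collectors.partitioningBy(n -> n % 2 == 0));
            System.out.println(partitioned);             // {false=[1, 3, 5], true=[2, 4]}
            System.out.println(grouped.equals(partitioned));  // true
        }
    }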

So we *could* boil everything down to just one method.  But, of course, 
we should not, because the client code gets harder to write, harder to 
read, and more error-prone.  Each "A can be written in terms of B" 
requires an "aha" that is obvious in hindsight but could well be slow in 
coming.

So it's really a question of "where do we turn the knob to."  The forms 
of reduce we've got are a (non-orthogonal) set that are (subjectively) 
tailored to specific categories of perceived-to-be common situations. 
Similarly, the set of Collectors is based on having scoured various "100 
cool examples with <my favorite query framework>" to distill out common 
use cases.  None of the Collectors add any "power" in the sense that
they can all be written as raw reduce; but they do add expressiveness.
Each one
you take away makes some clearly imaginable use case harder.  And each 
one you add moves us closer to combinator overload.

For example, suppose we take away mapping(T->U, Collector<U>).  The user 
wants to compute "average sale by salesman".  He sees 
groupBy(Txn::seller), but that gives him a Collection<Txn>, not what he 
wants.  He sees groupBy(Txn::seller, Collector<Txn>), and he sees 
toStatistics which will give him the average/min/max he wants, but he 
can't bridge the two.  So he has to either do it in two passes, or write 
his own averaging reducer.  Which isn't terribly hard but he'd rather 
re-use the one in the library.

Adding in mapping(T->U, Collector<U>) lets him write

   .collect(groupBy(Txn::seller,
                    mapping(Txn::amount, toLongStatistics)))
   .getMean()

and be done -- and still readable -- and obviously correct.
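
For reference, here is a runnable version of that pipeline written
against java.util.stream.Collectors as it ultimately shipped, where
the closest equivalent of mapping(Txn::amount, toLongStatistics) is
summarizingLong(Txn::amount), the statistics object exposes
getAverage() rather than getMean(), and you look up one seller's
entry before asking for the average; the Txn class here is a
hypothetical stand-in:

    import java.util.LongSummaryStatistics;
    import java.util.Map;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    public class AverageSaleBySalesman {
        // Hypothetical transaction type, just enough for the example.
        static final class Txn {
            final String seller;
            final long amount;
            Txn(String seller, long amount) {
                this.seller = seller;
                this.amount = amount;
            }
            String seller() { return seller; }
            long amount()   { return amount; }
        }

        public static void main(String[] args) {
            Map<String, LongSummaryStatistics> statsBySeller =
                Stream.of(new Txn("ann", 100), new Txn("bob", 250),
                          new Txn("ann", 300))
                      .collect(Collectors.groupingBy(
                          Txn::seller,
                          Collectors.summarizingLong(Txn::amount)));
            // Average sale for one salesman:
            System.out.println(statsBySeller.get("ann").getAverage());  // 200.0
        }
    }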

For every single one of these, we could make the argument "we don't need 
it because it's ten lines of code the user could write if he needs it" (all
the Collectors are tiny); then again for every single one of them, we 
could make the argument that it's self-contained and useful for 
realistic use cases.

So in the end the "right" set will be highly subjective.  Personally, I 
think we've got just about the right set of operations, but maybe too 
many flavors of each.  (Note we already took away the flatMap-like 
flavors of groupBy, where each input element can be mapped to multiple 
output elements, which cut the number of combinations in half.)
And maybe we could cut back on the variations (e.g., eliminate the
forms that let you provide your own Map constructor, so you always
just get a HashMap).  Or maybe we have the right forms and flavors,
but we
need a more Builder-like API to regularize it.  Or maybe slicing them 
differently will be less confusing.  Or more confusing.
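
For what it's worth, here is what the "provide your own Map
constructor" flavor buys you, sketched with the groupingBy overload
in java.util.stream.Collectors that takes a map factory (names differ
from the draft discussed here):

    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    public class MapFactoryFlavor {
        public static void main(String[] args) {
            // Supplying TreeMap::new instead of accepting the default
            // HashMap yields keys in sorted order.
            Map<Integer, List<String>> byLength = Stream.of("pear", "fig", "plum")
                .collect(Collectors.groupingBy(
                    String::length,
                    TreeMap::new,
                    Collectors.toList()));
            System.out.println(byLength);  // {3=[fig], 4=[pear, plum]}
        }
    }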

So, constructive input welcome!

