Forms for reduce() -- part 2

Brian Goetz brian.goetz at oracle.com
Mon Dec 10 13:46:36 PST 2012


This second note is about the proposed overhaul of groupBy/reduceBy. 
I've already outlined why I don't like these methods -- they both tie 
the API to Map/Collection, while at the same time giving users 
relatively limited ability to handle more than a few simple cases.

We don't *need* groupBy/reduceBy, since groupBy is just reduceBy (with 
the row reducer being reduce(ArrayList::new, ArrayList::add)) and 
reduceBy is just reduce, with a suitably complicated implementation of 
the reducer functions.  But we don't want users to have to build their 
own groupBy/reduceBy out of pure reduce -- it's too much work, too 
error-prone, etc.  We should provide canned versions.  The current 
versions are an example of what we might do, but I think we can do much 
better.
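
To make the cost concrete, here is roughly what a hand-rolled groupBy
looks like as a raw mutable reduce.  This is a sketch only, assuming the
reduce(seedFactory, accumulator, combiner) form from the previous note,
the Map.computeIfAbsent method discussed below, and the Document/Author
domain model from the use cases below:

     // Sketch: groupBy built by hand from raw mutable reduce
     Map<Author, Collection<Document>> byAuthor =
         docs.reduce(HashMap::new,
                     (map, d) -> map.computeIfAbsent(d.author(),
                                                     a -> new ArrayList<>())
                                    .add(d),
                     (m1, m2) -> m2.forEach((a, c) ->
                         m1.computeIfAbsent(a, x -> new ArrayList<>())
                           .addAll(c)));

Every user gets to re-derive the merge logic, and every user gets it
subtly wrong in a different way.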

There is also the issue that I don't think we've correctly surfaced the 
issues surrounding when encounter order must be maintained, and when we 
can relax these for better performance.  Currently, we guess whether a 
stream has a defined encounter order or not based on its source; Lists 
do, non-sorted-Sets don't.  But this is only indirectly coupled to 
whether the user *cares* about this order or not.  The "unordered()" 
hack is helpful, but it's a hack.  The second factor here is 
associativity.  If your reducing function is merely associative -- and 
associativity is the weakest property we can reasonably demand -- then 
you MUST care about encounter order.  But many reducing functions are 
also commutative (sum, min, max).  The user knows this, but the 
framework currently does not, so it has to be conservative.
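
To make the distinction concrete (a sketch, assuming streams named
strings and ints and the reduce(base, op) form):

     // Associative only: the tree shape of the reduction is free, but
     // the encounter order of the elements must be preserved.
     String joined = strings.reduce("", String::concat);

     // Associative AND commutative: any order works, so encounter order
     // could safely be relaxed -- if only the framework knew.
     int total = ints.reduce(0, Integer::sum);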

The two ways to do a mutable reduce are:
  - Use mutable containers for each leaf, and merge them as you go up 
the tree;
  - Use a single, shared, concurrent mutable container (like a CHM).

IF you don't care about encounter order, OR your reduction is 
commutative, you can use the latter approach -- which is more like a 
forEach than a reduce.
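
For example, a count-by-author, which is commutative, can use the second 
approach directly -- every thread tosses results into one shared map.  A 
sketch, assuming a parallel Stream<Document> and the 
ConcurrentHashMap.merge method mentioned below:

     // Shared-container approach: no per-leaf containers, no merging,
     // just contended updates to a single ConcurrentHashMap.
     ConcurrentMap<Author, Integer> counts = new ConcurrentHashMap<>();
     docs.parallel().forEach(d -> counts.merge(d.author(), 1, Integer::sum));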

Currently some ops have two implementations, a merging one and a 
contending one, keyed off of the ORDERED flag.  This is both complex for 
us to maintain and not always what the user wants.  Better to let the 
user just say.


Since all problems are solved by another layer of indirection, I will 
propose a new abstraction -- "tabulation".  Tabulation is the process of 
creating a structured description of an input stream.  Tabulations could 
describe:
  - groupBy -- create a Map<U,Collection<T>> given a classifying 
function T->U
  - reduceBy -- create a Map<U,V> given a classifying function T->U and 
a reducer that reduces a stream of T to V
  - partition -- partition elements into two lists/arrays/etc based on 
some predicate
  - materialized function application / joins -- given a function T->U, 
produce a Map<T,U> whose keys are the values of the stream and whose 
values are the results of applying the function
  - more sophisticated combinations of the above, such as a two-level 
groupBy or reduceBy (Map<Buyer, Map<Seller, Collection<Transaction>>>)

The latter cannot be handled at all with our current tools, nor can the 
current tools produce a Guava Multimap instead of a HashMap, or do Don's 
"groupByMulti".  The only benefit to the current groupBy/reduceBy tools 
is that they automate some of the mechanics of producing fake 
multimaps.  And even this advantage mostly goes away with the addition 
of Map.{merge,compute,computeIfAbsent}.
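
With computeIfAbsent, for instance, the get-check-put dance for a fake 
multimap collapses to one line (map, key, and value are stand-ins here):

     map.computeIfAbsent(key, k -> new ArrayList<>()).add(value);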


Use cases -- tabulation
-----------------------

Given the following domain model, here are some use cases for tabulation:

interface Document {
     Author author();
     Editor editor();
     int pages();
}


1.  Group documents by author -- given a Stream<Document>, produce a 
MapLike<Author, CollectionLike<Document>>.  Here, the user may want 
control over what kind of MapLike and what kind of CollectionLike to use.

2.  Group documents by author and editor -- given a Stream<Document>, 
produce a Map<Author, Map<Editor, Collection<Document>>>.

3.  Partition documents into collections of "50 pages or fewer" and 
"more than 50 pages".

4.  Count of documents by author -- produce a Map<Author, Integer>.

5.  Longest doc by author -- produce a Map<Author, Document>.

6.  Sum of pages by author -- produce a Map<Author, Integer>.

7.  Authors of documents joined with additional data -- produce a 
Map<Author, V> for some function Author->V.

Our current stuff can do (1), (4), (5), and (6), but not (2), (3), or 
(7) easily.


Let's assume we have an interface that describes a Reducer<T,R> (takes 
a stream of T values and produces an R).
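
One plausible shape for it -- the names here are placeholders, not a 
settled design, and this elides the separate intermediate-state type 
that the Average example needs:

   interface Reducer<T, R> {
       R makeResult();                    // fresh, empty result container
       void accept(R result, T element);  // fold one element into a result
       R combine(R left, R right);        // merge two partial results
   }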

Then we can create canned reducers, such as the Average from the 
previous note:

   static final Reducer<Integer,Double> AVERAGE = reducer(...);

Then we define:

   interface Tabulator<T,R> extends Reducer<T,R> { }

And we define combinators for tabulators like:

     // traditional groupBy
     // will want versions to default to HashMap/ArrayList
     static <T, U,
             C extends Collection<T>,
             M extends Map<U, C>>
     Tabulator<T, M>
     groupBy(Function<T,U> classifier,
             Supplier<M> mapFactory,
             Supplier<C> rowFactory) { ... }

     // nested groupBy
     static <T, U, D,
             M extends Map<U, D>>
     Tabulator<T, M>
     groupBy(Function<T,U> classifier,
             Supplier<M> mapFactory,
             Tabulator<T, D> downstreamTabulator) { ... }

     // Take a Stream<T> and a Function<T,U> and create Map<T,U>
     static <T, U, M extends Map<T,U>>
     Tabulator<T, M>
     mappedTo(Function<T,U> mapper, Supplier<M> mapFactory) { ... }

     // What we were calling reduceBy
     // Will want other reduce variants (e.g., without a base value)
     static <T, U, I,
             M extends Map<U,I>>
     Tabulator<T,M>
     groupedReduce(Function<T,U> classifier,
                   Supplier<I> baseFactory,
                   BiFunction<I, T, I> acceptElement,
                   BinaryOperator<I> combiner) { ... }

These are easy to define, and users can define their own.  We'll have a 
dozen or two tabulator factories and combinators -- not so bad, and if 
we forget one, no worries.  Guava can easily define a groupBy that 
groups to a Multimap.  Etc.
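
To show how little machinery is involved, here is a sketch of the first 
groupBy combinator, assuming a hypothetical tabulator(make, accept, 
combine) factory that bundles the three Reducer functions from above:

     static <T, U, C extends Collection<T>, M extends Map<U, C>>
     Tabulator<T, M> groupBy(Function<T, U> classifier,
                             Supplier<M> mapFactory,
                             Supplier<C> rowFactory) {
         return tabulator(
             mapFactory,                          // fresh result map
             (map, t) -> map.computeIfAbsent(     // fold in one element
                             classifier.apply(t),
                             u -> rowFactory.get()).add(t),
             (m1, m2) -> {                        // merge two partial maps
                 m2.forEach((u, c) ->
                     m1.computeIfAbsent(u, x -> rowFactory.get())
                       .addAll(c));
                 return m1;
             });
     }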


So, our use cases using these:

1.  Group documents by author

Map<Author, List<Document>> m
     = docs.tabulate(groupBy(Document::author,
                             HashMap::new, ArrayList::new));

2.  Group documents by author and editor

Map<Author, Map<Editor, List<Document>>> m
     = docs.tabulate(groupBy(Document::author, HashMap::new,
                             groupBy(Document::editor)));

3.  Partition documents into collections of "50 pages or fewer" and 
"more than 50 pages"

List<List<Document>> l
     = docs.tabulate(partitionBy(d -> d.pages() <= 50,
                                 ArrayList::new, ArrayList::add));

4.  Count of documents by author -- produce a Map<Author, Integer>.

Map<Author, Integer> m
    = docs.tabulate(groupedReduce(Document::author,
                                  () -> 0,
                                  (s, d) -> s + 1,
                                  Integer::sum));

5.  Longest doc by author -- produce a Map<Author, Document>.

// uses a no-base reduce variant of groupedReduce
Map<Author, Document> m
    = docs.tabulate(groupedReduce(Document::author,
                       (d1, d2) -> (d1.pages() >= d2.pages()) ? d1 : d2,
                       (d1, d2) -> (d1.pages() >= d2.pages()) ? d1 : d2));

6.  Sum of pages by author -- produce a Map<Author, Integer>.

Map<Author, Integer> m
    = docs.tabulate(groupedReduce(Document::author,
                                  () -> 0,
                                  (s, d) -> s + d.pages(),
                                  Integer::sum));

7.  Authors of documents joined with additional data -- produce a 
Map<Author, V> for some function Author->V.

// assuming some function f : Author -> Integer
Map<Author, Integer> m =
     docs.map(Document::author).uniqueElements()
         .tabulate(mappedTo(a -> f(a), HashMap::new));


Overall, I think these forms are fairly usable.  There is a pile of 
factories and combinators for reducers/tabulators, which can be combined 
to do whatever you want.  Users can create new ones, or can compose them 
into canned tabulators.


The last bit is the separation of fold-style functional merging from 
contention-based "toss it in a big ConcurrentHashMap."  I think the 
answer here is to have two forms of tabulators.  Let's call them 
Tabulator and ConcurrentTabulator for sake of discussion.  For each 
canned tabulator form, there'd be a merging and a concurrent form.  Now 
the choice is in the user's hands; if they don't care about encounter 
order, or have a better-than-associative (i.e., commutative) combining 
function, they can select the latter, which is more like a forEach than 
a reduce.
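
In code, the choice might look like this (groupByConcurrent is a 
hypothetical name, and the row collection would need to tolerate 
concurrent insertion):

     // Merging form: respects encounter order, merges per-leaf maps.
     Map<Author, List<Document>> ordered =
         docs.tabulate(groupBy(Document::author,
                               HashMap::new, ArrayList::new));

     // Concurrent form: the user declares that order doesn't matter,
     // so all threads update one shared ConcurrentHashMap.
     ConcurrentMap<Author, List<Document>> unordered =
         docs.tabulate(groupByConcurrent(Document::author,
                                         ConcurrentHashMap::new,
                                         ArrayList::new));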


It sounds like a lot of new surface area, but it really isn't.  We need 
a few forms of reduce, an abstraction for Reducer, an abstraction for 
Tabulator, and some factories and combinators which are individually 
trivial to write.  It moves a lot of the "can't define your own stream 
ops" problem into a domain where users can define their own reducers 
and tabulators.


