Fwd: Forms for reduce() -- part 2
Brian Goetz
brian.goetz at oracle.com
Thu Jan 3 14:32:50 PST 2013
This got caught in the filters -- it's old and probably out of date, but
should go onto the record.
-------- Original Message --------
Subject: Forms for reduce() -- part 2
Date: Mon, 10 Dec 2012 16:41:38 -0500
From: Brian Goetz <brian at briangoetz.com>
To: lambda-libs-spec-experts at openjdk.java.net
<lambda-libs-spec-experts at openjdk.java.net>
This second note is about the proposed overhaul of groupBy/reduceBy.
I've already outlined why I don't like these methods -- they both tie
the API to Map/Collection, while at the same time giving users
relatively limited ability to handle more than a few simple cases.
We don't *need* groupBy/reduceBy, since groupBy is just reduceBy (with
the row reducer being reduce(ArrayList::new, ArrayList::add)) and
reduceBy is just reduce, with a suitably complicated implementation of
the reducer functions. But we don't want users to have to build their
own groupBy/reduceBy out of pure reduce -- it's too much work, too
error-prone, etc. We should provide canned versions. The current
versions are an example of what we might do, but I think we can do much
better.
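To make the claim concrete, here is a minimal, self-contained sketch (helper names are mine, not the draft API) of groupBy phrased as nothing but a mutable reduce, with the row handling done in the accumulator:

```java
import java.util.*;
import java.util.function.*;

// Hypothetical helper names: a sketch of groupBy built purely out of a
// generic mutable reduce, as the argument above suggests.
class ReduceSketch {
    // A generic mutable reduce: seed a container, fold each element in.
    static <T, R> R reduce(Iterable<T> src, Supplier<R> seed, BiConsumer<R, T> acc) {
        R result = seed.get();
        for (T t : src)
            acc.accept(result, t);
        return result;
    }

    // groupBy as reduce: the outer reduce seeds a HashMap, and the "row
    // reducer" is the (ArrayList::new, ArrayList::add) pair in spirit.
    static <T, U> Map<U, List<T>> groupBy(Iterable<T> src, Function<T, U> classifier) {
        return reduce(src, HashMap::new,
                (m, t) -> m.computeIfAbsent(classifier.apply(t),
                                            k -> new ArrayList<>()).add(t));
    }
}
```

This is exactly the kind of boilerplate users should not have to write themselves, which is the case for canned versions.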
There is also the issue that I don't think we've correctly surfaced the
issues surrounding when encounter order must be maintained, and when we
can relax these for better performance. Currently, we guess whether a
stream has a defined encounter order or not based on its source; Lists
do, non-sorted-Sets don't. But this is only indirectly coupled to
whether the user *cares* about this order or not. The "unordered()"
hack is helpful, but it's a hack.  The second factor here is
associativity. If your reducing function is merely associative, which
is the only reasonable assumption, then you MUST care about order. But
many are also commutative (sum, min, max). The user knows this, but the
framework currently does not, so it has to be conservative.
The two ways to do a mutable reduce are:
- Use mutable containers for each leaf, and merge them as you go up
the tree;
- Use a single, shared, concurrent mutable container (like a CHM.)
IF you don't care about encounter order, OR your reduction is
commutative, you can use the latter approach -- which is more like a
forEach than a reduce.
Currently some ops have two implementations, a merging one and a
contention-based one, keyed off of ORDERED.  This is both complex for us to
maintain and also not always what the user wants. Better to let the
user just say.
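For illustration, the two strategies can be sketched in today's java.util.stream terms (the shipped API, not the draft discussed here): the merging form is what a parallel collect with groupingBy does, and the shared-container form is a forEach into a ConcurrentHashMap.

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.stream.*;

// Sketch contrasting the two strategies described above (class and
// method names are mine, for illustration only).
class TwoStrategies {
    // Strategy 1: per-leaf containers merged up the tree; respects
    // encounter order.
    static Map<Integer, List<String>> merging(List<String> words) {
        return words.parallelStream()
                    .collect(Collectors.groupingBy(String::length));
    }

    // Strategy 2: one shared concurrent container, elements tossed in as
    // they arrive -- closer to a forEach; only appropriate when the user
    // does not care about encounter order (or the reduction is commutative).
    static Map<Integer, List<String>> shared(List<String> words) {
        ConcurrentMap<Integer, List<String>> m = new ConcurrentHashMap<>();
        words.parallelStream().forEach(w ->
                m.computeIfAbsent(w.length(),
                                  k -> Collections.synchronizedList(new ArrayList<>()))
                 .add(w));
        return m;
    }
}
```

Both produce the same groups; they differ in whether the order of elements within each group is deterministic.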
Since all problems are solved by another layer of indirection, I will
propose a new abstraction -- "tabulation". Tabulation is the process of
creating a structured description of an input stream. Tabulations could
describe:
- groupBy -- create a Map<U,Collection<T>> given a classifying
function T->U
- reduceBy -- create a Map<U,V> given a classifying function T->U and
a reducer that reduces a stream of T to V
- partition -- partition elements into two lists/arrays/etc based on
some predicate
- materialized function application / joins -- given a function T->U,
produce a Map<T,U> whose keys are the values in the stream and whose values
are the result of applying the function
- more sophisticated combinations of the above, such as a two-level
groupBy or reduceBy (Map<Buyer, Map<Seller, Collection<Transaction>>>)
The latter cannot be handled at all with our current tools, nor can the
current tools produce a Guava Multimap instead of a HashMap, or do Don's
"groupByMulti". The only benefit to the current groupBy/reduceBy tools
is that they automate some of the mechanics of producing fake multimaps.
And even this advantage mostly goes away with the addition of
Map.{merge,compute,computeIfAbsent}.
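As a point of comparison, here is a sketch of the fake-multimap mechanics that Map.computeIfAbsent automates, plus Map.merge for the reduceBy-style counting case (class and method names are mine):

```java
import java.util.*;

// Sketch of the "fake multimap" and counting idioms that the new Map
// default methods make nearly one-liners.
class MapMethodsSketch {
    // computeIfAbsent: get-or-create the row collection, then add.
    static Map<Character, List<String>> byInitial(List<String> words) {
        Map<Character, List<String>> mm = new HashMap<>();
        for (String w : words)
            mm.computeIfAbsent(w.charAt(0), k -> new ArrayList<>()).add(w);
        return mm;
    }

    // merge: fold a new value into an existing mapping, seeding on absence.
    static Map<Character, Integer> countByInitial(List<String> words) {
        Map<Character, Integer> counts = new HashMap<>();
        for (String w : words)
            counts.merge(w.charAt(0), 1, Integer::sum);
        return counts;
    }
}
```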
Use cases -- tabulation
-----------------------
Given the following domain model, here are some use cases for tabulation:
class Document {
    Author author();
    Editor editor();
    int pages();
}
1. Group documents by author -- given a Stream<Document>, produce a
MapLike<Author, CollectionLike<Document>>. Here, the user may want
control over what kind of MapLike and what kind of CollectionLike to use.
2. Group documents by author and editor -- given a Stream<Document>,
produce a Map<Author, Map<Editor, Collection<Document>>>.
3. Partition documents into collections of "shorter than 50 pages" and
"longer than 50 pages".
4. Count of documents by author -- produce a Map<Author, Integer>.
5. Longest doc by author -- produce a Map<Author, Document>.
6. Sum of pages by author -- produce a Map<Author, Integer>.
7. Authors of documents joined with additional data -- produce a
Map<Author, V> for some function Author->V.
Our current stuff can do (1), (4), (5), and (6), but not (2), (3), or
(7) easily.
Let's assume we have an interface that describes Reducer<T,R> (takes a
set of T values and produces an R.)
Then we can create canned reducers, such as the Average from the
previous note:
static final Reducer<Integer,Double> AVERAGE = reducer(...);
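One self-contained way to sketch that shape (hypothetical; the interface here is only the simplest form consistent with "takes a set of T values and produces an R", not the draft API):

```java
// Hypothetical sketch of the Reducer<T,R> shape assumed above: something
// that consumes a run of T values and yields an R.
interface Reducer<T, R> {
    R reduce(Iterable<T> elements);
}

class Reducers {
    // The AVERAGE reducer from the previous note, written out directly.
    static final Reducer<Integer, Double> AVERAGE = ints -> {
        long sum = 0, n = 0;
        for (int i : ints) {
            sum += i;
            n++;
        }
        return n == 0 ? 0.0 : (double) sum / n;
    };
}
```

A real version would also need the seed/accumulate/combine decomposition so it can run in parallel, but this is enough to anchor the combinators below.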
Then we define:
interface MutableTabulator<T,R> extends Reducer<T,R> { }
And we define combinators for tabulators like:
// traditional groupBy
// will want versions to default to HashMap/ArrayList
static <T, U,
        C extends Collection<T>,
        M extends Map<U, C>>
Tabulator<T, M>
groupBy(Function<T,U> classifier,
        Supplier<M> mapFactory,
        Supplier<C> rowFactory) { ... }

// nested groupBy
static <T, U, D,
        M extends Map<U, D>>
Tabulator<T, M>
groupBy(Function<T,U> classifier,
        Supplier<M> mapFactory,
        Tabulator<T, D> downstreamTabulator) { ... }

// Take a Stream<T> and a Function<T,U> and create Map<T,U>
static <T, U, M extends Map<T,U>>
Tabulator<T, M>
mappedTo(Function<T,U> mapper, Supplier<M> mapFactory) { ... }

// What we were calling reduceBy
// Will want other reduce variants
static <T, U, I,
        M extends Map<U,I>>
Tabulator<T,M>
groupedReduce(Function<T,U> classifier,
              Supplier<I> baseFactory,
              BiBlock<I, T> acceptElement,
              BiBlock<I, I> combiner) { ... }
These are easy to define and users can define their own. We'll have a
dozen or two tabulator factories and combinators, not so bad, and if we
forget one, no worry. Guava can easily define a groupBy that groups to
a MultiMap. Etc.
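For comparison, the downstream-tabulator composition has the same shape as a grouping collector that takes a downstream collector in today's java.util.stream (a shipped analogue, not the Tabulator API sketched here); a self-contained example over strings:

```java
import java.util.*;
import java.util.stream.*;

// Two-level grouping via a downstream collector: the same composition
// that groupBy(classifier, mapFactory, downstreamTabulator) describes.
// Class and method names are mine, for illustration.
class NestedGroupBy {
    static Map<Integer, Map<Character, List<String>>> byLenThenInitial(List<String> ws) {
        return ws.stream().collect(
                Collectors.groupingBy(String::length,                 // outer level
                        Collectors.groupingBy(s -> s.charAt(0))));    // nested level
    }
}
```

Because the nested level is itself just another tabulator/collector, arbitrary depths and mixtures (group-then-reduce, group-then-partition) compose the same way.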
So, our use cases using these:
1. Group documents by author
Map<Author, List<Document>> m
    = docs.tabulate(groupBy(Document::author,
                            HashMap::new, ArrayList::new));
2. Group documents by author and editor
Map<Author, Map<Editor, List<Document>>> m
    = docs.tabulate(groupBy(Document::author, HashMap::new,
                            groupBy(Document::editor)));
3. Partition documents into collections of "shorter than 50 pages" and
"longer than 50 pages"
List<List<Document>> l
    = docs.tabulate(partitionBy(d -> d.pages() <= 50,
                                ArrayList::new, ArrayList::add));
4. Count of documents by author -- produce a Map<Author, Integer>.
Map<Author, Integer> m
    = docs.tabulate(groupedReduce(Document::author,
                                  () -> 0,
                                  (s, d) -> s + 1,
                                  Integer::sum));
5. Longest doc by author -- produce a Map<Author, Document>.
Map<Author, Document> m
    = docs.tabulate(groupedReduce(Document::author,
                                  (d1, d2) -> (d1.pages() >= d2.pages()) ? d1 : d2,
                                  (d1, d2) -> (d1.pages() >= d2.pages()) ? d1 : d2));
6. Sum of pages by author -- produce a Map<Author, Integer>.
Map<Author, Integer> m
    = docs.tabulate(groupedReduce(Document::author,
                                  () -> 0,
                                  (s, d) -> s + d.pages(),
                                  Integer::sum));
7. Authors of documents joined with additional data -- produce a
Map<Author, V> for some function Author->V.
Map<Author, V> m =
    docs.map(Document::author).uniqueElements()
        .tabulate(mappedTo(a -> f(a), HashMap::new));
Overall, I think these forms are fairly usable. There are a pile of
factories and combinators for reducers/tabulators, which can be combined
to do whatever you want. Users can create new ones, or can compose them
into canned tabulators.
The last bit is the separation of fold-style functional merging from
contention-based "toss it in a big ConcurrentHashMap." I think the
answer here is to have two forms of tabulators. Let's call them
Tabulator and ConcurrentTabulator for sake of discussion. For each
canned tabulator form, there'd be a merging and a concurrent form. Now
the choice is in the user's hands; if they don't care about encounter
order or have a better-than-associative combining function, they can
select the latter, which is more like a forEach than a reduce.
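A shipped analogue of the concurrent form (not the Tabulator/ConcurrentTabulator names above, which are for discussion only) is the concurrent grouping collector, where the user explicitly opts into one shared ConcurrentHashMap and gives up encounter order within groups:

```java
import java.util.*;
import java.util.concurrent.ConcurrentMap;
import java.util.stream.*;

// The "concurrent form": one shared ConcurrentHashMap, chosen explicitly
// by the user.  Element order within each group is not guaranteed under
// parallel execution.  Class and method names are mine.
class ConcurrentForm {
    static ConcurrentMap<Integer, List<String>> grouped(List<String> ws) {
        return ws.parallelStream()
                 .collect(Collectors.groupingByConcurrent(String::length));
    }
}
```

The point is exactly the one made above: the choice between merging and contending is surfaced in the API, so the framework no longer has to guess from ORDERED.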
It sounds like a lot of new surface area, but it really isn't. We need
a few forms of reduce, an abstraction for Reducer, an abstraction for
Tabulator, and some factories and combinators which are individually
trivial to write.  It moves a lot of the "can't define your own stream
ops" problems into a domain where users can define their own reducers
and tabulators.