accumulate locally and mutatively but combine concurrently (a use case)
Brian Goetz
brian.goetz at oracle.com
Fri May 10 15:44:19 PDT 2013
> But, the term "CONCURRENT" (as a mode bit, as defined in Collector)
> doesn't seem to fit into my use case at all, so I see how that word
> hurts the clarity of my question.
By default, everything about streams has the framework provide the
necessary isolation, partitioning, and serial thread-confinement. The
enormous dividend from that is: you can do bulk data-parallel operations
on *non-thread-safe, parallelism-ignorant* data structures, like
ArrayList, as long as you follow a few simple rules:
- Don't mutate your data source while you're querying it (i.e., you do
queries on effectively immutable sources, which doesn't turn out to be a
big restriction, in practice);
- Provide well-behaved functions to the stream methods (the lambdas
passed to filter, map, etc.). Here, well-behaved can be approximated by
"pure" or "side-effect-free"; the actual definition is slightly more
complicated, but if you follow either of these sensible guidelines,
you're safe. (In other words, don't do stupid stuff like use stateful
predicates or mappers.)
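
To make the two rules concrete, here is a minimal sketch. It assumes the java.util.stream API as it later shipped in Java 8 (method names in the lambda builds of the time may differ); the class and method names are purely illustrative. The source is a plain ArrayList that is never mutated during the query, and both lambdas are stateless and side-effect-free.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative names only; nothing here is part of the JDK.
class PureLambdaDemo {
    // Sums the squares of the even numbers in a non-thread-safe list.
    // The framework partitions the source; each lambda only looks at
    // its own argument, so no synchronization is needed.
    static long sumOfEvenSquares(List<Integer> source) {
        return source.parallelStream()
                     .filter(n -> n % 2 == 0)          // stateless predicate
                     .mapToLong(n -> n.longValue() * n) // pure mapper
                     .sum();
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            data.add(i);
        }
        System.out.println(sumOfEvenSquares(data)); // 4 + 16 + 36 + 64 + 100 = 220
    }
}
```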
Our approach to mutable reduction is the same; you assume that your
basic mutable builders like ArrayList are not thread-safe, have the
framework provide isolation / serial confinement / safe handoffs, and
you can use these non-thread-safe data structures as targets for mutable
reductions as well as sources.
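
As a rough illustration of mutable reduction into a non-thread-safe target, the sketch below uses the three-argument Stream.collect as it appears in the released Java 8 API (an assumption relative to the builds discussed here). Each parallel task gets its own ArrayList from the supplier, fills it in isolation, and the framework merges the partial lists with the combiner; no list is ever touched by two threads at once. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative names only; nothing here is part of the JDK.
class MutableReductionDemo {
    static List<String> upperCased(List<String> source) {
        return source.parallelStream()
                     .map(String::toUpperCase)
                     .collect(ArrayList::new,      // one fresh container per task
                              ArrayList::add,      // accumulate locally
                              ArrayList::addAll);  // merge right into left
    }

    public static void main(String[] args) {
        List<String> in = new ArrayList<>(Arrays.asList("a", "b", "c", "d"));
        System.out.println(upperCased(in)); // [A, B, C, D]
    }
}
```

Encounter order is preserved because the source is ordered and the partial results are merged left to right.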
We think this is a huge adoption benefit; most of the world's data is
locked up in ArrayList and HashMap, so if users had to move their data
just to query it, that would be placing a hurdle in front of them.
But... there are also (courtesy of Doug) some data structures that are
designed for concurrent modification, like ConcurrentHashMap. In some
cases, doing a groupBy by blasting elements into the same CHM from many
threads at once is more performant than making a bunch of isolated small
maps and merging them, because merge-by-key is not something that
HashMap/TreeMap do all that well.
So CONCURRENT means: we don't need no stinkin' isolation. There's only
one target; many threads blast data into it; then you're done and you
return that. Currently, CHM is really the only such beast, but it's a
workhorse. So everything about the CONCURRENT bit in Collector (as well
as in Spliterator) is extrapolated from that.
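
The two grouping strategies can be sketched side by side. This assumes the Collectors.groupingBy and Collectors.groupingByConcurrent factories from the released Java 8 API, which postdate this message; the class and method names are illustrative. The first variant builds isolated maps per task and merges them by key; the second has all threads insert into a single shared ConcurrentMap.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentMap;
import java.util.stream.Collectors;

// Illustrative names only; nothing here is part of the JDK.
class GroupingDemo {
    // Isolated: each task fills its own map, partial maps are merged by key.
    static Map<Integer, List<String>> isolated(List<String> words) {
        return words.parallelStream()
                    .collect(Collectors.groupingBy(String::length));
    }

    // Concurrent: one shared map, many threads insert into it directly.
    static ConcurrentMap<Integer, List<String>> shared(List<String> words) {
        return words.parallelStream()
                    .collect(Collectors.groupingByConcurrent(String::length));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "bb", "cc", "ddd", "e");
        System.out.println(isolated(words));
        System.out.println(shared(words));
    }
}
```

The concurrent variant gives up encounter order inside each group, since elements arrive in whatever order the threads happen to deliver them. Which one is faster depends on the data and the number of distinct keys, so measure before choosing.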
> The use case could be called "accumulate locally and mutatively but
> combine functionally and non-mutatively".
The design of Collector can handle that without modification, then, but
what you lose is the opportunity to seal/trim/freeze. If your combiner
function always creates a new object, you're good.
For example, imagine we had a method:
static<T> List<T> concatView(List<T> a, List<T> b) { ... }
Then you could specify resultSupplier = ArrayList::new, accumulator =
ArrayList::add, combiner = Collections::concatView.
The combiner could do its own sealing/trimming/freezing if it likes.
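
One possible shape for this, as a sketch only: concatView is the imagined method from above (it is not in the JDK), implemented here as a read-only AbstractList over its two arguments. Collector.of is the factory from the released Java 8 API, where the supplier is simply called supplier rather than resultSupplier. The class name and toConcatList are hypothetical.

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collector;
import java.util.stream.IntStream;

// Illustrative names only; nothing here is part of the JDK.
class ConcatViewDemo {
    // Hypothetical helper: an unmodifiable view of a followed by b.
    // Sizes are captured once, so this assumes neither list is
    // modified after the view is created.
    static <T> List<T> concatView(List<T> a, List<T> b) {
        final int aSize = a.size();
        final int total = aSize + b.size();
        return new AbstractList<T>() {
            @Override
            public T get(int i) {
                return i < aSize ? a.get(i) : b.get(i - aSize);
            }

            @Override
            public int size() {
                return total;
            }
        };
    }

    // Accumulate mutatively into a task-local ArrayList; combine by
    // creating a new view instead of mutating either side.
    static <T> Collector<T, List<T>, List<T>> toConcatList() {
        return Collector.of(ArrayList::new, List::add, ConcatViewDemo::concatView);
    }

    public static void main(String[] args) {
        List<Integer> result = IntStream.range(0, 1000).boxed().parallel()
                                        .collect(toConcatList());
        System.out.println(result.size()); // 1000
    }
}
```

Because the combiner always returns a fresh object, nothing is accumulated into a combined result. The cost is that get walks a tree of views whose depth follows the split tree, and a sequential run returns the plain ArrayList because the combiner is never called.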
So, you might have what you want already?