accumulate locally and mutatively but combine concurrently (a use case)

Brian Goetz brian.goetz at oracle.com
Fri May 10 15:44:19 PDT 2013


> But, the term "CONCURRENT" (as a mode bit, as defined in Collector)
> doesn't seem to fit into my use case at all, so I see how that word
> hurts the clarity of my question.

By default, everything about streams has the framework provide the 
necessary isolation, partitioning, and serial thread-confinement.  The 
enormous dividend from that is: you can do bulk data-parallel operations 
on *non-thread-safe, parallelism-ignorant* data structures, like 
ArrayList, as long as you follow a few simple rules:

  - Don't mutate your data source while you're querying it (i.e., you do 
queries on effectively immutable sources, which doesn't turn out to be a 
big restriction, in practice);
  - Provide well-behaved functions to the stream methods (the lambdas 
passed to filter, map, etc.)  Here, well-behaved can be approximated by 
"pure" or "side-effect-free"; the actual definition is slightly more 
complicated, but if you follow either of these sensible guidelines, 
you're safe.  (In other words, don't do stupid stuff like use stateful 
predicates or mappers.)
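
To make the two rules concrete, here is a minimal sketch of a parallel 
query over a plain ArrayList using only stateless lambdas.  It assumes 
the method names of java.util.stream as they eventually shipped in Java 
8, which may not match the lambda-dev builds of this date; the class and 
method names (WellBehaved, evenSquares) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical demo class; not part of any library.
class WellBehaved {
    // The source is only read while the query runs, and both lambdas are
    // stateless, so the framework is free to split the work across threads.
    static List<Integer> evenSquares(List<Integer> source) {
        return source.parallelStream()
                     .filter(x -> x % 2 == 0)   // pure predicate
                     .map(x -> x * x)           // pure mapper
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();   // not thread-safe, and that is fine
        for (int i = 1; i <= 6; i++) {
            data.add(i);
        }
        System.out.println(evenSquares(data));
    }
}
```

A predicate that remembered what it had already seen (say, by adding to 
a shared HashSet) would break the second rule, and its results could 
vary from run to run.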

Our approach to mutable reduction is the same; you assume that your 
basic mutable builders like ArrayList are not thread-safe, have the 
framework provide isolation / serial confinement / safe handoffs, and 
you can use these non-thread-safe data structures as targets for mutable 
reductions as well as sources.
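
As a rough sketch of what that looks like, the three-argument collect() 
below reduces into plain ArrayLists.  Again this assumes the Stream API 
as shipped in Java 8; MutableReduction and gather are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of any library.
class MutableReduction {
    static ArrayList<String> gather(List<String> source) {
        // supplier:    a fresh ArrayList for each isolated piece of work
        // accumulator: adds one element to the list a thread is working on
        // combiner:    folds the right-hand partial list into the left-hand one
        ArrayList<String> result = source.parallelStream()
            .collect(ArrayList::new, ArrayList::add, ArrayList::addAll);
        return result;
    }
}
```

No ArrayList here is ever touched by two threads at once; the framework 
hands each partial result off safely before the combiner sees it.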

We think this is a huge adoption benefit; most of the world's data is 
locked up in ArrayList and HashMap, so if users had to move their data 
just to query it, that would be placing a hurdle in front of them.

But... there are also (courtesy of Doug) some data structures that are 
designed for concurrent modification, like ConcurrentHashMap.  In some 
cases, doing a groupBy by blasting elements into the same CHM from many 
threads at once is more performant than making a bunch of isolated small 
maps and merging them, because merge-by-key is not something that 
HashMap/TreeMap do all that well.

So CONCURRENT means: we don't need no stinkin' isolation.  There's only 
one target; many threads blast data into it; then you're done and you 
return that.  Currently, CHM is really the only such beast, but it's a 
workhorse.  So everything about the CONCURRENT bit in Collector (as well 
as in Spliterator) is extrapolated from that.
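
For comparison, here is a sketch of the two grouping styles side by 
side.  It assumes the Collectors.groupingBy and 
Collectors.groupingByConcurrent factories from the Java 8 release (the 
"groupBy" mentioned above predates that naming); Grouping and its 
methods are hypothetical names.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentMap;
import java.util.stream.Collectors;

// Hypothetical demo class; not part of any library.
class Grouping {
    // Isolated style: each piece of work fills its own map, and the
    // partial maps are then merged key by key.
    static Map<Integer, List<String>> byLengthIsolated(List<String> words) {
        return words.parallelStream()
                    .collect(Collectors.groupingBy(String::length));
    }

    // Concurrent style: all threads insert into one shared ConcurrentMap,
    // so there is no merge step at the end.
    static ConcurrentMap<Integer, List<String>> byLengthConcurrent(List<String> words) {
        return words.parallelStream()
                    .collect(Collectors.groupingByConcurrent(String::length));
    }
}
```

The price of the concurrent version is ordering: elements within each 
group may arrive in any order, since many threads insert at once.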

> The use case could be called "accumulate locally and mutatively but
> combine functionally and non-mutatively".

The design of Collector can handle that without modification, then, but 
what you lose is the opportunity to seal/trim/freeze.  If your combiner 
function always creates a new object, you're good.

For example, imagine we had a method:

   static<T> List<T> concatView(List<T> a, List<T> b) { ... }

Then you could specify resultSupplier = ArrayList::new, accumulator = 
ArrayList::add, combiner = Collections::concatView.

The combiner could do its own sealing/trimming/freezing if it likes. 
So, you might have what you want already?
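
Here is one way that could be sketched.  Everything in it is an 
assumption for illustration: concatView is not a real JDK method (so it 
lives in a made-up ConcatCollector class rather than in Collections), 
and Collector.of is the factory from the Java 8 release rather than 
whatever the lambda-dev builds of this date offered.

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collector;

// Hypothetical demo class; not part of any library.
class ConcatCollector {
    // Read-only view of a followed by b.  Nothing is copied, and neither
    // argument is mutated; a new object is created on every call.
    static <T> List<T> concatView(List<T> a, List<T> b) {
        return new AbstractList<T>() {
            @Override
            public T get(int i) {
                int n = a.size();
                return i < n ? a.get(i) : b.get(i - n);
            }

            @Override
            public int size() {
                return a.size() + b.size();
            }
        };
    }

    // Accumulate mutatively into ArrayLists, combine non-mutatively.
    static <T> Collector<T, List<T>, List<T>> toConcatList() {
        return Collector.of(ArrayList::new, List::add, ConcatCollector::concatView);
    }
}
```

Used as stream.collect(ConcatCollector.toConcatList()), the result is a 
plain ArrayList when no combining happened and a tree of views when it 
did, which is the seal/trim/freeze gap mentioned above.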



