Primitive streams

Fri Dec 28 09:55:24 PST 2012

The implementation currently has two versions of streams, reference and 
integer.  Let's checkpoint on the primitive specialization strategy, 
since it does result in a fair amount of code and API bloat (though not 
as bad as it looks, since many of the currently public abstractions will 
be made private).

So, let's start with the argument for specialized streams at all.

1.  Boxing costs.  Doing calculations like "sum of weights" in the boxed 
world is awful:

   int sumOfWeights = foos.map(Foo::getWeight).reduce(0, Integer::sum);

Here, all the weights will be boxed and unboxed just to add them up. 
Figure a 10x performance hit for that in the (many) cases where the VM 
doesn't save us.
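To make the comparison concrete, here is a minimal sketch of the same sum written both ways. It uses the java.util.stream names as they shipped in Java 8 (stream(), mapToInt, sum), which may differ from the draft API discussed here; the Foo class and both method names are hypothetical stand-ins.

```java
import java.util.List;

final class BoxingSketch {
    // Hypothetical element type standing in for Foo.
    static final class Foo {
        private final int weight;
        Foo(int weight) { this.weight = weight; }
        int getWeight() { return weight; }
    }

    // Boxed: map() yields a Stream<Integer>, so each weight is boxed
    // and then unboxed again inside Integer::sum.
    static int boxedSum(List<Foo> foos) {
        return foos.stream().map(Foo::getWeight).reduce(0, Integer::sum);
    }

    // Primitive: mapToInt() yields an IntStream, so the weights stay as ints.
    static int primitiveSum(List<Foo> foos) {
        return foos.stream().mapToInt(Foo::getWeight).sum();
    }
}
```

Both methods return the same value; only the second keeps the arithmetic in the unboxed domain.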

It is possible to mitigate this somewhat by having fused mapReduce 
methods, which we tried early on, such as:

   foos.mapReduce(Foo::getWeight, 0, Integer::sum)

Here, at least, all the reduction now happens in the unboxed domain. 
But the API is now nastier, and while the above is readable, it gets 
worse in less trivial examples where there are more mapper and reducer 
lambdas being passed as arguments and it's not obvious which is which. 
Plus the explosion of mapReduce forms: { Obj,int,long,double } x { 
reduce forms }.  Plus the combination of map, reduce, and fused 
mapReduce leaves users wondering when they should do which.  All to work 
around boxing.

This can be further mitigated by specialized fused operations for the 
most common reductions: sumBy(IntMapper), maxBy(IntMapper), etc. 
(Price: more overloads, more "when do I use what" confusion.)
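As a rough illustration of what the fused and specialized forms amount to, here is a sketch written as static helpers over an Iterable. The helper names and signatures are hypothetical, and ToIntFunction and IntBinaryOperator from java.util.function (Java 8) are assumed stand-ins for the draft's IntMapper and reducer types.

```java
import java.util.function.IntBinaryOperator;
import java.util.function.ToIntFunction;

final class FusedOpsSketch {
    // Hypothetical fused map+reduce: maps each element to an int and
    // folds the results without ever creating an Integer.
    static <T> int mapReduce(Iterable<T> source,
                             ToIntFunction<? super T> mapper,
                             int identity,
                             IntBinaryOperator reducer) {
        int result = identity;
        for (T t : source) {
            result = reducer.applyAsInt(result, mapper.applyAsInt(t));
        }
        return result;
    }

    // Specialized form for the most common reduction.
    static <T> int sumBy(Iterable<T> source, ToIntFunction<? super T> mapper) {
        return mapReduce(source, mapper, 0, Integer::sum);
    }

    // Returns Integer.MIN_VALUE for an empty source (a simplification).
    static <T> int maxBy(Iterable<T> source, ToIntFunction<? super T> mapper) {
        return mapReduce(source, mapper, Integer.MIN_VALUE, Math::max);
    }
}
```

Even in this small sketch the cost is visible: three entry points for one element type, before multiplying by long and double.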

So, summary so far: we can mitigate boxing costs by cluttering the API 
with lots of extra methods.  (But I don't think that gets us all the way.)

2.  Approachability.  Telling Java developers that the way to add up a 
bunch of numbers is to first recognize that integers form a monoid is 
likely to make them feel like the guy in this cartoon:

   http://howfuckedismydatabase.com/nosql/

Reduce is wonderful and powerful and going to confuse the crap out of 
80+% of Java developers.  (This was driven home to me dramatically when 
I went on the road with my "Lambdas for Java" talk and saw blank faces 
when I got to "reduce", even from relatively sophisticated audiences. 
It took a lot of tweaking -- and explaining -- to get it to the point 
where I didn't get a room full of blank stares.)

Simply put: I believe the letters "s-u-m" have to appear prominently in 
the API.  When people are ready, they can learn to see reduce as a 
generalization of sum(), but not until they're ready.  Forcing them to 
learn reduce() prematurely will hurt adoption.  (The sumBy approach 
above helps here too, again at a cost.)

3.  Numerics.  Adding up doubles is not as simple as reducing with 
Double::sum (unless you don't care about accuracy).  Having methods for 
numeric sums gives us a place to put such intelligence; general reduce 
does not.
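One example of the kind of intelligence a dedicated sum method could hide is compensated (Kahan) summation. The sketch below is only an illustration of that technique, not a statement of what the library does; the class and method names are hypothetical.

```java
final class CompensatedSumSketch {
    // Plain left-to-right addition, equivalent to reducing with Double::sum.
    static double naiveSum(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum;
    }

    // Kahan summation: carries a running estimate of the low-order bits
    // lost to rounding and feeds it back into the next addition.
    static double kahanSum(double[] values) {
        double sum = 0.0;
        double compensation = 0.0;
        for (double v : values) {
            double y = v - compensation;   // apply the pending correction
            double t = sum + y;            // low-order bits of y may be lost here
            compensation = (t - sum) - y;  // recover what was lost
            sum = t;
        }
        return sum;
    }
}
```

For an input of 1.0 followed by ten copies of 1e-16, naiveSum returns exactly 1.0 because every small addend is rounded away, while kahanSum returns a value greater than 1.0.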

4.  "Primitives all the way down".  While fused+specialized methods will 
mitigate many of the above, they only help at the very end of the chain. 
They don't help things farther up, where we often just want to 
generate streams of integers and operate on them as integers.  Like:

   intRange(0, 100).map(...).filter(...).sorted().forEach(...)

or

   integers().map(x -> x*x).limit(100).sum()
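For reference, here is a sketch of the two pipelines above as they could be written against java.util.stream.IntStream as it shipped in Java 8. IntStream.range and IntStream.iterate are assumed equivalents of the draft's intRange and integers(), and starting the infinite stream at 0 is also an assumption; the class and method names are hypothetical.

```java
import java.util.stream.IntStream;

final class IntPipelineSketch {
    // Counterpart of intRange(0, n).map(...).filter(...).sorted(),
    // collected into an int[] so the result can be inspected.
    static int[] sortedEvenSquares(int bound) {
        return IntStream.range(0, bound)
                .map(x -> x * x)
                .filter(x -> x % 2 == 0)
                .sorted()
                .toArray();
    }

    // Counterpart of integers().map(x -> x*x).limit(100).sum(),
    // with the infinite stream assumed to start at 0.
    static int sumOfFirstSquares(int count) {
        return IntStream.iterate(0, x -> x + 1)
                .map(x -> x * x)
                .limit(count)
                .sum();
    }
}
```

Every stage here takes and produces int values, so nothing is boxed anywhere in the chain.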

We've currently got a (mostly) complete implementation of integer 
streams.  The actual operation implementations are surprisingly thin, 
and many can share significant code across stream types (e.g., there's 
one implementation of MatchOp, with relatively small adapters for 
Of{Reference,Int,..}).  Most of the code bloat is in the 
internal supporting classes (such as the internal Node classes we use to 
build conc trees) and the spillover into public interfaces 
(PrimitiveIterator.Of{Int,Long,Double}).
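As a small illustration of why the spillover into public interfaces is hard to avoid, here is a sketch using PrimitiveIterator.OfInt as it exists in Java 8 (the draft interface may have differed). The class and method names are hypothetical.

```java
import java.util.PrimitiveIterator;
import java.util.stream.IntStream;

final class PrimitiveIteratorSketch {
    // Walks an int range through the specialized iterator. nextInt()
    // returns a primitive int; the inherited next() would return a
    // boxed Integer instead.
    static long sumViaIterator(int fromInclusive, int toExclusive) {
        PrimitiveIterator.OfInt it =
                IntStream.range(fromInclusive, toExclusive).iterator();
        long total = 0;
        while (it.hasNext()) {
            total += it.nextInt();
        }
        return total;
    }
}
```

A plain Iterator<Integer> has no way to expose nextInt(), which is why a separate public type is needed per primitive.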

Historically we've shied away from giving users useful tools for 
operating on primitives because we were afraid of the combinatorial 
explosion: IntList, IntArrayList, DoubleSortedSynchronizedTreeList, etc. 
While the explosion exists with streams too, we've managed to limit it 
to something that is tolerable, and can finally give users some useful 
tools for working with numeric calculations.

We've already limited the explosion to just doing int/long/double 
instead of the full eight.  We could pare further to just long/double, 
since ints can fit easily into longs and most processors are 64-bit at 
this point anyway.
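The long/double-only option rests on int-to-long widening being lossless. A minimal sketch of that, using IntStream.asLongStream() from the Java 8 API (an assumption relative to the draft; the class and method names are hypothetical):

```java
import java.util.stream.IntStream;

final class WideningSketch {
    // Sum computed in int arithmetic; wraps around on overflow.
    static int intSum(int... values) {
        return IntStream.of(values).sum();
    }

    // Same values widened to long first; every int is exactly
    // representable as a long, so nothing is lost in the conversion.
    static long longSum(int... values) {
        return IntStream.of(values).asLongStream().sum();
    }
}
```

With the inputs Integer.MAX_VALUE and 1, intSum wraps to Integer.MIN_VALUE while longSum returns 2147483648, so doing the work in longs also removes one source of overflow bugs.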