Primitive streams

Joe Bowbeer joe.bowbeer at gmail.com
Sat Dec 29 10:36:26 PST 2012


> (simulating with LongStream, as we do for short, byte, and char)

Because the other primitive types listed are all currently simulated with
*IntStream*, dropping IntStream would be especially painful in terms of
memory footprint.

More painful is that there are many methods (not to mention array indexing)
that are tied to the int primitive type, and longs won't fit there without
some sort of explicit down-cast, which will probably be of concern to
static analysis tools (FindBugs).

Joe


On Sat, Dec 29, 2012 at 10:25 AM, Brian Goetz <brian.goetz at oracle.com> wrote:

> Summary: Remi says:
>
>  - Yes, unfortunately we need primitive streams
>  - Given primitive streams, fused ops are just extra complexity
>  - Dropping IntStream (simulating with LongStream, as we do for short,
> byte, and char) is a questionable economy
>
> Other opinions?
>
>
> On 12/29/2012 1:00 PM, Remi Forax wrote:
>
>> On 12/28/2012 06:55 PM, Brian Goetz wrote:
>>
>>> The implementation currently has two versions of streams, reference
>>> and integer.  Let's checkpoint on the primitive specialization
>>> strategy, since it does result in a fair amount of code and API bloat
>>> (though not as bad as it looks, since many of the currently public
>>> abstractions will be made private.)
>>>
>>> So, let's start with the argument for specialized streams at all.
>>>
>>> 1.  Boxing costs.  Doing calculations like "sum of squares" in boxed
>>> world is awful:
>>>
>>>   int sumOfWeights = foos.map(Foo::weight).reduce(0, Integer::sum);
>>>
>>> Here, all the weights will be boxed and unboxed just to add them up.
>>> Figure a 10x performance hit for that in the (many) cases where the VM
>>> doesn't save us.
>>>
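[The boxing penalty Brian describes can be made concrete. A minimal sketch using the java.util.stream names that eventually shipped in Java 8, where mapToInt plays the role of the draft API's specialized map, and a plain Integer list stands in for the hypothetical foos/Foo::weight:]

```java
import java.util.List;

public class BoxingDemo {
    public static void main(String[] args) {
        List<Integer> weights = List.of(1, 2, 3, 4);

        // Boxed reduction: each element is unboxed, added, and the
        // running total re-boxed on every step.
        int boxedSum = weights.stream().reduce(0, Integer::sum);

        // Primitive specialization: mapToInt switches to an IntStream,
        // so the accumulation runs entirely on unboxed ints.
        int primitiveSum = weights.stream().mapToInt(Integer::intValue).sum();

        System.out.println(boxedSum + " " + primitiveSum); // 10 10
    }
}
```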
>>> It is possible to mitigate this somewhat by having fused mapReduce
>>> methods, which we tried early on, such as:
>>>
>>>   foos.mapReduce(Foo::getWeight, 0, Integer::sum)
>>>
>>> Here, at least now all the reduction is happening in the unboxed
>>> domain.  But the API is now nastier, and while the above is readable,
>>> it gets worse in less trivial examples where there are more mapper and
>>> reducer lambdas being passed as arguments and it's not obvious which is
>>> which. Plus the explosion of mapReduce forms: { Obj,int,long,double }
>>> x { reduce forms }.  Plus the combination of map, reduce, and fused
>>> mapReduce leaves users wondering when they should do which.  All to
>>> work around boxing.
>>>
>>> This can be further mitigated by specialized fused operations for the
>>> most common reductions: sumBy(IntMapper), maxBy(IntMapper), etc.
>>> (Price: more overloads, more "when do I use what" confusion.)
>>>
>>> So, summary so far: we can mitigate boxing costs by cluttering the API
>>> with lots of extra methods.  (But I don't think that gets us all the
>>> way.)
>>>
>>
>> But given that the inference algorithm and the lambda conversion
>> algorithm don't consider Integer as a boxed int (unlike applicable
>> method resolution by example),
>> we need IntFunction, IntOperator, etc.
>> If we have these specialized function interfaces, then having specialized
>> streams is not really a choice.
>> The choice was made long before, when the lambda EG decided how
>> inference/lambda conversion work.
>>
>> Now, I dislike fused operations because they go against the DRY
>> principle: the stream interface should be as simple as possible, so an
>> operation should never be a compound of several others. And while the
>> pipeline can hardly optimize boxed operations into primitive ones,
>> fusing operations for performance inside the pipeline implementation is
>> easy.
>> So instead of stream.sumBy(IntMapper), we already have
>> stream.map(IntMapper).sum(). If the pipeline prefers to use a fused
>> operation, that's an implementation detail.
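[Remi's composed form is essentially what shipped. A sketch with the final Java 8 API names, where mapToInt stands in for the draft map(IntMapper), and Foo is a hypothetical element type introduced here for illustration:]

```java
import java.util.List;

public class ComposedSum {
    record Foo(int weight) {}

    public static void main(String[] args) {
        List<Foo> foos = List.of(new Foo(2), new Foo(3), new Foo(5));

        // Composed pipeline: no fused sumBy(IntMapper) overload needed.
        // The library remains free to fuse map+sum internally as an
        // optimization without exposing it in the API.
        int total = foos.stream().mapToInt(Foo::weight).sum();

        System.out.println(total); // 10
    }
}
```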
>>
>>
>>>
>>> 2.  Approachability.  Telling Java developers that the way to add up a
>>> bunch of numbers is to first recognize that integers form a monoid is
>>> likely to make them feel like the guy in this cartoon:
>>>
>>>   http://howfuckedismydatabase.com/nosql/
>>>
>>> Reduce is wonderful and powerful and going to confuse the crap out of
>>> 80+% of Java developers.  (This was driven home to me dramatically
>>> when I went on the road with my "Lambdas for Java" talk and saw blank
>>> faces when I got to "reduce", even from relatively sophisticated
>>> audiences. It took a lot of tweaking -- and explaining -- to get it to
>>> the point where I didn't get a room full of blank stares.)
>>>
>>> Simply put: I believe the letters "s-u-m" have to appear prominently
>>> in the API.  When people are ready, they can learn to see reduce as a
>>> generalization of sum(), but not until they're ready.  Forcing them to
>>> learn reduce() prematurely will hurt adoption.  (The sumBy approach
>>> above helps here too, again at a cost.)
>>>
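[The "learn reduce when ready" path is easy to show once a primitive stream exists: sum() and the general reduction it specializes give the same answer, and neither boxes. A sketch with the final API names:]

```java
import java.util.stream.IntStream;

public class SumVsReduce {
    public static void main(String[] args) {
        // The approachable spelling: s-u-m appears in the API.
        int viaSum = IntStream.rangeClosed(1, 10).sum();

        // The general reduction it specializes. No boxing either way,
        // since both run on the primitive IntStream.
        int viaReduce = IntStream.rangeClosed(1, 10).reduce(0, Integer::sum);

        System.out.println(viaSum + " " + viaReduce); // 55 55
    }
}
```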
>>
>> yes, we need sum.
>>
>>
>>>
>>> 3.  Numerics.  Adding up doubles is not as simple as reducing with
>>> Double::sum (unless you don't care about accuracy.)  Having methods
>>> for numeric sums gives us a place to put such intelligence; general
>>> reduce does not.
>>>
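[One concrete instance of the "intelligence" a dedicated numeric sum can hide is compensated (Kahan) summation, which recovers low-order bits that a naive left-to-right reduce with Double::sum throws away. A sketch of the technique, not the actual JDK implementation:]

```java
public class KahanSum {
    // Compensated (Kahan) summation: carries a correction term so that
    // low-order bits lost by each naive addition are folded back in.
    static double kahanSum(double[] values) {
        double sum = 0.0;
        double c = 0.0; // running compensation for lost low-order bits
        for (double v : values) {
            double y = v - c;        // apply the previous correction
            double t = sum + y;      // big + small: low bits of y are lost
            c = (t - sum) - y;       // recover what sum failed to absorb
            sum = t;
        }
        return sum;
    }

    public static void main(String[] args) {
        // 1e16 followed by a thousand 1.0s: each 1.0 is below the
        // rounding granularity of the running total, so a naive
        // reduction never sees them.
        double[] values = new double[1001];
        values[0] = 1e16;
        for (int i = 1; i < values.length; i++) values[i] = 1.0;

        double naive = 0.0;
        for (double v : values) naive += v;

        System.out.println(naive);            // 1.0E16 -- the 1.0s vanish
        System.out.println(kahanSum(values)); // recovers 1.0E16 + 1000
    }
}
```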
>>
>> I'm always afraid when someone tries to put "intelligence" in a program.
>> We never have the same idea of what it should be.
>>
>>
>>>
>>> 4.  "Primitives all the way down".  While fused+specialized methods
>>> will mitigate many of the above, it only helps at the very end of the
>>> chain.  It doesn't help things farther up, where we often just want to
>>> generate streams of integers and operate on them as integers.  Like:
>>>
>>>   intRange(0, 100).map(...).filter(...).sorted().forEach(...)
>>>
>>> or
>>>
>>>   integers().map(x -> x*x).limit(100).sum()
>>>
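[Both draft pipelines above map directly onto the API that eventually shipped: intRange became IntStream.range, and the hypothetical integers() source is expressed with an explicit seed and step via IntStream.iterate. A sketch (range shortened to keep the output small):]

```java
import java.util.stream.IntStream;

public class PrimitivePipelines {
    public static void main(String[] args) {
        // intRange(0, 100).map(...).filter(...).sorted().forEach(...)
        // in the final API, primitives all the way down:
        IntStream.range(0, 10)
                 .map(i -> i * 2)
                 .filter(i -> i % 3 == 0)
                 .sorted()
                 .forEach(System.out::println); // 0 6 12 18

        // integers().map(x -> x*x).limit(100).sum(), with iterate
        // supplying the infinite source of integers:
        int sumOfSquares = IntStream.iterate(1, i -> i + 1)
                                    .map(x -> x * x)
                                    .limit(100)
                                    .sum();
        System.out.println(sumOfSquares); // 1^2 + ... + 100^2 = 338350
    }
}
```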
>>>
>>>
>>> We've currently got a (mostly) complete implementation of integer
>>> streams.  The actual operation implementations are surprisingly thin,
>>> and many can share significant code across stream types (e.g., there's
>>> one implementation of MatchOp, with relatively small adapters for
>>> Of{Reference,Int,..}).  Most of the code bloat is in the
>>> internal supporting classes (such as the internal Node classes we use
>>> to build conc trees) and the spillover into public interfaces
>>> (PrimitiveIterator.Of{Int,Long,Double}).
>>>
>>
>> Correct me if I'm wrong, but PrimitiveIterator is only here because of
>> the escape hatch; it's a huge cost for something that will not be used
>> very often.
>> I'm not sure we should provide these public interfaces now; we could
>> wait and see if we can do better for Java 9.
>>
>>
>>> Historically we've shied away from giving users useful tools for
>>> operating on primitives because we were afraid of the combinatorial
>>> explosion: IntList, IntArrayList, DoubleSortedSynchronizedTreeLi**st,
>>> etc.  While the explosion exists with streams too, we've managed to
>>> limit it to something that is tolerable, and can finally give users
>>> some useful tools for working with numeric calculations.
>>>
>>>
>>> We've already limited the explosion to just doing int/long/double
>>> instead of the full eight.  We could pare further to just long/double,
>>> since ints can fit easily into longs and most processors are 64-bit at
>>> this point anyway.
>>>
>>>
>> Processors are 64-bit, but using ints is still faster than longs because
>> there is less bus traffic.
>>
>> Rémi
>>
>>


More information about the lambda-libs-spec-observers mailing list