Streams -- philosophy

Brian Goetz brian.goetz at oracle.com
Mon Dec 31 08:18:41 PST 2012


I'd like to take a quick step back to go over some of the philosophical 
goals for the streams library.

1.  Put the programming model front and center.  Bulk computations with 
streams will not be, in most cases, the absolutely most performant way 
to do anything.  And that's OK.  The goal here is to give a clean way to 
compose complex computations from simple building blocks that work with 
any data source.  We're willing to give up a little potential 
performance for the sake of making a programming model which is clean, 
expressive, and orthogonal.  (And, our performance data so far suggests 
that we're not giving up so much as to be worrisome.)  For a significant 
fraction of the Java code in the world, performance is nowhere near the 
#1 consideration -- a lot of code is plenty fast already.  In these 
cases, being able to express things cleanly, clearly, and in a less 
error-prone manner yields far more value than making it faster.
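
As an illustration of composing a computation from simple building blocks, here is a minimal sketch. It assumes the java.util.stream API as it later shipped in Java 8, which may differ from the API under discussion when this was written; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class ComposeExample {
    // Filter, transform, sort, and collect: small building blocks composed
    // into one computation. Any List works as the source.
    static List<String> shortWordsUpper(List<String> words, int maxLen) {
        return words.stream()
                    .filter(w -> w.length() <= maxLen)   // keep short words
                    .map(String::toUpperCase)            // transform each one
                    .sorted()                            // natural ordering
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("map", "filter", "reduce", "spliterator");
        System.out.println(shortWordsUpper(words, 6));
    }
}
```

The equivalent hand-written loop might well be a bit faster; the point of the sketch is only that each stage states one idea.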

2.  Use data where it lives.  Users should be able to use existing data 
sources to feed stream computations, without having to reason 
excessively about their characteristics, or do much work to transform 
them into stream sources.  Existing Collections should just work. 
Again, this generality has a cost, which we're willing to pay.  Even 
non-thread-safe collections like ArrayList should permit parallel 
traversal, so long as the user's computation meets the non-interference 
guidelines (i.e., don't mutate the source during the traversal, don't 
provide lambdas that are dependent on state that is modified during the 
traversal.)  We're in the process of formalizing these non-interference 
requirements.
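
A sketch of what meeting the non-interference guidelines can look like with a plain ArrayList. It assumes the Java 8 API as released (parallelStream, mapToLong); the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class NonInterferenceExample {
    static long sumOfSquares(List<Integer> source) {
        // The lambda only reads its argument. It neither mutates `source`
        // nor depends on state that changes during the traversal.
        return source.parallelStream()
                     .mapToLong(x -> x.longValue() * x)
                     .sum();
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();   // not thread-safe, and that is fine here
        for (int i = 1; i <= 1000; i++) {
            data.add(i);
        }
        System.out.println(sumOfSquares(data));

        // By contrast, something like
        //     data.parallelStream().forEach(x -> data.add(x));
        // mutates the source during the traversal and breaks the guidelines.
    }
}
```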

3.  Easy onramp to parallelism.  Until now, the serial and parallel 
expressions of a computation looked dramatically different, and it was a 
lot of work to go from serial to parallel.  And, it was tricky, meaning 
users would make mistakes or avoid trying.  This work should make it 
easy for developers to add parallelism to stream computations without 
major changes to their code.  I hold out no hope that our 
general-purpose approach will ever beat the best hand-tuned code, and 
that's fine.  The goal is to give users an attractive cost-benefit 
equation for parallelism; do a trivial amount of work (not quite limited 
to typing ".parallel()", but close), and get a reasonable amount of 
parallelism for almost no cost -- while minimally perturbing the source 
code.
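
To make the "minimally perturbing" claim concrete, here is a hedged sketch in which the serial and parallel versions differ by a single call. It assumes IntStream and parallel() as they appear in released Java 8; the prime-counting workload and all names are illustrative only.

```java
import java.util.stream.IntStream;

final class OnrampExample {
    // Simple trial-division primality test; stateless, so safe to run in parallel.
    static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int d = 2; (long) d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }

    static long countPrimesSerial(int limit) {
        return IntStream.rangeClosed(1, limit)
                        .filter(OnrampExample::isPrime)
                        .count();
    }

    static long countPrimesParallel(int limit) {
        return IntStream.rangeClosed(1, limit)
                        .parallel()                      // the only change
                        .filter(OnrampExample::isPrime)
                        .count();
    }

    public static void main(String[] args) {
        System.out.println(countPrimesSerial(100_000) == countPrimesParallel(100_000));
    }
}
```

Whether the parallel version is actually faster depends on the workload and the machine; the sketch shows only that the source change is small.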

4.  Make a clean break from Old Collections / a bridge to New 
Collections.  The Collections framework was about providing a basic set 
of building blocks for data structures.  Streams is about providing a 
set of building blocks for computation, that is completely divorced from 
the underlying data structure.  Collections were huge in 1997, but 
they're starting to show their age.  We will eventually have to do New 
Collections, for one of any number of reasons: 32-bit size limitation, 
lack of reification, pervasive mutability, take your pick.  We would 
like for Streams to easily fit into those new collections, without tying 
them to Old Collections.  So, it is a goal to keep 
Collection/List/Set/Map out of the core API.

It's bad enough that we are exposing Iterator (Doug and I have been 
looking for alternatives, but so far, for all of them, the cure is worse 
than the disease).  The proximate impetus for the Tabulators work was to get 
Map out of the API (as it turned out, the result was far more powerful 
and expressive than what we started with, which is a nice bonus).
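
A sketch of the idea of keeping Map out of the stream interface while still producing one. It uses java.util.stream.Collectors from released Java 8; treating Collectors as the eventual form of the Tabulators work is an assumption here, not something this message states. The names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class GroupingExample {
    static Map<Integer, List<String>> byLength(List<String> words) {
        // Map shows up only in the result type chosen by the collector;
        // the stream itself has no Map-specific method.
        return words.stream()
                    .collect(Collectors.groupingBy(String::length,
                                                   TreeMap::new,
                                                   Collectors.toList()));
    }

    public static void main(String[] args) {
        System.out.println(byLength(Arrays.asList("a", "bb", "cc", "d")));
    }
}
```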

5.  Balance between serial and parallel use cases.  I get about an equal 
amount of mail suggesting that supporting the {serial,parallel} scenario 
is distorting the API for the "far more important" {parallel,serial} 
scenario.  Neither camp of extremists will be satisfied here.  We 
experimented early on with separate abstractions for serial and parallel 
streams; the resulting API was byzantine.  (Summary: "OMG too many 
interfaces")  Having one abstraction for both serial and parallel is 
overall a pretty big simplicity win, though it definitely does put 
pressure on the peculiarities of each.  (Doug wants us to get rid of the 
stateful intermediate operations (sorted, removeDuplicates, limit) 
because they compose badly in parallel.  Sam rolls his eyes every time I 
say "but that only works sequentially".)

Where we are in history is that the sequential scenarios are still 
important (and will continue to be for some time), but over time, the 
parallel scenarios will become more important.  Designing something that 
is sequential-centric today would be backward-looking; designing 
something that is parallel-centric today is not useful to a broad slice 
of the user base.  Expect continued tension at the edges as we try to 
balance the needs of both usages (and slowly educate the world about how 
parallel differs from serial.)

6.  Have a path to an open system.  Anyone can create a parallel Stream 
by creating a Spliterator for their data structure (or, by providing an 
Iterator, and letting us split that, albeit less efficiently.)  That's a 
good start.
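
A minimal sketch of the Iterator route, assuming the Spliterators and StreamSupport helpers from released Java 8 (the API at the time of writing may have looked different). The countTo source is a hypothetical stand-in for a user's own data structure.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

final class IteratorSourceExample {
    // A hypothetical Iterator-only data source: the integers 1..n.
    static Iterator<Integer> countTo(int n) {
        return new Iterator<Integer>() {
            private int next = 1;
            @Override public boolean hasNext() { return next <= n; }
            @Override public Integer next() {
                if (next > n) throw new NoSuchElementException();
                return next++;
            }
        };
    }

    // Wrap the Iterator in a Spliterator and ask for a parallel stream.
    // The library does the splitting, less efficiently than a
    // purpose-built Spliterator could.
    static Stream<Integer> parallelStreamOf(Iterator<Integer> it) {
        Spliterator<Integer> sp =
            Spliterators.spliteratorUnknownSize(it, Spliterator.ORDERED);
        return StreamSupport.stream(sp, true);
    }

    static int sumTo(int n) {
        return parallelStreamOf(countTo(n)).mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        System.out.println(sumTo(100));
    }
}
```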

The Stream API is designed around an extensible set of intermediate and 
terminal operations, which uses an "SPI" to define IntermediateOp, 
StatefulOp, and TerminalOp.  While we do not plan to expose this SPI in 
8 (purely a triage decision; it's not ready to stamp into concrete), we 
want to expose it as soon as practical.  The internal pipeline(XxxOp) 
methods would be exposed, and users who create new ops could integrate 
them into existing pipelines like:

   list.stream()
       .filter(...)
       .pipeline(dropEverySecondElement())
       .map(...)
       ...

The "pipeline" method is the escape hatch to add new stages into the 
pipeline (this is actually how all ops are implemented internally now). 
Users would then be free to create new intermediate or terminal 
operations and thread them into pipelines with pipeline(op).

7.  Parallelism is explicit.  We don't want to inflict parallelism on 
anyone who doesn't ask for it; retrofitting parallelism transparently in 
Java is likely to be as successful as retrofitting remoteness 
transparently into method invocations.  Our guideline is that the word 
"parallel" must appear somewhere, as in:

   list.parallelStream()
   Arrays.parallelSort(array)
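
The two calls above, placed in a compilable sketch. Both List.parallelStream() and Arrays.parallelSort(int[]) exist in released Java 8; the surrounding class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class ExplicitParallelExample {
    // The caller opts in by name: "parallel" appears in the method called.
    static int[] sortedCopy(int[] input) {
        int[] copy = Arrays.copyOf(input, input.length);
        Arrays.parallelSort(copy);
        return copy;
    }

    static long countEven(List<Integer> list) {
        return list.parallelStream()
                   .filter(x -> x % 2 == 0)
                   .count();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sortedCopy(new int[] {3, 1, 2})));
        System.out.println(countEven(Arrays.asList(1, 2, 3, 4)));
    }
}
```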


More information about the lambda-libs-spec-observers mailing list