Streams -- philosophy
Brian Goetz
brian.goetz at oracle.com
Mon Dec 31 08:18:41 PST 2012
I'd like to take a quick step back to go over some of the philosophical
goals for the streams library.
1. Put the programming model front and center. Bulk computations with
streams will not be, in most cases, the absolutely most performant way
to do anything. And that's OK. The goal here is to give a clean way to
compose complex computations from simple building blocks that work with
any data source. We're willing to give up a little potential
performance for the sake of making a programming model which is clean,
expressive, and orthogonal. (And, our performance data so far suggests
that we're not giving up so much as to be worrisome.) For a significant
fraction of the Java code in the world, performance is nowhere near the
#1 consideration -- a lot of code is plenty fast already. In these
cases, being able to express things cleanly, clearly, and in a less
error-prone manner yields far more value than making it faster.
2. Use data where it lives. Users should be able to use existing data
sources to feed stream computations, without having to reason
excessively about their characteristics, or do much work to transform
them into stream sources. Existing Collections should just work.
Again, this generality has a cost, which we're willing to pay. Even
non-thread-safe collections like ArrayList should permit parallel
traversal, so long as the user's computation meets the non-interference
guidelines (i.e., don't mutate the source during the traversal, and
don't provide lambdas that depend on state that is modified during the
traversal). We're in the process of formalizing these non-interference
requirements.
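
As a rough illustration of the non-interference guideline, here is a
minimal sketch written against the Stream API as it eventually shipped
in Java 8, which may differ from the draft under discussion here. The
class and method names are hypothetical. The source is a plain
ArrayList, nothing mutates it during the traversal, and the lambda
reads only its own argument.

```java
import java.util.ArrayList;
import java.util.List;

final class NonInterferenceSketch {

    // Sums the squares of the elements in parallel.
    // The source list is only read, never modified, while the stream runs,
    // and the lambda depends on nothing but its own argument.
    static long sumOfSquares(List<Integer> source) {
        return source.parallelStream()
                     .mapToLong(x -> x.longValue() * x.longValue())
                     .sum();
    }

    public static void main(String[] args) {
        // ArrayList is not thread-safe, but parallel traversal is fine
        // as long as the computation does not interfere with the source.
        List<Integer> numbers = new ArrayList<>();
        for (int i = 1; i <= 1000; i++) {
            numbers.add(i);
        }
        System.out.println(sumOfSquares(numbers));
    }
}
```

A lambda that appended to "numbers" or read a counter updated by
another thread would violate the guideline, and its result would be
unpredictable.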
3. Easy onramp to parallelism. Until now, the serial and parallel
expressions of a computation looked dramatically different, and it was a
lot of work to go from serial to parallel. And, it was tricky, meaning
users would make mistakes or avoid trying. This work should make it
easy for developers to add parallelism to stream computations without
major changes to their code. I hold out no hope that our
general-purpose approach will ever beat the best hand-tuned code, and
that's fine. The goal is to give users an attractive cost-benefit
equation for parallelism: do a trivial amount of work (not quite limited
to typing ".parallel()", but close), and get a reasonable amount of
parallelism for almost no cost -- while minimally perturbing the source
code.
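
To make the "minimally perturbing" point concrete, the sketch below
shows a serial and a parallel version of the same computation. It
assumes the API as released in Java 8 (stream() and parallelStream() on
Collection). The class and method names are hypothetical. The two
methods differ only in how the stream is obtained.

```java
import java.util.Arrays;
import java.util.List;

final class OnrampSketch {

    // Serial version: counts the words longer than four characters.
    static long countLongWordsSerial(List<String> words) {
        return words.stream()
                    .filter(w -> w.length() > 4)
                    .count();
    }

    // Parallel version: same pipeline, only the source call changes.
    static long countLongWordsParallel(List<String> words) {
        return words.parallelStream()
                    .filter(w -> w.length() > 4)
                    .count();
    }

    public static void main(String[] args) {
        List<String> words =
            Arrays.asList("stream", "map", "filter", "sum", "collect");
        System.out.println(countLongWordsSerial(words));
        System.out.println(countLongWordsParallel(words));
    }
}
```

Both methods return the same count for the same input. Whether the
parallel one is actually faster depends on the data size and the cost
of the per-element work.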
4. Make a clean break from Old Collections / a bridge to New
Collections. The Collections framework was about providing a basic set
of building blocks for data structures. Streams is about providing a
set of building blocks for computation that is completely divorced from
the underlying data structure. Collections were huge in 1997, but
they're starting to show their age. We will eventually have to do New
Collections, for one of any number of reasons: 32-bit size limitation,
lack of reification, pervasive mutability, take your pick. We would
like for Streams to easily fit into those new collections, without tying
them to Old Collections. So, it is a goal to keep
Collection/List/Set/Map out of the core API.
It's bad enough that we are exposing Iterator (Doug and I have been
looking for alternatives, but so far, in every case, the cure is worse
than the disease). The proximate impetus for the Tabulators work was to get
Map out of the API (as it turned out, the result was far more powerful
and expressive than what we started with, which is a nice bonus).
5. Balance between serial and parallel use cases. I get about an equal
amount of mail suggesting that supporting the {serial,parallel} scenario
is distorting the API for the "far more important" {parallel,serial}
scenario. Neither camp of extremists will be satisfied here. We
experimented early on with separate abstractions for serial and parallel
streams; the resulting API was byzantine. (Summary: "OMG too many
interfaces") Having one abstraction for both serial and parallel is
overall a pretty big simplicity win, though it definitely does put
pressure on the peculiarities of each. (Doug wants us to get rid of the
stateful intermediate operations (sorted, removeDuplicates, limit)
because they compose badly in parallel. Sam rolls his eyes every time I
say "but that only works sequentially".)
Where we are in history is that the sequential scenarios are still
important (and will continue to be for some time), but over time, the
parallel scenarios will become more important. Designing something that
is sequential-centric today would be backward-looking; designing
something that is parallel-centric today is not useful to a broad slice
of the user base. Expect continued tension at the edges as we try to
balance the needs of both usages (and slowly educate the world about how
parallel differs from serial.)
6. Have a path to an open system. Anyone can create a parallel Stream
by creating a Spliterator for their data structure (or, by providing an
Iterator, and letting us split that, albeit less efficiently.) That's a
good start.
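
For illustration, here is a minimal sketch of the Iterator route, using
the names this machinery ended up with in the released Java 8 API
(Spliterators.spliteratorUnknownSize and StreamSupport.stream). Those
names are not part of the draft described in this message, and the
class and method names below are hypothetical.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

final class IteratorSourceSketch {

    // Wraps an arbitrary Iterator in a Spliterator and builds a parallel
    // stream on top of it. The library splits the iterator by buffering
    // batches of elements, which is less efficient than a Spliterator
    // written for the data structure itself.
    static <T> Stream<T> parallelStreamOf(Iterator<T> iterator) {
        Spliterator<T> spliterator =
            Spliterators.spliteratorUnknownSize(iterator, Spliterator.ORDERED);
        return StreamSupport.stream(spliterator, true); // true = parallel
    }

    public static void main(String[] args) {
        Iterator<Integer> source = Arrays.asList(1, 2, 3, 4).iterator();
        int total = parallelStreamOf(source)
            .mapToInt(Integer::intValue)
            .sum();
        System.out.println(total);
    }
}
```

A data structure that knows its own layout (an array or a balanced
tree, say) would normally supply its own Spliterator so that splits are
cheap and evenly sized.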
The Stream API is designed around an extensible set of intermediate and
terminal operations, which uses an "SPI" to define IntermediateOp,
StatefulOp, and TerminalOp. While we do not plan to expose this SPI in
Java 8 (purely a triage decision; it's not ready to stamp into concrete), we
want to expose it as soon as practical. The internal pipeline(XxxOp)
methods would be exposed, and users who create new ops could integrate
them into existing pipelines like:
    list.stream()
        .filter(...)
        .pipeline(dropEverySecondElement())
        .map(...)
        ...
The "pipeline" method is the escape hatch to add new stages into the
pipeline (this is actually how all ops are implemented internally now.)
Users would then be free to create new intermediate or terminal
operations and thread them into pipelines with pipeline(op).
7. Parallelism is explicit. We don't want to inflict parallelism on
anyone who doesn't ask for it; retrofitting parallelism transparently in
Java is likely to be as successful as retrofitting remoteness
transparently into method invocations. Our guideline is that the word
"parallel" must appear somewhere, as in:
    list.parallelStream()
    Arrays.parallelSort(array)
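
A small runnable sketch of the two calls above, assuming the API as it
shipped in Java 8. The class and method names are hypothetical. In each
case the word "parallel" appears at the call site, so nobody gets
parallel execution without asking for it.

```java
import java.util.Arrays;
import java.util.List;

final class ExplicitParallelSketch {

    // Returns a sorted copy of the input; the caller opts in to
    // parallelism by name via Arrays.parallelSort.
    static int[] sortedCopy(int[] input) {
        int[] copy = Arrays.copyOf(input, input.length);
        Arrays.parallelSort(copy);
        return copy;
    }

    // Sums a list in parallel; again "parallel" is spelled out.
    static int total(List<Integer> list) {
        return list.parallelStream()
                   .mapToInt(Integer::intValue)
                   .sum();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sortedCopy(new int[] {3, 1, 2})));
        System.out.println(total(Arrays.asList(1, 2, 3, 4)));
    }
}
```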
More information about the lambda-libs-spec-observers
mailing list