RFR: JDK-8277095 : Empty streams create too many objects [v2]
John R Rose
jrose at openjdk.org
Thu Jul 21 02:12:14 UTC 2022
On Tue, 16 Nov 2021 20:53:26 GMT, kabutz <duke at openjdk.org> wrote:
>> This is a draft proposal for how we could improve stream performance for the case where the streams are empty. Empty collections are common-place. If we iterate over them with an Iterator, we would have to create one small Iterator object (which could often be eliminated) and if it is empty we are done. However, with Streams we first have to build up the entire pipeline, until we realize that there is no work to do. With this example, we change Collection#stream() to first check if the collection is empty, and if it is, we simply return an EmptyStream. We also have EmptyIntStream, EmptyLongStream and EmptyDoubleStream. We have taken great care for these to have the same characteristics and behaviour as the streams returned by Stream.empty(), IntStream.empty(), etc.
>>
>> Some of the JDK tests fail with this, due to ClassCastExceptions (our EmptyStream is not an AbstractPipeline) and AssertionError, since we can call some methods repeatedly on the stream without it failing. On the plus side, creating a complex stream on an empty stream gives us upwards of 50x increase in performance due to a much smaller object allocation rate. This PR includes the code for the change, unit tests and also a JMH benchmark to demonstrate the improvement.
>
> kabutz has updated the pull request incrementally with one additional commit since the last revision:
>
> Refactored empty stream implementations to reduce duplicate code and improved unordered()
> Added StreamSupport.empty for primitive spliterators and use that in Arrays.stream()
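The core idea in the quoted PR description can be sketched in a few lines. This is a minimal illustration, not the PR's actual code: the real change returns dedicated EmptyStream/EmptyIntStream/etc. implementations, while here plain Stream.empty() stands in for them, and the helper name `stream` is hypothetical.

```java
import java.util.Collection;
import java.util.List;
import java.util.stream.Stream;

public class EmptyStreamSketch {
    // Hypothetical helper showing the PR's short-circuit: check isEmpty()
    // before building any pipeline. The PR itself returns a dedicated
    // EmptyStream; Stream.empty() is a stand-in here.
    static <T> Stream<T> stream(Collection<T> c) {
        return c.isEmpty() ? Stream.empty() : c.stream();
    }

    public static void main(String[] args) {
        // For an empty source, no spliterator over the collection is created.
        long n = stream(List.of()).filter(x -> true).count();
        System.out.println(n); // prints 0
    }
}
```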
I agree it’s the “kind of” optimization that would be nice. “Kind of”. Personally I would be happier to see complexity like this added that would help a larger class of common streams.
It’s a harder problem, and I know this is a case of “the best is the enemy of the good”, but I think a stream which has less content bulk than pipeline phases (according to some heuristic weighting) might possibly be handled better by dumping the elements into an Object array and running each phase in sequence over that array, rather than composing a “net result of all phases” object and then running it over the few elements. Stream object creation can be reduced, perhaps, by building the stream around a small internal buffer that collects pipeline phases (and their lambdas), by side effect. The terminal operation runs this Rube-Goldberg contraption in an interpretive manner over the elements. An advantage would arise if the contraption were smaller and simpler than a fully-composed stream of today, and the optimizations lost by having an interpreter instead of a specialized object were insignificant due to the small bulk of the stream source.
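The “collect phases into a small buffer, interpret them at the terminal operation” idea above might look roughly like this. This is a hypothetical sketch, not anything from the PR or the JDK: the class name, the SKIP sentinel, and the use of a single mutable phase list are all illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch: intermediate ops record themselves by side effect
// in a small list instead of composing a new pipeline object per stage;
// the terminal op interprets the recorded phases over the few elements.
class TinyPipeline {
    private static final Object SKIP = new Object(); // filter-rejection marker
    private final Object[] source;
    private final List<Function<Object, Object>> phases = new ArrayList<>();

    TinyPipeline(Object... source) { this.source = source; }

    TinyPipeline map(Function<Object, Object> f) {
        phases.add(f);
        return this; // same object back: no per-stage allocation
    }

    TinyPipeline filter(Predicate<Object> p) {
        phases.add(x -> p.test(x) ? x : SKIP);
        return this;
    }

    long count() { // terminal op: run the phase list over each element
        long n = 0;
        outer:
        for (Object x : source) {
            for (Function<Object, Object> phase : phases) {
                x = phase.apply(x);
                if (x == SKIP) continue outer;
            }
            n++;
        }
        return n;
    }
}
```

An interpreter like this trades per-stage object composition for a loop over the phase list, which is plausibly a win only when the source has very few elements, as the text argues.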
If such an optimization really works, it would automatically handle the zero-elements case, but would also cover lots of use cases (in the JDK code even) where streams are used for their notational convenience, rather than their ability to process many elements efficiently.
I looked at this for a few days, several months ago. I solved enough problems to see that there is a long tail of difficulties in stream implementation! (Many are noted in this PR thread.) I noticed that some pipeline phases can expand the “bulk” (flatMap and friends). Bulk-reducing phases (filter) are not a problem for already-small streams. (These issues do not arise for truly empty streams.) For expansion there would have to be an option to inflate a previously-small stream to use the general paths. Another issue is avoiding running any lambdas until the terminal operation, which means capturing them in some orderly fashion. Again, if a bizarre pipeline structure shows up, inflation is an option. And for truly empty streams some or all of the pipeline structure can be just dropped on the floor, as this PR proposes.
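The “inflation” escape hatch described above could be as simple as handing off to the general stream machinery the first time a bulk-expanding op appears. Again a hypothetical sketch: the class name `SmallStream` and its shape are invented for illustration.

```java
import java.util.Arrays;
import java.util.function.Function;
import java.util.stream.Stream;

// Hypothetical sketch: a small-stream facade that "inflates" to the
// general java.util.stream paths when it meets an operation (flatMap)
// that can grow the bulk arbitrarily.
class SmallStream<T> {
    private final T[] elems;

    SmallStream(T[] elems) { this.elems = elems; }

    // flatMap may expand the bulk, so bail out to a full Stream here
    <R> Stream<R> flatMap(Function<? super T, ? extends Stream<? extends R>> f) {
        return Arrays.stream(elems).flatMap(f); // hand off to the general path
    }
}
```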
In the end, I think the best leverage will come from a completely different set of techniques, from whatever allows us to do “loop customization”. By loop customization I mean the ability to compile the loop in the terminal operation separately for each “interestingly distinct” combination of pipeline components, in such a way that the loop can constant-fold their lambdas, the shape of the stream source, and anything else that affects loop code quality. That technique should apply well regardless of “bulk”, since the most complex object allocation would happen during specialization, which is a link-time operation, instead of every time a stream is created.
Current state of the art uses mega-inlining, which has complex limitations and failure modes. (It utterly fails for parallel streams.) When we get to specialized generics, I hope to take another run at the problem, so that the type of each pipeline lambda “feeds” into a specialization syndrome that the JVM “sees” distinct from other specializations, and can optimize separately. (Making streams be value classes could help also.) I guess all *that* might be “the impossible dream being the enemy of the best”.
Anyway, FWIW…
-------------
PR: https://git.openjdk.org/jdk/pull/6275
More information about the core-libs-dev mailing list