perspectives on streams performance

Aggelos Biboudis biboudis at gmail.com
Tue Mar 10 18:22:05 UTC 2015


Hi John,

Someone might find it useful, as a starting point for this investigation,
to take the high-level performance assessment we did recently -
http://biboudis.github.io/clashofthelambdas/ - and put a magnifying glass
over the Java cases. We have four simple JMH micro-benchmarks (with their
corresponding for-based versions as baselines) that could be enhanced.
From there, the equivalence between JIT-ted streams and hand-optimized
for-loops could be evaluated. What is mostly missing are type-polluted
pipelines, which would help us understand the costs of deoptimization.
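
A benchmark pair of that kind looks roughly like this - a minimal JMH
sketch, where the class name, array size and seed are chosen arbitrarily
for illustration:

    import java.util.Random;
    import java.util.stream.IntStream;
    import org.openjdk.jmh.annotations.*;

    @State(Scope.Thread)
    public class SumOfSquaresBench {

        int[] data;

        @Setup
        public void setup() {
            data = new Random(42).ints(10_000).toArray();
        }

        // for-based baseline: external iteration, loop body written by hand
        @Benchmark
        public long sumOfSquaresFor() {
            long acc = 0;
            for (int x : data) {
                acc += (long) x * x;
            }
            return acc;
        }

        // stream version: internal iteration over the same data
        @Benchmark
        public long sumOfSquaresStream() {
            return IntStream.of(data)
                            .mapToLong(x -> (long) x * x)
                            .sum();
        }
    }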

Kind regards,
Aggelos Biboudis.


On Fri, Mar 6, 2015 at 3:01 AM, John Rose <john.r.rose at oracle.com> wrote:

> In order to get the full benefit from JDK 8 streams we will need to make
> them optimize fully.  Here are a few thoughts about that.
>
> I think of streams as a more concise and orderly replacement of classic
> "for" loops.  Every stream can be rewritten as one or more for-loops, at
> the cost of verbosity and commitment to hard-coding optimizations (like FJ).
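>
> For instance, these two fragments compute the same count; the stream form
> just moves the looping machinery out of the user's code and into the
> library (the list contents are only for illustration):
>
>     java.util.List<String> lines =
>         java.util.Arrays.asList("ERROR: a", "ok", "ERROR: b");
>
>     // classic "for" loop: the iteration is written out by hand
>     long forCount = 0;
>     for (String s : lines) {
>         if (s.startsWith("ERROR")) forCount++;
>     }
>
>     // equivalent stream: only the loop body appears in user code
>     long streamCount = lines.stream()
>                             .filter(s -> s.startsWith("ERROR"))
>                             .count();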
>
> A classic "for" loop is an external iterator notation:  The iteration
> machinery is outside of (lexically around) the data structure access.
>  This notation is at least as old as Fortran.  Streams are an internal
> iteration notation:  The iteration machinery (crucially, the looping part
> of the algorithm) is inside the data structure, and only the loop body
> appears in the user code (as a lambda).  This notation is also old, found
> in Lisp and Smalltalk.
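>
> Concretely, over the same collection (an illustrative fragment, assuming
> a java.util.List<String> named names):
>
>     // external iteration: the user drives the Iterator explicitly
>     java.util.Iterator<String> it = names.iterator();
>     while (it.hasNext()) {
>         System.out.println(it.next());
>     }
>
>     // internal iteration: the collection drives the loop; the user
>     // supplies only the body, as a lambda
>     names.forEach(s -> System.out.println(s));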
>
> External iterators are easier to optimize, because their crucial iteration
> logic is always inlined into the specific loop request as coded by the
> user.  (It has to be, because the user writes it explicitly.)  Existing
> compilers, like HotSpot's, are good at optimizing "for" loops.
>
> HotSpot is less good at internal iterators.  If the original point of the
> user request fails to inline all the way into the internal looping part of
> the algorithm (a hidden "for" loop), the quality of the loop will be very
> poor.  (Exception:  With micro-benchmarks, the loop quality can be
> partially recovered by relying on a pure, clean profile.  But with real
> system code, the hidden "for" loop will have a polluted profile.)  This is
> the problem I have referred to as "loop customization", even though it
> applies to non-looping code as well (as long as there is a template that
> needs expanding in order to gain performance).
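>
> (To make "polluted profile" concrete: in real code the one hidden loop
> inside the library is reached from many pipelines with differently shaped
> lambdas, roughly as below - the types and accessors are invented for the
> example - so the call through the functional interface at that single
> point in the hidden loop is megamorphic.)
>
>     // three distinct request points in user code, but one shared hidden
>     // loop in the stream implementation sees all three lambda classes
>     long bigOrders   = orders.stream().filter(o -> o.total() > 100).count();
>     long activeUsers = users.stream().filter(u -> u.isActive()).count();
>     long errorEvents = events.stream().filter(e -> e.isError()).count();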
>
> If streams are to perform at peak, we need to be able to connect the user
> request (where the user would have written a classic "for" loop, but chose
> a stream instead) to the expansion, customization, and optimization of the
> hidden loop.  (Note that the hidden loop may run in a different thread!
> This defeats the usual forms of inlining.)  Somehow the conditions of the
> user request need to be communicated to the code that actually does the
> work.  The code that actually runs the loop must be customized to take into
> account whatever "syndrome" of template parameters (such as closure bodies
> or operand types) that are critical to optimizing the loop code.
>
> There are two natural scales of optimization:  Per-chunk (sped up by
> multi-threading) and per-loop-body (sped up by vectorization and
> unrolling).  Out of the box, the "parallel" modifier gives good
> multi-threading.  But I am afraid that the loop-body optimizations are less
> well behaved for streams.
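>
> (In code terms, the first scale is what the "parallel" modifier buys, and
> the second is what the JIT must do inside each chunk - e.g. unroll and
> vectorize a body like the one below, as it already will for the
> hand-written array loop.  The array "data" is just an illustrative int[].)
>
>     // per-chunk: the range is split across ForkJoin worker threads
>     long sum = java.util.stream.IntStream.range(0, data.length)
>             .parallel()
>             .mapToLong(i -> (long) data[i] * data[i])
>             .sum();
>
>     // per-loop-body: the shape HotSpot can unroll and vectorize directly
>     long sum2 = 0;
>     for (int i = 0; i < data.length; i++) {
>         sum2 += (long) data[i] * data[i];
>     }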
>
> I would like to encourage any interested colleagues to examine streams
> performance under the following conditions:
>
> 1. Head-to-head comparison of "for" loops and equivalent streams.
>
> 2. Proper vectorization of both forms of loops, at least for arraywise
> elemental operations, searches, and reductions.  This would include issuing
> the best vectorizing instructions available for the platform (I'm thinking
> Haswell, etc.).
>
> 3. Benchmark management that runs multiple loop examples per JVM, to
> realistically simulate "dirty" hidden-loop profiles (a JMH sketch of 3 and
> 4 follows below).
>
> 4. Artificial suppression of inlining from the request point (of a
> stream-based loop) to the algorithm's hidden loop, again to realistically
> simulate the compilation of loop kernels that run in multiple threads.
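>
> For 3 and 4, JMH already has the knobs: a @Setup that warms the shared
> stream machinery with several pipeline shapes in the same forked JVM, and
> inlining control via @CompilerControl or a -XX:CompileCommand passed
> through @Fork.  A rough sketch - the class name, sizes and the dontinline
> target are only examples:
>
>     import java.util.Random;
>     import java.util.stream.IntStream;
>     import org.openjdk.jmh.annotations.*;
>
>     @State(Scope.Thread)
>     @Fork(value = 1, jvmArgsAppend = {
>         // 4. keep the stream internals from inlining into their callers
>         "-XX:CompileCommand=dontinline,java/util/stream/AbstractPipeline.*"
>     })
>     public class DirtyProfileBench {
>
>         int[] data;
>
>         @Setup
>         public void setup() {
>             data = new Random(42).ints(1_000).toArray();
>             // 3. run several differently shaped pipelines in this JVM so
>             // the shared machinery accumulates a mixed ("dirty") profile
>             for (int i = 0; i < 20_000; i++) {
>                 IntStream.of(data).filter(x -> x > 0).count();
>                 IntStream.of(data).map(x -> x + 1).sum();
>                 IntStream.of(data).anyMatch(x -> x == 7);
>             }
>         }
>
>         // the measured pipeline then compiles against that dirty profile
>         @Benchmark
>         public long streamSum() {
>             return IntStream.of(data).asLongStream().sum();
>         }
>     }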
>
> All of the above examples are focused on measuring and improving
> per-loop-body optimizations (vectorization and other loop transforms).
> None of them need to be run with FJ or multiple threads.  The JMH framework
> would be very useful for running the tests.
>
> It may be that after a head-to-head comparison we will find that the
> HotSpot optimizer is better than I'm giving it credit for.  I have not made
> these studies myself.  But my usual experience is that rocks like this,
> when you turn them over, have something "interesting" under them.
>
> None of this is urgent.  I'm putting it out as a possibly interesting
> project for people to collaborate on.
>
> — John