Par vs. Seq decomposition experiment update

Mon Dec 10 08:02:41 PST 2012

Hi,

Dust had settled over FJP, and other library changes, so here is the
update on my par-vs-seq experiments. We are using the same decomposition
benchmark on 2x8x2 Xeon E5-2680
(SandyBridge) running Solaris 11, and 20121203 lambda nightly with -d64
-XX:-TieredCompilation -XX:+UseParallelOldGC -XX:+UseNUMA
-XX:-UseBiasedLocking -XX:+UseCondCardMark.

tl;dr version of the experiment: hand-crafted stream source for longs in
(0; N], simple filter (with variable cost Q) to empty sink, called by C
clients, stream operations services by (fj)pool of size P.

We concentrate on C=1, P=32 plane in this experiment.

Clients are requesting the pipeline computation every once in a while,
we let 10ms thinktime between requests for everything to settle down
(this is a harsh mode for FJP to run in, but realistic enough for our
use cases). Having thinktimes greatly reduces the number of tasks, and
so we are able to do proper time sampling, which in turn lets us to
compute percentiles on execution times. This filters out GCs, as well as
provide more insights into the variance.

The results are here:
 http://shipilev.net/pub/jdk/lambda/20121205-breakeven/

There are lots of data, but not too complex.

BIRD'S EYE VIEW:

breakeven.pdf [1] showcases the ratios between seq and par times:
  * with large enough N and Q we are enjoying very nice speedups, up to
scaling nearly perfectly on the target machine, on almost all percentile
values. The only exception is max time, which is problematic for
parallel versions (due to longer GCs incurring more stalls with more
threads?)
  * as usual, break-even front seems to lie through N*Q isoline,
somewhere around N*Q = 10^5 for min time, and N*Q = 5*10^4 for p90.

times.pdf [2] showcases the raw execution times:
  * it is probably more convenient to look at logarithmic scales
  * seq seems to react nicely to both N and Q, except for max time
again, which is more affected by N (my hypothesis is that we have much
better chance to hit GC with larger N during the measurement window)
  * par execution times are consistently worse, and the only reason why
par is winning in [1] is parallelism

spreads.pdf [3] showcases the execution time spreads:
  * seq experiences very low variance with large N*Q
  * seq experiences mediocre variance with low N*Q (I'd speculate,
playing against power states here)
  * the weird and reproducible behavior is that seq has the red area
right on break-even front, where variance is really high!
  * par experiences more or less consistent variance across N*Q

FORK/JOIN TRACES:

As usual, we do fjp-trace [4] on some break-even point to understand how
FJP performs there. The sample decomposition tree for parallel operation
at N=100, Q=1000 is here [5]:
  * submitter/common pool changes are settled in, so submitter is doing
lots of work
  * first thread wakes up in the pool 100us after the submission
  * full-blown parallelism is there 200us after the submission
  * we run with full parallel for another ~200us
  * the final handoff burns another 40us
  * there are lots of completion actions, and their summary impact is
more or less visible, if you count the total times between the events:
              EXEC -> EXECUTED: 7021ms
       COMPLETING -> COMPLETED: 1603ms

For the reference, this is how N=1000, Q=100000 (very saturated FJP)
looks like [6]. What's interesting is that completions are taking much
less time there:
              EXEC -> EXECUTED: 117537ms
       COMPLETING -> COMPLETED: 123ms

Maybe this is something worth for me to look at. I think Doug had some
thoughts on the completion mechanics we have right now.

Thanks,
Aleksey.

[1] http://shipilev.net/pub/jdk/lambda/20121205-breakeven/breakeven.pdf
[2] http://shipilev.net/pub/jdk/lambda/20121205-breakeven/times.pdf
[3] http://shipilev.net/pub/jdk/lambda/20121205-breakeven/spreads.pdf
[4] https://github.com/shipilev/fjp-trace
[5]
http://shipilev.net/pub/jdk/lambda/20121205-breakeven/fjp-breakeven/forkjoin.trace-subtrees.png
[6]
http://shipilev.net/pub/jdk/lambda/20121205-breakeven/fjp-full/forkjoin.trace-subtrees.png