Par vs. Seq decomposition experiment update

Mon Dec 10 08:21:00 PST 2012

Has the N*Q isoline moved closer to the origin compared to previous 
experiments?

Separately, I have a good understanding of what N=10^3 means.  But how 
much "real computation" does Q=10^2 mean?

On 12/10/2012 11:02 AM, Aleksey Shipilev wrote:
> Hi,
>
> Dust had settled over FJP, and other library changes, so here is the
> update on my par-vs-seq experiments. We are using the same decomposition
> benchmark on 2x8x2 Xeon E5-2680
> (SandyBridge) running Solaris 11, and 20121203 lambda nightly with -d64
> -XX:-TieredCompilation -XX:+UseParallelOldGC -XX:+UseNUMA
> -XX:-UseBiasedLocking -XX:+UseCondCardMark.
>
> tl;dr version of the experiment: hand-crafted stream source for longs in
> (0; N], simple filter (with variable cost Q) to empty sink, called by C
> clients, stream operations services by (fj)pool of size P.
>
> We concentrate on C=1, P=32 plane in this experiment.
>
> Clients are requesting the pipeline computation every once in a while,
> we let 10ms thinktime between requests for everything to settle down
> (this is a harsh mode for FJP to run in, but realistic enough for our
> use cases). Having thinktimes greatly reduces the number of tasks, and
> so we are able to do proper time sampling, which in turn lets us to
> compute percentiles on execution times. This filters out GCs, as well as
> provide more insights into the variance.
>
> The results are here:
>   http://shipilev.net/pub/jdk/lambda/20121205-breakeven/
>
> There are lots of data, but not too complex.
>
> BIRD'S EYE VIEW:
>
> breakeven.pdf [1] showcases the ratios between seq and par times:
>    * with large enough N and Q we are enjoying very nice speedups, up to
> scaling nearly perfectly on the target machine, on almost all percentile
> values. The only exception is max time, which is problematic for
> parallel versions (due to longer GCs incurring more stalls with more
> threads?)
>    * as usual, break-even front seems to lie through N*Q isoline,
> somewhere around N*Q = 10^5 for min time, and N*Q = 5*10^4 for p90.
>
> times.pdf [2] showcases the raw execution times:
>    * it is probably more convenient to look at logarithmic scales
>    * seq seems to react nicely to both N and Q, except for max time
> again, which is more affected by N (my hypothesis is that we have much
> better chance to hit GC with larger N during the measurement window)
>    * par execution times are consistently worse, and the only reason why
> par is winning in [1] is parallelism
>
> spreads.pdf [3] showcases the execution time spreads:
>    * seq experiences very low variance with large N*Q
>    * seq experiences mediocre variance with low N*Q (I'd speculate,
> playing against power states here)
>    * the weird and reproducible behavior is that seq has the red area
> right on break-even front, where variance is really high!
>    * par experiences more or less consistent variance across N*Q
>
>
> FORK/JOIN TRACES:
>
> As usual, we do fjp-trace [4] on some break-even point to understand how
> FJP performs there. The sample decomposition tree for parallel operation
> at N=100, Q=1000 is here [5]:
>    * submitter/common pool changes are settled in, so submitter is doing
> lots of work
>    * first thread wakes up in the pool 100us after the submission
>    * full-blown parallelism is there 200us after the submission
>    * we run with full parallel for another ~200us
>    * the final handoff burns another 40us
>    * there are lots of completion actions, and their summary impact is
> more or less visible, if you count the total times between the events:
>                EXEC -> EXECUTED: 7021ms
>         COMPLETING -> COMPLETED: 1603ms
>
>
> For the reference, this is how N=1000, Q=100000 (very saturated FJP)
> looks like [6]. What's interesting is that completions are taking much
> less time there:
>                EXEC -> EXECUTED: 117537ms
>         COMPLETING -> COMPLETED: 123ms
>
> Maybe this is something worth for me to look at. I think Doug had some
> thoughts on the completion mechanics we have right now.
>
> Thanks,
> Aleksey.
>
> [1] http://shipilev.net/pub/jdk/lambda/20121205-breakeven/breakeven.pdf
> [2] http://shipilev.net/pub/jdk/lambda/20121205-breakeven/times.pdf
> [3] http://shipilev.net/pub/jdk/lambda/20121205-breakeven/spreads.pdf
> [4] https://github.com/shipilev/fjp-trace
> [5]
> http://shipilev.net/pub/jdk/lambda/20121205-breakeven/fjp-breakeven/forkjoin.trace-subtrees.png
> [6]
> http://shipilev.net/pub/jdk/lambda/20121205-breakeven/fjp-full/forkjoin.trace-subtrees.png
>