20120925 C/N/P/Q break-even analysis
Dave Dice
dave.dice at oracle.com
Tue Oct 9 09:01:11 PDT 2012
Hi Aleksey,
Is the source available? I'd like to take a look at this on a few other platforms.

Regards,
-Dave
On 2012-10-9, at 8:35 AM, Aleksey Shipilev <aleksey.shipilev at oracle.com> wrote:
> Hi there,
>
> (I think this is mostly of interest to Doug. Dave is CC'ed in case he
> wants to be in the loop.)
>
> Below is a layman's analysis of the break-even point from the latest
> decomposition experiment [1]. We had to choose a single point on the
> break-even front (i.e. the isoline where the sequential and parallel
> versions run at the same speed), and that point is N=3000, Q=20, C=1,
> P=32.
>
> I had also modified the test a bit to measure one-shot time, instead
> of steady throughput, so that submissions to the library code are done
> with ~10ms of think time. This allows FJP to reach quiescence in
> between, and also exacerbates the thread park/unpark behavior. It also
> lets us run lots of iterations, in this case 1000+, spanning over a
> minute per run. Keeping in mind the lesson about run-to-run variance,
> we also measured 5 consecutive JVM invocations.
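>
> For illustration, here is a minimal sketch of that one-shot measurement
> scheme, not the actual harness: the OneShotBench/DecomposeTask names
> and the reporting are placeholders, and only the pool size, think time
> and iteration count follow the numbers above.
>
> import java.util.concurrent.ForkJoinPool;
> import java.util.concurrent.RecursiveAction;
> import java.util.concurrent.TimeUnit;
>
> public class OneShotBench {
>     static final int ITERATIONS = 1000;  // 1000+ one-shot samples per run
>     static final long THINK_MS = 10;     // ~10ms think time between submissions
>
>     public static void main(String[] args) throws Exception {
>         ForkJoinPool pool = new ForkJoinPool(32); // P = 32 at break-even
>         long[] samples = new long[ITERATIONS];
>         for (int i = 0; i < ITERATIONS; i++) {
>             long start = System.nanoTime();
>             pool.invoke(new DecomposeTask()); // one external task (N=3000, Q=20, C=1)
>             samples[i] = System.nanoTime() - start;
>             // Think time: the pool goes quiescent, workers park, and the
>             // next submission exercises the wake-up/scan path again.
>             TimeUnit.MILLISECONDS.sleep(THINK_MS);
>         }
>         // ... report mean +- error over samples; repeat for 5 JVM runs ...
>     }
>
>     // Placeholder for the decomposed workload (256 leaf subtasks, ~25us each).
>     static class DecomposeTask extends RecursiveAction {
>         @Override protected void compute() { /* split and run leaf work */ }
>     }
> }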
>
> The target host was a 2x8x2 Xeon E5-2680 (Sandy Bridge) running
> Solaris 11, with the 20120925 lambda nightly and -d64
> -XX:-TieredCompilation -XX:+UseParallelOldGC -XX:+UseNUMA
> -XX:-UseBiasedLocking -XX:+UseCondCardMark.
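>
> (That is, roughly the following invocation; the benchmark class name
> is just the placeholder from the sketch above:)
>
> java -d64 -XX:-TieredCompilation -XX:+UseParallelOldGC -XX:+UseNUMA \
>      -XX:-UseBiasedLocking -XX:+UseCondCardMark OneShotBench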
>
> In short, the performance limiter at the break-even point seems to be
> the very light tasks: the execution is dominated by threads wandering
> around, scanning/parking/unparking, and inducing lots of context
> switches on their way. I don't think there is anything that can be
> done about this in FJP, although maybe Doug has some black magic up
> his sleeve. The predominant stack trace is:
>
> "ForkJoinPool-1-worker-2" #12 daemon prio=5 os_prio=64
> tid=0x00000000052a9260 nid=0x25 runnable [0xfffffd7fe1af7000]
> j.l.Thread.State: RUNNABLE
> at j.l.Thread.yield(Native Method)
> at j.u.c.ForkJoinPool.scan(ForkJoinPool.java:1588)
> at j.u.c.ForkJoinPool.runWorker(ForkJoinPool.java:1479)
> at j.u.c.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
>
>
> There are 256 subtasks born for each external task, and on average
> each subtask spends ~25us in execution. You can go through the
> detailed FJP traces here [2]. Note that the total execution time for
> the whole subtask graph is very small, less than 500us.
>
> Also, varying P shows this behavior:
>
> seq base: 364 +- 67 us
> P = 1: 462 +- 45 us
> P = 2: 359 +- 2 us
> P = 4: 332 +- 2 us
> P = 8: 331 +- 2 us
> P = 16: 338 +- 2 us
> P = 32: 379 +- 2 us // hyper-threads in action?
> P = 64: 4758 +- 1048 us // lots of heavy outliers
> P = 128: 34500 +- 3948 us // lots of heavy outliers
>
> A couple of observations: the parallel version improves with larger P
> up to 8, and then declines. I wonder if that means we should constrain
> FJP from extending work onto the hyper-threads.
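>
> (If we wanted to try that, the pool parallelism can be capped
> explicitly instead of defaulting to the number of hardware threads; a
> hedged sketch, assuming 2 hardware threads per core as on this host:)
>
> int hwThreads = Runtime.getRuntime().availableProcessors(); // 32 here
> int physicalCores = hwThreads / 2;                          // 2 HT per core
> ForkJoinPool pool = new ForkJoinPool(Math.max(1, physicalCores));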
>
> -Aleksey.
>
> [1]
> http://mail.openjdk.java.net/pipermail/lambda-dev/2012-October/006088.html
> [2] http://shipilev.net/pub/jdk/lambda/20120925-breakeven/