20120925 C/N/P/Q break-even analysis

Dave Dice dave.dice at oracle.com
Tue Oct 9 09:01:11 PDT 2012


Hi Aleksey,

Is the source available?   I'd like to take a look at this on a few other platforms.   Regards, -Dave

On 2012-10-9, at 8:35 AM, Aleksey Shipilev <aleksey.shipilev at oracle.com> wrote:

> Hi there,
> 
> (I think this is mostly of interest to Doug. Dave is CC'ed in case he
> wants to be in the loop.)
> 
> Below is a layman's analysis of the break-even point from the latest
> decomposition experiment. We had to choose a single point on the
> break-even front (i.e. the isoline where the sequential and parallel
> versions run at the same speed), and that point is N=3000, Q=20, C=1, P=32.
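> 
> Roughly, the per-submission operation has this shape (a simplified
> sketch, not the actual benchmark code; the names and the work() body
> here are just stand-ins):
> 
>    import java.util.stream.IntStream;
> 
>    public class BreakEvenShape {
>        static final int N = 3_000;  // source size
>        static final int Q = 20;     // per-element work quantum
> 
>        // simulated per-element work of cost ~Q
>        static int work(int x) {
>            int r = x;
>            for (int i = 0; i < Q; i++) r = r * 31 + i;
>            return r;
>        }
> 
>        public static void main(String[] args) {
>            // sequential version
>            int seq = IntStream.range(0, N).map(BreakEvenShape::work).sum();
>            // parallel version, decomposed over a ForkJoinPool with P workers
>            int par = IntStream.range(0, N).parallel().map(BreakEvenShape::work).sum();
>            System.out.println(seq + " " + par);
>        }
>    }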
> 
> I had also modified the test a bit to measure one-shot time, instead of
> steady throughput, so that the submissions to the library code are done
> with ~10 ms think time. This allows FJP to reach quiescence in between,
> and also accentuates the thread park/unpark behavior. It also lets us
> run lots of iterations, 1000+ in this case, spanning over a minute in a
> single run. Keeping in mind the lesson about run-to-run variance, we
> also measured 5 consecutive JVM invocations.
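> 
> The measurement loop is essentially this (a simplified sketch; the real
> harness differs, and runParallelVersion() is just a stand-in for the
> measured operation):
> 
>    public class OneShotHarness {
>        // stand-in for the measured parallel operation
>        static void runParallelVersion() { /* ... */ }
> 
>        public static void main(String[] args) throws InterruptedException {
>            long[] samples = new long[1000];
>            for (int i = 0; i < samples.length; i++) {
>                long start = System.nanoTime();
>                runParallelVersion();            // single external submission
>                samples[i] = System.nanoTime() - start;
>                Thread.sleep(10);                // ~10 ms think time: FJP quiesces, workers park
>            }
>        }
>    }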
> 
> The target host was a 2x8x2 Xeon E5-2680 (Sandy Bridge) running
> Solaris 11, with the 20120925 lambda nightly and -d64 -XX:-TieredCompilation
> -XX:+UseParallelOldGC -XX:+UseNUMA -XX:-UseBiasedLocking
> -XX:+UseCondCardMark.
> 
> In short, the performance limiter at the break-even point seems to be
> the very light tasks: the execution is dominated by threads wandering
> around, scanning/parking/unparking, and inducing lots of context switches
> on their way. I don't think there is anything that can be done about
> this in FJP, although maybe Doug has some black magic up his sleeve. The
> predominant stack trace is:
> 
> "ForkJoinPool-1-worker-2" #12 daemon prio=5 os_prio=64
> tid=0x00000000052a9260 nid=0x25 runnable [0xfffffd7fe1af7000]
>   j.l.Thread.State: RUNNABLE
>       at j.l.Thread.yield(Native Method)
>       at j.u.c.ForkJoinPool.scan(ForkJoinPool.java:1588)
>       at j.u.c.ForkJoinPool.runWorker(ForkJoinPool.java:1479)
>       at j.u.c.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
> 
> 
> There are 256 subtasks born per external task, and on average each
> subtask spends ~25 us in execution. You can go through the detailed FJP
> traces here [2]. Note that the total execution time for the whole
> subtask graph is very slim, less than 500 us.
> 
> Also, varying P shows this behavior:
> 
>  seq base:  364 +-   67 us
>  P = 1:     462 +-   45 us
>  P = 2:     359 +-    2 us
>  P = 4:     332 +-    2 us
>  P = 8:     331 +-    2 us
>  P = 16:    338 +-    2 us
>  P = 32:    379 +-    2 us  // hyper-threads in action?
>  P = 64:   4758 +- 1048 us  // lots of heavy outliers
>  P = 128: 34500 +- 3948 us  // lots of heavy outliers
> 
> A couple of observations: the parallel version improves with larger P
> up to 8, and then declines. I wonder if that means we should constrain
> FJP from extending onto the hyper-threads.
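> 
> One simple knob, if we wanted to try that, is to size the pool to
> physical cores instead of hardware threads. A minimal sketch, assuming
> two hardware threads per core as on this host:
> 
>    import java.util.concurrent.ForkJoinPool;
> 
>    public class PhysicalCorePool {
>        public static void main(String[] args) {
>            // 32 hardware threads on this host -> 16 physical cores
>            int hwThreads = Runtime.getRuntime().availableProcessors();
>            ForkJoinPool pool = new ForkJoinPool(Math.max(1, hwThreads / 2));
>            System.out.println("parallelism = " + pool.getParallelism());
>        }
>    }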
> 
> -Aleksey.
> 
> [1]
> http://mail.openjdk.java.net/pipermail/lambda-dev/2012-October/006088.html
> [2] http://shipilev.net/pub/jdk/lambda/20120925-breakeven/


