20120925 C/N/P/Q break-even analysis

Aleksey Shipilev aleksey.shipilev at oracle.com
Tue Oct 9 05:35:51 PDT 2012


Hi there,

(I think this is mostly of interest to Doug. Dave is CC'ed if he wants
to be in the loop).

Below is a layman's analysis of the break-even point from the latest
decomposition experiment. We had to choose a single point on the break-even
front (i.e. the isoline where the sequential and parallel versions
run at the same speed), and that point is N=3000, Q=20, C=1, P=32.
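
For concreteness, here is a minimal sketch of how I read those knobs
(N = source size, Q = per-element work, P = pool parallelism, C = number
of concurrent clients, a single one here): a plain sequential loop versus
a divide-and-conquer version on an FJP over the same data. The class names,
the synthetic op, and the leaf threshold are assumptions for illustration,
not the actual benchmark code:

  import java.util.concurrent.ForkJoinPool;
  import java.util.concurrent.RecursiveTask;

  public class BreakEvenSketch {
      static final int N = 3000;   // source size
      static final int Q = 20;     // per-element work, in units of the synthetic op
      static final int P = 32;     // FJP parallelism; C = 1 (single submitter)

      // Q units of synthetic work per element.
      static long op(long x) {
          long r = x;
          for (int i = 0; i < Q; i++) r = r * 31 + i;
          return r;
      }

      // Sequential baseline: a plain loop over the source.
      static long sequential(long[] data) {
          long sum = 0;
          for (long x : data) sum += op(x);
          return sum;
      }

      // Parallel version: divide-and-conquer over the same data.
      static final class SumTask extends RecursiveTask<Long> {
          final long[] data; final int from, to;
          SumTask(long[] data, int from, int to) { this.data = data; this.from = from; this.to = to; }
          @Override protected Long compute() {
              // Leaf threshold is an assumption; 12 happens to decompose
              // N=3000 into 256 leaves, in line with the subtask counts below.
              if (to - from <= 12) {
                  long sum = 0;
                  for (int i = from; i < to; i++) sum += op(data[i]);
                  return sum;
              }
              int mid = (from + to) >>> 1;
              SumTask left = new SumTask(data, from, mid);
              left.fork();
              SumTask right = new SumTask(data, mid, to);
              return right.compute() + left.join();
          }
      }

      public static void main(String[] args) {
          long[] data = new long[N];
          for (int i = 0; i < N; i++) data[i] = i;
          ForkJoinPool pool = new ForkJoinPool(P);
          System.out.println("seq: " + sequential(data));
          System.out.println("par: " + pool.invoke(new SumTask(data, 0, N)));
      }
  }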

I had also modified the test a bit to measure one-shot times, instead of
steady throughput, so that the submissions to the library code are done
with ~10ms of think time. This allows FJP to reach quiescence in between,
and also exacerbates the thread park/unpark behaviors. This lets us run
lots of iterations, in this case 1000+ iterations, spanning over a minute
in a single run. Bearing in mind the lesson about run-to-run variance,
we also measured 5 consecutive JVM invocations.
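
The measurement loop is roughly the following (again a sketch, not the
actual harness; the pool size, iteration count and think time are as
described above, everything else is illustrative):

  import java.util.concurrent.ForkJoinPool;
  import java.util.concurrent.RecursiveAction;

  public class OneShotHarness {
      // Stand-in for the real decomposed workload (see the sketch above).
      static class Work extends RecursiveAction {
          @Override protected void compute() { /* ... decompose and run ... */ }
      }

      public static void main(String[] args) throws InterruptedException {
          ForkJoinPool pool = new ForkJoinPool(32);
          int iterations = 1000;
          long[] samples = new long[iterations];
          for (int i = 0; i < iterations; i++) {
              long start = System.nanoTime();
              pool.invoke(new Work());     // one-shot time of a single external task
              samples[i] = System.nanoTime() - start;
              Thread.sleep(10);            // ~10ms think time: pool quiesces, workers park
          }
          long sum = 0;
          for (long s : samples) sum += s;
          System.out.printf("mean = %.1f us%n", sum / 1000.0 / iterations);
      }
  }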

The target host was a 2x8x2 Xeon E5-2680 (Sandy Bridge) running
Solaris 11, with the 20120925 lambda nightly and -d64 -XX:-TieredCompilation
-XX:+UseParallelOldGC -XX:+UseNUMA -XX:-UseBiasedLocking
-XX:+UseCondCardMark.

In short, the performance limiter at the break-even point seems to be the
very light tasks: the execution is dominated by threads wandering
around, scanning/parking/unparking, and inducing lots of context switches on
their way. I don't think there is much that can be done about this in
FJP, although maybe Doug has some black magic up his sleeve. The
predominant stack trace is:

"ForkJoinPool-1-worker-2" #12 daemon prio=5 os_prio=64
tid=0x00000000052a9260 nid=0x25 runnable [0xfffffd7fe1af7000]
   j.l.Thread.State: RUNNABLE
       at j.l.Thread.yield(Native Method)
       at j.u.c.ForkJoinPool.scan(ForkJoinPool.java:1588)
       at j.u.c.ForkJoinPool.runWorker(ForkJoinPool.java:1479)
       at j.u.c.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
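
(A crude in-process sampler along the following lines would be enough to
confirm this; it is just an illustration of the approach, counting how
often FJP workers are caught inside ForkJoinPool.scan(), not the tooling
behind the traces below.)

  import java.lang.management.ManagementFactory;
  import java.lang.management.ThreadInfo;
  import java.lang.management.ThreadMXBean;

  // Run in a daemon thread alongside the workload, in the same JVM.
  public class ScanSampler implements Runnable {
      @Override public void run() {
          ThreadMXBean bean = ManagementFactory.getThreadMXBean();
          int total = 0, inScan = 0;
          try {
              for (int s = 0; s < 1000; s++) {
                  for (ThreadInfo ti : bean.dumpAllThreads(false, false)) {
                      if (!ti.getThreadName().contains("ForkJoinPool")) continue;
                      total++;
                      for (StackTraceElement e : ti.getStackTrace()) {
                          if (e.getClassName().endsWith("ForkJoinPool")
                                  && e.getMethodName().equals("scan")) {
                              inScan++;
                              break;
                          }
                      }
                  }
                  Thread.sleep(1);
              }
          } catch (InterruptedException ignored) {
          }
          System.out.println(inScan + " of " + total + " worker samples inside scan()");
      }
  }

Started via new Thread(new ScanSampler()).start() right before the
measurement loop, it should show the scan() frames dominating at this point
of the configuration space.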


There are 256 subtasks born for each external task, and on average each
subtask spends ~25us in execution. You can go through the detailed FJP
traces here [2]. Note that the total execution time for the whole subtask
graph is very slim, less than 500us.

Also, varying P shows this behavior:

  seq base:  364 +-   67 us
  P = 1:     462 +-   45 us
  P = 2:     359 +-    2 us
  P = 4:     332 +-    2 us
  P = 8:     331 +-    2 us
  P = 16:    338 +-    2 us
  P = 32:    379 +-    2 us  // hyper-threads in action?
  P = 64:   4758 +- 1048 us  // lots of heavy outliers
  P = 128: 34500 +- 3948 us  // lots of heavy outliers

A couple of observations: the parallel version improves with larger P up to
P=8, and then declines. I wonder if that means we should constrain FJP
from spreading work onto the hyper-threads.
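
If we want to try that, one crude way is to size the pool to the physical
core count instead of Runtime.availableProcessors() (which counts hardware
threads). There is no portable Java API for the physical core count, so the
sketch below simply assumes 2-way SMT and halves the reported CPU count:

  import java.util.concurrent.ForkJoinPool;

  public class PhysicalCorePool {
      public static ForkJoinPool newPool() {
          int hwThreads = Runtime.getRuntime().availableProcessors(); // 32 on this box
          int physicalCores = Math.max(1, hwThreads / 2);             // 16, assuming 2-way SMT
          return new ForkJoinPool(physicalCores);
      }
  }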

-Aleksey.

[1]
http://mail.openjdk.java.net/pipermail/lambda-dev/2012-October/006088.html
[2] http://shipilev.net/pub/jdk/lambda/20120925-breakeven/

