ForkJoin wakeup update

Thu Oct 18 11:03:42 PDT 2012

On 10/18/12 13:29, Aleksey Shipilev wrote:

> Following Doug's advice, I've made a few experimental changes, described
> below:
>   * vanilla: plain vanilla FJP
>   * wakeup-3: wake up 3 threads instead of 2 during wakeups
>   * wakeup-4: wake up 4 threads instead of 2 during wakeups
>   * scan-busyloop: replace park in scan() with yield
>
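The vanilla and scan-busyloop variants differ in how an idle worker waits for work. Parking costs no CPU, but the submitter must pay for an unpark. Spinning with Thread.yield() picks work up almost at once, but it keeps a core busy. Below is a minimal sketch of the two strategies. It assumes a single worker and a one-slot volatile task field. IdleWaitDemo and its methods are hypothetical names, not ForkJoinPool's actual scan() code.

```java
import java.util.concurrent.locks.LockSupport;

/** Hypothetical illustration of park-based vs. yield-based idle waiting. */
public class IdleWaitDemo {
    // One-slot work handoff; volatile so the worker sees the submitter's write.
    static volatile Runnable task;

    /** Idle by parking: no CPU burned, but someone must unpark us. */
    static void parkingWait() {
        while (task == null) {
            LockSupport.park(); // may return spuriously, hence the re-check loop
        }
    }

    /** Idle by spinning with yield: fast pickup, but the core stays busy. */
    static void yieldingWait() {
        while (task == null) {
            Thread.yield();
        }
    }

    /** Performs one handoff and returns its latency in nanoseconds. */
    static long handoff(boolean park) {
        task = null;
        long[] pickedUpAt = new long[1];
        Thread worker = new Thread(() -> {
            if (park) parkingWait(); else yieldingWait();
            pickedUpAt[0] = System.nanoTime(); // moment the worker noticed work
            task.run();
        });
        worker.setDaemon(true); // never keep the JVM alive on an aborted run
        worker.start();
        try {
            Thread.sleep(50);                     // let the worker go idle first
            long start = System.nanoTime();
            task = () -> { };                     // publish the work...
            if (park) LockSupport.unpark(worker); // ...and wake a parked worker
            worker.join();                        // join makes pickedUpAt visible
            return pickedUpAt[0] - start;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("park : " + handoff(true) / 1_000 + " us");
        System.out.println("yield: " + handoff(false) / 1_000 + " us");
    }
}
```

Single-shot timings like these are noisy. The point is the mechanism, not the numbers.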
> On the same machine [4], the scores are:
>        vanilla: 429 +- 22 us/op
>       wakeup-3: 415 +- 15 us/op
>       wakeup-4: 390 +- 18 us/op
> scan-busyloop: 150 +- 15 us/op
>
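To put the scores in proportion, the following sketch computes each variant's speedup over vanilla. It also checks whether each variant's +- range overlaps vanilla's. The sketch treats the +- values as simple symmetric intervals, which is an assumption, because the mail does not say whether they are standard deviations or confidence intervals. WakeupScores is a hypothetical name.

```java
/** Speedup of each variant over vanilla, from the us/op scores quoted above. */
public class WakeupScores {
    static final String[] NAMES = {"vanilla", "wakeup-3", "wakeup-4", "scan-busyloop"};
    static final double[] US_PER_OP = {429, 415, 390, 150};
    static final double[] ERR_US = {22, 15, 18, 15}; // assumed symmetric +- ranges

    /** vanilla time / variant time; values above 1 mean the variant is faster. */
    static double speedup(int i) {
        return US_PER_OP[0] / US_PER_OP[i];
    }

    /** True if the [score - err, score + err] ranges of variants i and j overlap. */
    static boolean overlaps(int i, int j) {
        return US_PER_OP[i] - ERR_US[i] <= US_PER_OP[j] + ERR_US[j]
            && US_PER_OP[j] - ERR_US[j] <= US_PER_OP[i] + ERR_US[i];
    }

    public static void main(String[] args) {
        for (int i = 0; i < NAMES.length; i++) {
            System.out.printf("%-14s %.2fx, overlaps vanilla: %b%n",
                    NAMES[i], speedup(i), overlaps(0, i));
        }
    }
}
```

With these numbers, scan-busyloop comes out about 2.86x faster than vanilla. wakeup-3 (1.03x) and wakeup-4 (1.10x) both still overlap vanilla's range.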

Thanks! This leads to a much more precise version of
my main observations about FJ in other contexts:

* The break-even point (in terms of elements * cost per element)
can be reduced by about a factor of three (429 vs. 150 us/op above)
at the expense of wasting the world's power by keeping parallel
worker threads spinning, which we really do not want to do.

* Short of that, the case for being only a bit more aggressive
(and possibly wasteful) in initial startup wakeups isn't all that
compelling (wakeup-3 and wakeup-4 shave only about 3% and 9% off
vanilla), but might be worth more tuning effort.

* The main systems-level goal should be to find ways to make
wakeups faster. Might that entail supporting some sort of bulk
signalling facility?
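The wakeup-3 and wakeup-4 variants increase how many idle workers are signalled at each step. A toy cascade model helps show why that helps only modestly and why per-wakeup cost matters more. The model is an assumption for illustration, not how ForkJoinPool actually propagates signals. In the model, the submitter wakes k workers, and each newly woken worker wakes k more in the next round. The number of sequential rounds then grows roughly like log_k of the worker count. Each round still costs at least one unpark latency, so only cheaper (or bulk) wakeups reduce the cost of a round. WakeupCascade and rounds() are hypothetical names.

```java
/** Hypothetical fan-out wakeup cascade; not ForkJoinPool's algorithm. */
public class WakeupCascade {
    /**
     * Sequential wakeup rounds needed until at least `workers` threads are
     * active, if round 1 wakes `fanout` threads and every thread woken in a
     * round wakes `fanout` more in the following round.
     */
    static int rounds(int workers, int fanout) {
        if (fanout < 1) throw new IllegalArgumentException("fanout must be >= 1");
        long active = 0;     // threads woken so far
        long newlyWoken = 1; // the submitter acts as the initial waker
        int rounds = 0;
        while (active < workers) {
            newlyWoken *= fanout; // each fresh waker signals `fanout` others
            active += newlyWoken;
            rounds++;
        }
        return rounds;
    }

    public static void main(String[] args) {
        int workers = 32; // assumed: one worker per hardware thread on a 2x8x2 box
        for (int k = 2; k <= 4; k++) {
            System.out.println("fanout " + k + ": " + rounds(workers, k) + " rounds");
        }
    }
}
```

For 32 workers, this gives 5 rounds at k=2 and 3 rounds at both k=3 and k=4. The measured rampup (240, 180, 170 us) shrank less than that, so treat the model as directional only.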

-Doug

> The fjp-trace [2] results with tracing enabled are here [3]. The analysis
> of their subtrees follows:
>
> vanilla:
>     +60us: first handoff (dominated by unpark)
>    +240us: fjp rampup (all threads to wake up)
>     +20us: full parallelism
>     +15us: catch up (the final completer lags behind?)
>     +10us: result handoff
>
> wakeup-3:
>     +80us: first handoff (dominated by unpark)
>    +180us: fjp rampup (all threads to wake up)
>     +40us: full parallelism
>     +20us: catch up (the final completer lags behind?)
>     +10us: result handoff
>
> wakeup-4:
>     +70us: first handoff (dominated by unpark)
>    +170us: fjp rampup (all threads to wake up)
>     +50us: full parallelism
>     +30us: catch up (the final completer lags behind?)
>     +10us: result handoff
>
> scan-busyloop:
>      +5us: first handoff (nothing to unpark)
>     +10us: fjp rampup (pre-balancing)
>     +55us: full parallelism
>     +10us: catch up (the final completer lags behind?)
>     +10us: result handoff
>
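Summing the traced phases shows where each run's time goes. The following sketch copies the per-phase microsecond figures from the breakdown above. TraceBreakdown is a hypothetical helper name. Here, "wakeup share" means first handoff plus rampup, which is the time spent before all workers are busy.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sums the per-phase trace timings (us) quoted above for each variant. */
public class TraceBreakdown {
    // Phase order: first handoff, rampup, full parallelism, catch up, result handoff
    static final Map<String, int[]> PHASES_US = new LinkedHashMap<>();
    static {
        PHASES_US.put("vanilla",       new int[] {60, 240, 20, 15, 10});
        PHASES_US.put("wakeup-3",      new int[] {80, 180, 40, 20, 10});
        PHASES_US.put("wakeup-4",      new int[] {70, 170, 50, 30, 10});
        PHASES_US.put("scan-busyloop", new int[] { 5,  10, 55, 10, 10});
    }

    /** Total traced time for one variant, in microseconds. */
    static int total(String variant) {
        int sum = 0;
        for (int us : PHASES_US.get(variant)) sum += us;
        return sum;
    }

    /** Fraction of traced time spent before full parallelism (handoff + rampup). */
    static double wakeupShare(String variant) {
        int[] p = PHASES_US.get(variant);
        return (double) (p[0] + p[1]) / total(variant);
    }

    public static void main(String[] args) {
        for (String v : PHASES_US.keySet()) {
            System.out.printf("%-14s total %3d us, handoff+rampup %.0f%%%n",
                    v, total(v), 100 * wakeupShare(v));
        }
    }
}
```

This puts handoff plus rampup at about 87% of the traced vanilla run (300 of 345 us). For the scan-busyloop run, the figure is about 17% (15 of 90 us).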
> Thanks,
> Aleksey.
>
> [1] http://mail.openjdk.java.net/pipermail/lambda-dev/2012-October/006167.html
> [2] https://github.com/shipilev/fjp-trace
> [3] http://shipilev.net/pub/jdk/lambda/20121017-fjpwakeup/
> [4] 2x8x2 (sockets x cores x hardware threads) Xeon E5-2680 (SandyBridge)
> running Solaris 11, with the 20121003 lambda nightly and -d64
> -XX:-TieredCompilation -XX:+UseParallelOldGC -XX:+UseNUMA
> -XX:-UseBiasedLocking -XX:+UseCondCardMark
>