C/P/N/Q par vs. seq break-even analysis with 10ms think time
Edward Harned
edharned at gmail.com
Thu Oct 18 05:30:55 PDT 2012
Aleksey,
Thanks for the invitation, but I’m an application programmer, not a kernel
programmer. The F/J software I maintain considers flexibility, simplicity,
error recovery, and accountability before raw speed. So I don’t have a
parallel engine alternative to FJP.
I do understand that FJP is all you have. That is the problem. Why haven’t
all those pragmatic engineers put C# to shame? Relying on an
academic-centric niche product doesn’t seem so pragmatic.
I answered your original question, “the execution is dominated by threads
wandering around,” by pointing to the work stealing technique. Work stealing
and work sharing each perform well, but with different attributes. For a
general-purpose, commercial application development platform, my pick is
work sharing.
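To make the distinction concrete, here is a minimal sketch of the two styles
(class and method names are mine, purely illustrative, and the thresholds are
arbitrary):

    import java.util.concurrent.*;

    public class SchedulingStyles {
        // Work sharing: submitters put tasks on one shared queue and idle
        // threads take from it (the classic ThreadPoolExecutor setup).
        static void workSharing() throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(4);
            Future<Integer> f = pool.submit(() -> 6 * 7);
            System.out.println("shared-queue result: " + f.get());
            pool.shutdown();
        }

        // Work stealing: each worker owns a deque; fork() pushes onto the
        // local deque and idle workers steal from the other end.
        static class Sum extends RecursiveTask<Long> {
            final long lo, hi;
            Sum(long lo, long hi) { this.lo = lo; this.hi = hi; }
            protected Long compute() {
                if (hi - lo <= 1_000) {
                    long s = 0;
                    for (long i = lo; i < hi; i++) s += i;
                    return s;
                }
                long mid = (lo + hi) >>> 1;
                Sum right = new Sum(mid, hi);
                right.fork();                    // now stealable by idle workers
                return new Sum(lo, mid).compute() + right.join();
            }
        }

        public static void main(String[] args) throws Exception {
            workSharing();
            ForkJoinPool fjp = new ForkJoinPool(4);
            System.out.println("work-stealing result: " + fjp.invoke(new Sum(0, 1_000_000)));
            fjp.shutdown();
        }
    }

Both styles get the answer; the difference is who pays for queue contention
(sharing) versus who pays for wakeup and steal traffic (stealing), and which
cost a commercial shop can live with.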
The embarrassment comes from the “continuation threads” problem. That is
truly an awful practice. As an application developer, I can tell you that
people very, very, very frequently use software for something other than its
intended purpose. (I use F/J for speculative computing.) The more the
community depends on FJP as the parallel engine (lambda, JEP 103), the more
chance for calamity. Also, consider the developer who submits an impact
statement to the data center:
Q. How many threads does this software create?
A. 8 or maybe 80 or maybe 800.
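That answer is not a joke. Here is a toy demonstration of why it is honest
(the numbers are made up; the mechanism, managed blocking spawning
compensation threads, is the real one):

    import java.util.concurrent.*;

    public class ThreadCountDemo {
        public static void main(String[] args) throws Exception {
            ForkJoinPool pool = new ForkJoinPool(8);   // nominally "8 threads"
            for (int i = 0; i < 80; i++) {
                pool.submit(() -> {
                    try {
                        // Blocking inside the pool makes it spawn compensation
                        // threads to preserve the parallelism level.
                        ForkJoinPool.managedBlock(new ForkJoinPool.ManagedBlocker() {
                            public boolean block() throws InterruptedException {
                                Thread.sleep(1_000);   // stand-in for real blocking
                                return true;
                            }
                            public boolean isReleasable() { return false; }
                        });
                    } catch (InterruptedException ignored) { }
                });
            }
            Thread.sleep(500);
            System.out.println("pool size: " + pool.getPoolSize()); // often far above 8
            pool.shutdownNow();
        }
    }

A pool configured for 8 can report dozens of live threads while the blockers
are parked, which is exactly the sizing surprise a data center hates.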
One last thing: I’ve never blown hot air. The article has twelve solid
points with realistic examples and software for comparison. All I’ve said
for over two years now is that FJP doesn’t belong in core Java. And I’ve
always been polite and respectful to everyone, especially Doug Lea.
One other last thing: Your decomposition benchmark experiments are splendid.
Ed
On Tue, Oct 16, 2012 at 11:27 AM, Aleksey Shipilev <aleksey.shipilev at oracle.com> wrote:
> Hi,
>
> This is a more thorough analysis of what's going on at the break-even
> point in the C/P/N/Q experiment [1]. I took fjp-trace [2] profiles at the
> break-even point, and the results are here [3]. A new feature of
> fjp-trace can reconstruct the entire decomposition tree, which you
> might want to peek at here [4].
>
> Observations:
> - notice that the handoff from the submitter to the FJP takes quite a bit
> of time, around 70us in this case;
> - the entire task finishes in ~500us, but the trace shows execution for
> only ~310us. This is due to the fjp-trace architecture, which cannot (yet)
> record the JOIN in external submitters. It might very well mean the
> handoff back to the blocked submitter takes another ~100us;
> - threads wake up rather slowly (on this timescale); full-blown
> parallelism lasts for only around 50us.
>
> So, here's what we have on the table. If I understand this data
> correctly, the ~500us execution divides as:
> ~70us: handoff to FJP
> ~200us: FJP rampup
> ~50us: FJP steady (even though lots of balancing)
> ~100us: result handoff
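>
> Without fjp-trace one can see the two handoff edges with a crude
> timestamp probe (illustrative only; names are mine):
>
>     import java.util.concurrent.*;
>
>     public class HandoffProbe {
>         public static void main(String[] args) {
>             ForkJoinPool pool = new ForkJoinPool();
>             long t0 = System.nanoTime();               // before external submit
>             ForkJoinTask<Long> task = pool.submit(new RecursiveTask<Long>() {
>                 protected Long compute() {
>                     return System.nanoTime();          // first worker touch
>                 }
>             });
>             long tWorker = task.join();
>             long t1 = System.nanoTime();               // result back at submitter
>             System.out.printf("handoff in: %dus, round trip: %dus%n",
>                               (tWorker - t0) / 1000, (t1 - t0) / 1000);
>         }
>     }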
>
> That means that if we want to pursue parallel decomposition at smaller
> scales, we need to figure out the rampup effects first. I have yet to
> determine whether the rampup is due to sequential decomposition in the
> lambda code, or to genuine threading lags.
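>
> One quick way to tell the two apart is to timestamp when each worker
> first runs anything after a flat bag of submits; if workers still show
> up late with no decomposition involved, the lag is in the threading.
> A crude sketch (names are mine):
>
>     import java.util.concurrent.*;
>
>     public class RampupProbe {
>         public static void main(String[] args) throws Exception {
>             int n = Runtime.getRuntime().availableProcessors();
>             ForkJoinPool pool = new ForkJoinPool(n);
>             ConcurrentHashMap<String, Long> firstSeen = new ConcurrentHashMap<>();
>             long t0 = System.nanoTime();
>             ForkJoinTask<?>[] tasks = new ForkJoinTask<?>[4 * n];
>             for (int i = 0; i < tasks.length; i++) {
>                 tasks[i] = pool.submit(() -> {
>                     // record when this worker thread ran its first task
>                     firstSeen.putIfAbsent(Thread.currentThread().getName(),
>                                           System.nanoTime() - t0);
>                     try { Thread.sleep(1); } catch (InterruptedException e) { }
>                 });
>             }
>             for (ForkJoinTask<?> t : tasks) t.join();
>             for (java.util.Map.Entry<String, Long> e : firstSeen.entrySet())
>                 System.out.printf("%s first ran at +%dus%n",
>                                   e.getKey(), e.getValue() / 1000);
>             pool.shutdown();
>         }
>     }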
>
> Another thing is the interface between the submitter and the FJP. I
> vaguely recall infrastructure for allowing submitters to run tasks
> themselves in place, but how much effort would it take to get that to at
> least experimental readiness? (Also, I don't see how/whether
> CountedCompleters could interoperate with submitters in this case; is
> there any option to make the submitter the last completer?)
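>
> (For reference, the completion shape I mean, written against the
> jsr166e preview of CountedCompleter; class name and threshold are mine.
> Completions propagate up without blocking joins, and the open question
> is whether an external submitter could sit at the root of such a tree.)
>
>     import java.util.concurrent.*;   // jsr166e.* in the preview jar
>
>     class Sweep extends CountedCompleter<Void> {
>         static final int LEAF = 1_000;
>         final int lo, hi;
>         Sweep(CountedCompleter<?> parent, int lo, int hi) {
>             super(parent); this.lo = lo; this.hi = hi;
>         }
>         public void compute() {
>             int l = lo, h = hi;
>             while (h - l > LEAF) {
>                 int mid = (l + h) >>> 1;
>                 addToPendingCount(1);          // one more child to wait for
>                 new Sweep(this, mid, h).fork();
>                 h = mid;                       // keep descending left
>             }
>             // leaf work on [l, h) goes here
>             tryComplete();                     // bubbles completion upward
>         }
>     }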
>
> Thanks,
> Aleksey.
>
> [1] http://mail.openjdk.java.net/pipermail/lambda-dev/2012-October/006150.html
> [2] https://github.com/shipilev/fjp-trace
> [3] http://shipilev.net/pub/jdk/lambda/20121003-fjpwakeup/
> [4] http://shipilev.net/pub/jdk/lambda/20121003-fjpwakeup/forkjoin.trace.p24-subtrees.png