Parallel decompositions, C/P/N/Q experiment, take 2

Mon Sep 24 06:26:47 PDT 2012

Hi,

This is the second take on the experiment I did a couple of months ago
[1]. tl;dr version: a hand-crafted generator for longs in (0; N], piped
through a simple filter (with variable cost Q) into an empty sink, called
by C clients, with stream operations serviced by an (fj)pool of size P.
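
For readers who want a feel for the shape of the workload, here is a
minimal sketch of the setup described above. It is not the benchmark code
itself: it assumes the final java.util.stream API (LongStream) rather than
the prototype API in the nightly build, and the names CpnqSketch,
costlyFilter, pipeline and run are made up for illustration. The filter
body is an arbitrary stand-in for "cost Q", and count() stands in for the
empty sink so that the result is observable. Running a parallel stream
from inside a task submitted to a ForkJoinPool makes it use that pool in
current JDKs, but that is an implementation detail, not a documented
guarantee.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.Future;
import java.util.stream.LongStream;

public class CpnqSketch {

    // Hypothetical filter with tunable cost: q rounds of integer mixing per element.
    static boolean costlyFilter(long x, int q) {
        long h = x;
        for (int i = 0; i < q; i++) {
            h = h * 6364136223846793005L + 1442695040888963407L;
        }
        return (h & 1) == 0; // the result depends on h, so the loop is not dead code
    }

    // Longs in (0; N] -> filter of cost Q -> terminal op (count() instead of an empty sink).
    static long pipeline(long n, int q, boolean parallel) {
        LongStream s = LongStream.rangeClosed(1, n);
        if (parallel) {
            s = s.parallel();
        }
        return s.filter(x -> costlyFilter(x, q)).count();
    }

    // C client threads each run the pipeline once. In parallel mode the work is
    // handed to a shared ForkJoinPool of size P (assumption: see the note above).
    // Returns the total number of elements that passed the filter, over all clients.
    static long run(int c, int p, long n, int q, boolean parallel) {
        ExecutorService clients = Executors.newFixedThreadPool(c);
        ForkJoinPool pool = new ForkJoinPool(p);
        try {
            Callable<Long> work = () -> pipeline(n, q, parallel);
            Callable<Long> client;
            if (parallel) {
                client = () -> pool.submit(work).get();
            } else {
                client = work;
            }
            List<Future<Long>> pending = new ArrayList<>();
            for (int i = 0; i < c; i++) {
                pending.add(clients.submit(client));
            }
            long total = 0;
            for (Future<Long> f : pending) {
                total += f.get();
            }
            return total;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            clients.shutdown();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int c = 1, p = 4, q = 100;
        long n = 10_000;
        for (boolean parallel : new boolean[] {false, true}) {
            long start = System.nanoTime();
            long passed = run(c, p, n, q, parallel);
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.println((parallel ? "parallel:   " : "sequential: ")
                    + passed + " passed, " + micros + " us");
        }
    }
}
```

The main method times a single cold run per mode, which is only good for
a smoke test; any real comparison needs warmup and repeated measurement.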

The question to answer: how would performance change with juggling
C/P/N/Q, in both sequential and parallel modes?

Running it on a 4x10x2 Nehalem-EX 2.27 GHz, RHEL 5, JDK8 x86_64
20120920-default-nightly, with -XX:-TieredCompilation
-XX:+UseParallelOldGC -XX:+UseNUMA -XX:-UseBiasedLocking
-XX:+UseCondCardMark, produced these results [2], which I wrapped into
a single report [3].

Some observations:
 * We have good improvements with large enough N and Q, and the
throughput predictably grows with growing N or Q.
 * In many cases, a larger P for the same N and Q yields lower
performance. See, for example (as C/P/N/Q), 1/10/10000/100 = 6.8x vs.
1/80/10000/100 = 2.8x; or even worse, 1/10/1000/100 = 2.28x vs.
1/80/1000/100 = 0.28x.
 * Going for the parallel versions is a tremendous disadvantage on
low N, and this is exacerbated very seriously when dealing with low Q.

Comments, observations, suggestions are welcome.

Thanks,
Aleksey.

[1] http://mail.openjdk.java.net/pipermail/lambda-dev/2012-July/005333.html
[2] http://shipilev.net/pub/jdk/lambda/bulk-fuzzy-3/
[3] http://shipilev.net/pub/jdk/lambda/bulk-fuzzy-3/bulk-fuzzy-3.pdf