Parallel decompositions, C/P/N/Q experiment, take 2

Aleksey Shipilev aleksey.shipilev at
Mon Sep 24 06:26:47 PDT 2012


This is the second take on the experiment I did a couple of months ago
[1]. tl;dr version: a hand-crafted generator for longs in
(0; N], a simple filter (with variable cost Q) feeding an empty sink,
called by C clients, with stream operations serviced by an FJ pool of
size P.
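For readers who want the shape of the setup in code, here is a minimal sketch written against today's java.util.stream API rather than the 2012 nightly. It is not the harness that produced the results. Several pieces are assumptions: LongStream.rangeClosed stands in for the hand-crafted generator, costlyFilter is a hypothetical busy-work filter whose cost scales with Q, count() replaces the empty sink so the result can be checked, and running a parallel stream inside a custom ForkJoinPool to get P workers relies on observed implementation behavior, not a documented guarantee.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.Future;
import java.util.stream.LongStream;

public class CPNQSketch {

    // Hypothetical stand-in for the filter of cost Q: Q rounds of LCG
    // arithmetic per element, then keep elements whose final value is even.
    static boolean costlyFilter(long v, int q) {
        long acc = v;
        for (int i = 0; i < q; i++) {
            acc = acc * 6364136223846793005L + 1442695040888963407L;
        }
        return (acc & 1L) == 0;
    }

    // One client operation: generate longs in (0; N], filter, drain into a count.
    // count() replaces the empty sink so the result is observable.
    static long runOnce(long n, int q, boolean parallel, ForkJoinPool pool)
            throws InterruptedException, ExecutionException {
        if (!parallel) {
            return LongStream.rangeClosed(1, n).filter(v -> costlyFilter(v, q)).count();
        }
        // Assumption: a parallel stream started from a task running in `pool`
        // forks its subtasks into that pool (implementation behavior, not spec).
        return pool.submit(() -> LongStream.rangeClosed(1, n).parallel()
                .filter(v -> costlyFilter(v, q)).count()).get();
    }

    // C client threads share one ForkJoinPool of size P; each runs the pipeline once.
    static long[] runClients(int c, int p, long n, int q, boolean parallel) {
        ForkJoinPool pool = new ForkJoinPool(p);
        ExecutorService clients = Executors.newFixedThreadPool(c);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (int i = 0; i < c; i++) {
                futures.add(clients.submit(() -> runOnce(n, q, parallel, pool)));
            }
            long[] counts = new long[c];
            for (int i = 0; i < c; i++) {
                counts[i] = futures.get(i).get();
            }
            return counts;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            clients.shutdown();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Same N and Q, two pool sizes, sequential vs parallel.
        for (int p : new int[] {10, 80}) {
            for (boolean par : new boolean[] {false, true}) {
                long[] counts = runClients(1, p, 10_000, 100, par);
                System.out.printf("C=1 P=%d N=10000 Q=100 %s: count=%d%n",
                        p, par ? "par" : "seq", counts[0]);
            }
        }
    }
}
```

This only shows how the four knobs map onto the pipeline; actual throughput numbers need a proper harness with warmup (for example JMH), and a pool should not be created per measured operation as runClients does here.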

The question to answer: how does performance change as we vary
C/P/N/Q, in both sequential and parallel modes?

Running it on a 4x10x2 Nehalem-EX 2.27 GHz, RHEL 5, JDK8 x86_64
20120920-default-nightly, with -XX:-TieredCompilation
-XX:+UseParallelOldGC -XX:+UseNUMA -XX:-UseBiasedLocking
-XX:+UseCondCardMark, produced these results [2], which I wrapped into
a single report [3].

Some observations:
 * We see good speedups with large enough N and Q, and the
throughput grows predictably with growing N or Q.
 * In many cases, a larger P for the same N and Q yields lower
performance. For example (C/P/N/Q = speedup), 1/10/10000/100 = 6.8x
vs. 1/80/10000/100 = 2.8x; or, even worse, 1/10/1000/100 = 2.28x vs.
1/80/1000/100 = 0.28x.
 * Going parallel carries a tremendous penalty at low N, and this
is seriously exacerbated at low Q.

Comments, observations, suggestions are welcome.


