Bulk operations performance: juggling external conditions

Aleksey Shipilev aleksey.shipilev at oracle.com
Tue Jul 31 08:29:52 PDT 2012


Hi,

I am doing my first experiments on bulk operations performance, and here
are the preliminary results. These are not yet complete and fully
understood, but they might be interesting to chew on in the meantime.

I. THE GOAL

In short, I was trying to understand the behavior of the current bulk
implementation as the problem size (N), pool parallelism (P), operation
cost (Q), and number of external client threads requesting the bulk
operation (C) change. Not all of these are settable or even measurable
in real-world use cases, but understanding how these parameters affect
the given metrics is important for building intuition and some models
around the code we have.

II. THE EXPERIMENT

I cannot yet publish the complete source code, but this should give
you the idea behind the experiment. The hardest part of this endeavor
is to construct a task whose cost grows linearly with Q. After some
juggling, I've settled on this:

    public static boolean doWork(long input, int Q) {
        long t = input;
        ThreadLocalRandom random = ThreadLocalRandom.current();
        for (int i = 0; i < Q; i++) {
            t += random.nextInt();
        }
        return (random.nextInt(0x27072012) == 0) && (t > 42);
    }

This has quite a few advantages. First, since TLR's seed is unknown
before the actual run, there is no possibility to pre-optimize. Second,
loop unrolling effects are mitigated by hard dependencies on TLR's
state. Third, the result of the loop is vaguely used, to limit DCE.
Fourth, the virtually impossible condition at the exit almost always
makes the result false. Fifth, it depends on $input, which prevents
further optimization.

This is handy, because we can then stick it into our microbenchmark like this:
        longRange.filter(k -> doWork(k, Q)).into(sink);

No matter how sequential $sink is, since the filter is almost always
false, no elements make it to the sink. The sink is an unsynchronized
long summator anyway. $longRange is a special Splittable which yields
Longs in the desired intervals. Having this around helps alleviate
cache effects depending on $N. The initial range is [0; $N).
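For concreteness, here is a minimal sketch of what such a driver could look like against the final java.util.stream API (the b48-era into()/Splittable API has since been replaced): LongStream.range stands in for $longRange, count() stands in for the sink, and running the pipeline from inside a dedicated ForkJoinPool is a stand-in for setting P. The class and method names here are mine, not from the actual harness:

    import java.util.concurrent.*;
    import java.util.stream.LongStream;

    public class BulkFilterBench {
        // The same Q-proportional workload as above.
        static boolean doWork(long input, int Q) {
            long t = input;
            ThreadLocalRandom random = ThreadLocalRandom.current();
            for (int i = 0; i < Q; i++) {
                t += random.nextInt();
            }
            return (random.nextInt(0x27072012) == 0) && (t > 42);
        }

        // C client threads each run the bulk filter once inside a shared
        // ForkJoinPool of parallelism P; returns the total surviving count.
        static long runOnce(int C, int P, int Q, long N) throws Exception {
            ForkJoinPool pool = new ForkJoinPool(P);
            ExecutorService clients = Executors.newFixedThreadPool(C);
            try {
                // filter + count stands in for the filter + into(sink) pipeline.
                Callable<Long> bulk = () -> LongStream.range(0, N).parallel()
                                                      .filter(k -> doWork(k, Q))
                                                      .count();
                // Submitting the stream from inside the pool makes it use that
                // pool's parallelism (an unofficial but widely used trick).
                Callable<Long> client = () -> pool.submit(bulk).get();
                CompletionService<Long> cs = new ExecutorCompletionService<>(clients);
                for (int c = 0; c < C; c++) cs.submit(client);
                long total = 0;
                for (int c = 0; c < C; c++) total += cs.take().get();
                return total;
            } finally {
                clients.shutdown();
                pool.shutdown();
            }
        }

        public static void main(String[] args) throws Exception {
            // The filter passes ~1 in 0x27072012 elements, so this is
            // almost always 0.
            System.out.println(runOnce(4, 4, 100, 10_000));
        }
    }

A real run would of course time repeated invocations of runOnce to get throughput, rather than run the pipeline once.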

III. SAMPLE RESULT

Now, we can run this test juggling N/P/Q/C on a fairly decent machine.
In my case, it is a 4x10x2 Nehalem-EX at 2.27 GHz, RHEL 5, JDK8 x86_64
8-lambda-b48.

The key points I've chosen for this trial are the following:
  P = {1, 2, 10, 20, 40, 80} // up to package, up to all hyperthreads
  C = {1, 2, 10, 20, 40, 80} // similar to $P
  Q = {1, 10, 100, 1000, 10000} // calibrated to 6*Q ns/op
  N = {1, 10, 100, 1000, 10000} // need to keep N*Q from getting too large
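As a sanity check on the 6*Q ns/op calibration, one can time doWork directly. This is only a rough standalone sketch (the names are mine, and a real calibration would use a proper harness to handle warmup and deoptimization):

    import java.util.concurrent.ThreadLocalRandom;

    public class Calibrate {
        static boolean doWork(long input, int Q) {
            long t = input;
            ThreadLocalRandom random = ThreadLocalRandom.current();
            for (int i = 0; i < Q; i++) {
                t += random.nextInt();
            }
            return (random.nextInt(0x27072012) == 0) && (t > 42);
        }

        // Rough nanoseconds per doWork(_, Q) call, after one warmup pass.
        static double nanosPerOp(int Q, int iters) {
            boolean sink = false;
            for (int i = 0; i < iters; i++) sink |= doWork(i, Q); // warmup
            long start = System.nanoTime();
            for (int i = 0; i < iters; i++) sink |= doWork(i, Q);
            long elapsed = System.nanoTime() - start;
            if (sink) System.out.print(""); // keep the result alive
            return (double) elapsed / iters;
        }

        public static void main(String[] args) {
            for (int Q : new int[]{1, 10, 100, 1000}) {
                System.out.printf("Q = %4d: %.1f ns/op%n",
                                  Q, nanosPerOp(Q, 100_000));
            }
        }
    }

The exact ns/op constant is hardware-dependent; the point is only that the cost scales linearly in Q.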

I had also gathered sar statistics for each of these points, because I
was interested in the CPU utilization there.

This experiment ends up with 5D data: (P, C, Q, N) -> (Throughput | SAR
data), which makes it tricky to analyze and understand. As usual, my
take on that is an R script [1], which generates tons of charts [2].
The numerical data is available as well [3].

Living in a 3D world with 2D monitors, we can only have three dimensions
per chart before our heads explode. The other two dimensions are sliced:
you might notice the green and pink bars over each panel on the page --
these are the free variables, and you can see the actual value for the
panel as the tick there. There are multiple pages trying to look at the
data from different angles. If you have an idea how to make it more
understandable, please don't hesitate to drop me a message.

There are a couple of peculiar things I already see on these charts, but
I'll let everyone else look for them on their own for the moment :)
Sergey and I were staring at this for tens of minutes before our "aha"
moments.

Thanks for reading this overly technical note,
-Aleksey.

[1] http://shipilev.net/pub/jdk/lambda/bulk-fuzzy-1/plot.r
[2] http://shipilev.net/pub/jdk/lambda/bulk-fuzzy-1/plots.pdf
[3] http://shipilev.net/pub/jdk/lambda/bulk-fuzzy-1/*.data
