Need some help for a book intro to JMH

Fri Aug 12 21:46:03 UTC 2016

Hi Bruce,

On 08/12/2016 08:33 PM, Bruce Eckel wrote:
> So my questions are:
> 1. Is Arrays.setAll() vs Arrays.parallelSetAll() just too tricky and subtle
> as an example, or are there JMH annotations I can add to fix this?

Yes, too tricky, see below.

> 2. Is there a better introductory example I should be using? (Ideally such
> an example would also show problems when using simple timing).

We have been trying (and humbly succeeding) to teach users the basic
methodology, what to expect from a bad benchmark, and how JMH can help
via the runnable JMH Samples:
 http://hg.openjdk.java.net/code-tools/jmh/file/tip/jmh-samples/src/main/java/org/openjdk/jmh/samples/

I don't think there is a single good example that drives the point home.

> Here is the introductory section (so far), including test results:

I can proof-read this more carefully in a private conversation, if you
want.

> ```java
> // verifying/jmhtests/ParallelSetAll.java
> package verifying.jmhtests;
> import java.util.*;
> import org.openjdk.jmh.annotations.*;
> 
> @State(Scope.Thread)
> public class ParallelSetAll {
>   private long[] la;
>   @Setup
>   public void setup() {
>     la = new long[20_000_000];
>   }
>   @Benchmark
>   public void setAll() {
>     Arrays.setAll(la, n -> n);
>   }
>   @Benchmark
>   public void parallelSetAll() {
>     Arrays.parallelSetAll(la, n -> n);
>   }
> }
> ```

The benchmark code looks okay, but this is not the complete benchmarking
story for parallel workloads.

The problem with these one-off tests is that they may exercise a
particular sweet spot on a particular machine. It's weird that people
try to collect more data by running on different machines, when much
more interesting degrees of freedom are available:
 a) C: the number of *client* threads that do ops;
 b) P: the amount of parallelism used by parallel algo;
 c) N: the size of the array: 10^(2*k), where k=1..7 is usually okay to
exercise different cache footprints;
 d) Q: the cost of the setter op;

This C/P/N/Q model surfaced during early JDK 8 Lambda development, and
we saw most parallel Streams operations (and parallelSetAll is very
similar to that) agree with these:
 a) N*Q (basically, the amount of work) is critical to parallel
performance. If there is less than N*Q work, parallel algo may work slower.
 b) There are cases where op is so contended, there is no help from
parallelism at all, no matter how large N*Q is.
 c) When C is high, P is much less relevant (IOW, a lot of external
parallelism makes internal parallelism redundant). Moreover, in some
cases, the cost the parallel decomposition makes C clients running a
parallel algo run slower than the same C clients running sequential code.

Maurice Naftalin did a nice section in his "Mastering Lambdas: Java
Programming in a Multicore World". Doug Lea has a nice guidance on when
to use parallel streams too:
http://gee.cs.oswego.edu/dl/html/StreamParallelGuidance.html.

So, if you don't want to explain and follow up on those bits while
benchmarking parallelSetAll, I would suggest to move on to something
less parallel :)

But if you do want to go this particular rabbit hole, let's say we juggle N:

    @Param({"1", "100", "10000", "1000000", "20000000"})
    int count;

    @Setup
    public void setup() {
        la = new long[count];
    }

On my 4-core i7-4790K, JDK 8u101, Linux x86_64 (time/op, the lower the
better):

  Benchmark        (count)  Mode  Cnt      Score     Error  Units

  parallelSetAll         1  avgt    5      0.035 ±   0.001  us/op
  parallelSetAll        10  avgt    5      7.219 ±   0.119  us/op
  parallelSetAll       100  avgt    5      4.656 ±   0.052  us/op
  parallelSetAll      1000  avgt    5      4.368 ±   0.112  us/op
  parallelSetAll     10000  avgt    5      9.109 ±   0.141  us/op
  parallelSetAll    100000  avgt    5     21.096 ±   0.243  us/op
  parallelSetAll   1000000  avgt    5    211.409 ±  49.143  us/op
  parallelSetAll  20000000  avgt    5  15069.037 ± 301.859  us/op

  setAll                 1  avgt    5      0.001 ±   0.001  us/op
  setAll                10  avgt    5      0.005 ±   0.001  us/op
  setAll               100  avgt    5      0.031 ±   0.001  us/op
  setAll              1000  avgt    5      0.304 ±   0.001  us/op
  setAll             10000  avgt    5      3.167 ±   0.069  us/op
  setAll            100000  avgt    5     34.891 ±   0.067  us/op
  setAll           1000000  avgt    5    433.957 ±   1.957  us/op
  setAll          20000000  avgt    5  15043.885 ±  71.861  us/op

Most tests agree with the observations from above: low N means low N*Q,
means less opportunity for parallel; and in reverse, high N means high
N*Q, means parallel opportunities. For this case, break-even happens
somewhere within N of [10K; 100K].

But see how count=20_000_000 is the special snowflake here. My guess
that happens because both tests really bottleneck on LLC->memory
bandwidth at a very large N -- the workload bashes the system down with
never-ending barrage of writes! (-prof perfnorm corroborates that:
parallel has lots of cache misses, very high CPI, etc.)

Still does not explain the difference between Windows/Linux running on
the same machine though. Suspicion: Linux scheduler is better at ramping
up and scheduling parallel threads, so parallel algos run better there.
It would be more enlightening to juggle C and P, and introduce the time
delay between ops to cool down threads. This would help to poke into
scheduler's performance.

(I should probably turn this thread into another JMH Sample, eh...)

Thanks,
-Aleksey