Need some help for a book intro to JMH

Bruce Eckel bruceteckel at gmail.com
Sat Aug 13 21:07:00 UTC 2016


Aleksey, thanks very much for the detailed answer. Right now I'm still
inclined to use the example precisely because it shows the non-trivial
nature of benchmarking a task that looks like it ought to be trivial.

I will incorporate your feedback into the section and when I'm done I'll
send it to you for review. I really appreciate it and also all your work on
JMH. In the past I tried to create a simple benchmarking tool and messed it
up royally, so it's great to be using yours.


-- Bruce Eckel
www.MindviewInc.com
Blog: BruceEckel.github.io
www.WinterTechForum.com
www.AtomicScala.com
www.Reinventing-Business.com
www.TrustOrganizations.com


On Fri, Aug 12, 2016 at 3:46 PM, Aleksey Shipilev <
aleksey.shipilev at gmail.com> wrote:

> Hi Bruce,
>
> On 08/12/2016 08:33 PM, Bruce Eckel wrote:
> > So my questions are:
> > 1. Is Arrays.setAll() vs Arrays.parallelSetAll() just too tricky and
> subtle
> > as an example, or are there JMH annotations I can add to fix this?
>
> Yes, too tricky, see below.
>
> > 2. Is there a better introductory example I should be using? (Ideally
> such
> > an example would also show problems when using simple timing).
>
> We have been trying (and humbly succeeding) to teach users the basic
> methodology, what to expect from a bad benchmark, and how JMH can help
> via the runnable JMH Samples:
>  http://hg.openjdk.java.net/code-tools/jmh/file/tip/jmh-samples/src/main/java/org/openjdk/jmh/samples/
>
> I don't think there is a single good example that drives the point home.
>
> > Here is the introductory section (so far), including test results:
>
> I can proof-read this more carefully in a private conversation, if you
> want.
>
>
> > ```java
> > // verifying/jmhtests/ParallelSetAll.java
> > package verifying.jmhtests;
> > import java.util.*;
> > import org.openjdk.jmh.annotations.*;
> >
> > @State(Scope.Thread)
> > public class ParallelSetAll {
> >   private long[] la;
> >   @Setup
> >   public void setup() {
> >     la = new long[20_000_000];
> >   }
> >   @Benchmark
> >   public void setAll() {
> >     Arrays.setAll(la, n -> n);
> >   }
> >   @Benchmark
> >   public void parallelSetAll() {
> >     Arrays.parallelSetAll(la, n -> n);
> >   }
> > }
> > ```
>
> The benchmark code looks okay, but this is not the complete benchmarking
> story for parallel workloads.
>
> The problem with these one-off tests is that they may exercise a
> particular sweet spot on a particular machine. It's weird that people
> try to collect more data by running on different machines, when much
> more interesting degrees of freedom are available:
>  a) C: the number of *client* threads that do ops;
>  b) P: the amount of parallelism used by parallel algo;
>  c) N: the size of the array: 10^(2*k), where k=1..7 is usually okay to
> exercise different cache footprints;
>  d) Q: the cost of the setter op;
>
> This C/P/N/Q model surfaced during early JDK 8 Lambda development, and
> we saw most parallel Streams operations (and parallelSetAll is very
> similar to that) agree with these:
>  a) N*Q (basically, the total amount of work) is critical to parallel
> performance. If N*Q is too small, the parallel algo may run slower than
> the sequential one.
>  b) There are cases where the op is so contended that parallelism does
> not help at all, no matter how large N*Q is.
>  c) When C is high, P is much less relevant (IOW, a lot of external
> parallelism makes internal parallelism redundant). Moreover, in some
> cases, the cost of the parallel decomposition makes C clients running a
> parallel algo run slower than the same C clients running sequential code.
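>
> The C/P/N/Q model above can be sketched as a single benchmark. This is an
> illustration of mine, not code from this thread: N and Q become @Params
> (Q simulated with Blackhole.consumeCPU), C is JMH's thread count, and P
> is pinned from the command line via the common-pool parallelism property,
> since parallelSetAll runs on the common ForkJoinPool:
>
> ```java
> import java.util.Arrays;
> import org.openjdk.jmh.annotations.*;
> import org.openjdk.jmh.infra.Blackhole;
>
> @State(Scope.Thread)
> @Threads(2) // C: external parallelism -- concurrent benchmark threads
> public class SetAllCPNQ {
>   @Param({"1000", "100000", "10000000"}) // N: array size
>   int n;
>
>   @Param({"0", "10", "100"}) // Q: per-element cost, in consumeCPU tokens
>   long q;
>
>   private long[] la;
>
>   @Setup
>   public void setup() {
>     la = new long[n];
>   }
>
>   @Benchmark
>   public void setAll() {
>     Arrays.setAll(la, i -> { Blackhole.consumeCPU(q); return i; });
>   }
>
>   @Benchmark
>   public void parallelSetAll() {
>     Arrays.parallelSetAll(la, i -> { Blackhole.consumeCPU(q); return i; });
>   }
> }
> ```
>
> P would then be varied across runs, e.g. with
> -Djava.util.concurrent.ForkJoinPool.common.parallelism=1 (or 2, 4, ...).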
>
> Maurice Naftalin has a nice section on this in his "Mastering Lambdas:
> Java Programming in a Multicore World". Doug Lea also has nice guidance
> on when to use parallel streams:
> http://gee.cs.oswego.edu/dl/html/StreamParallelGuidance.html
>
> So, if you don't want to explain and follow up on those bits while
> benchmarking parallelSetAll, I would suggest moving on to something
> less parallel :)
>
> But if you do want to go down this particular rabbit hole, let's say we
> juggle N:
>
>     @Param({"1", "10", "100", "1000", "10000", "100000", "1000000", "20000000"})
>     int count;
>
>     @Setup
>     public void setup() {
>         la = new long[count];
>     }
>
> On my 4-core i7-4790K, JDK 8u101, Linux x86_64 (time/op, the lower the
> better):
>
>   Benchmark        (count)  Mode  Cnt      Score     Error  Units
>
>   parallelSetAll         1  avgt    5      0.035 ±   0.001  us/op
>   parallelSetAll        10  avgt    5      7.219 ±   0.119  us/op
>   parallelSetAll       100  avgt    5      4.656 ±   0.052  us/op
>   parallelSetAll      1000  avgt    5      4.368 ±   0.112  us/op
>   parallelSetAll     10000  avgt    5      9.109 ±   0.141  us/op
>   parallelSetAll    100000  avgt    5     21.096 ±   0.243  us/op
>   parallelSetAll   1000000  avgt    5    211.409 ±  49.143  us/op
>   parallelSetAll  20000000  avgt    5  15069.037 ± 301.859  us/op
>
>   setAll                 1  avgt    5      0.001 ±   0.001  us/op
>   setAll                10  avgt    5      0.005 ±   0.001  us/op
>   setAll               100  avgt    5      0.031 ±   0.001  us/op
>   setAll              1000  avgt    5      0.304 ±   0.001  us/op
>   setAll             10000  avgt    5      3.167 ±   0.069  us/op
>   setAll            100000  avgt    5     34.891 ±   0.067  us/op
>   setAll           1000000  avgt    5    433.957 ±   1.957  us/op
>   setAll          20000000  avgt    5  15043.885 ±  71.861  us/op
>
> Most tests agree with the observations above: low N means low N*Q and
> thus little opportunity for parallelism; conversely, high N means high
> N*Q and plenty of parallel opportunity. In this case, break-even happens
> somewhere within N in [10K; 100K].
>
> But see how count=20_000_000 is the special snowflake here. My guess is
> that this happens because both tests really bottleneck on LLC->memory
> bandwidth at very large N -- the workload bashes the system down with a
> never-ending barrage of writes! (-prof perfnorm corroborates that: the
> parallel version shows lots of cache misses, very high CPI, etc.)
>
> That still does not explain the difference between Windows/Linux running
> on the same machine, though. Suspicion: the Linux scheduler is better at
> ramping up and scheduling parallel threads, so parallel algos run better
> there. It would be more enlightening to juggle C and P, and to introduce
> a time delay between ops to cool down the threads. That would help poke
> at the scheduler's performance.
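>
> For juggling C and P specifically, a minimal sketch of mine (with a crude
> cool-down; the sleep obviously pollutes time/op, so the scores would only
> be useful for relative comparison across runs):
>
> ```java
> import java.util.Arrays;
> import java.util.concurrent.TimeUnit;
> import org.openjdk.jmh.annotations.*;
>
> @State(Scope.Thread)
> public class ParallelSetAllCP {
>   private long[] la;
>
>   @Setup
>   public void setup() {
>     la = new long[1_000_000];
>   }
>
>   @Benchmark
>   @Threads(4) // C: vary this across runs
>   public void parallelSetAll() throws InterruptedException {
>     Arrays.parallelSetAll(la, i -> i);
>     // Cool-down between ops: lets the common-pool threads park, so each
>     // op has to wake them again, exposing scheduler ramp-up costs.
>     TimeUnit.MILLISECONDS.sleep(1);
>   }
> }
> ```
>
> with P again varied across runs via
> -Djava.util.concurrent.ForkJoinPool.common.parallelism=...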
>
> (I should probably turn this thread into another JMH Sample, eh...)
>
> Thanks,
> -Aleksey
>
>
>


More information about the jmh-dev mailing list