Need some help for a book intro to JMH
Bruce Eckel
bruceteckel at gmail.com
Sat Aug 13 21:07:00 UTC 2016
Aleksey, thanks very much for the detailed answer. Right now I'm still
inclined to use the example precisely because it shows the non-trivial
nature of benchmarking a task that looks like it ought to be
trivial.
I will incorporate your feedback into the section and when I'm done I'll
send it to you for review. I really appreciate it and also all your work on
JMH. In the past I tried to create a simple benchmarking tool and messed it
up royally, so it's great to be using yours.
-- Bruce Eckel
www.MindviewInc.com
Blog: BruceEckel.github.io
www.WinterTechForum.com
www.AtomicScala.com
www.Reinventing-Business.com
www.TrustOrganizations.com
On Fri, Aug 12, 2016 at 3:46 PM, Aleksey Shipilev <aleksey.shipilev at gmail.com> wrote:
> Hi Bruce,
>
> On 08/12/2016 08:33 PM, Bruce Eckel wrote:
> > So my questions are:
> > 1. Is Arrays.setAll() vs Arrays.parallelSetAll() just too tricky and
> > subtle as an example, or are there JMH annotations I can add to fix this?
>
> Yes, too tricky, see below.
>
> > 2. Is there a better introductory example I should be using? (Ideally
> > such an example would also show problems when using simple timing).
>
> We have been trying (and humbly succeeding) to teach users the basic
> methodology, what to expect from a bad benchmark, and how JMH can help
> via the runnable JMH Samples:
> http://hg.openjdk.java.net/code-tools/jmh/file/tip/jmh-samples/src/main/java/org/openjdk/jmh/samples/
>
> I don't think there is a single good example that drives the point home.
>
> > Here is the introductory section (so far), including test results:
>
> I can proof-read this more carefully in a private conversation, if you
> want.
>
>
> > ```java
> > // verifying/jmhtests/ParallelSetAll.java
> > package verifying.jmhtests;
> > import java.util.*;
> > import org.openjdk.jmh.annotations.*;
> >
> > @State(Scope.Thread)
> > public class ParallelSetAll {
> >     private long[] la;
> >
> >     @Setup
> >     public void setup() {
> >         la = new long[20_000_000];
> >     }
> >
> >     @Benchmark
> >     public void setAll() {
> >         Arrays.setAll(la, n -> n);
> >     }
> >
> >     @Benchmark
> >     public void parallelSetAll() {
> >         Arrays.parallelSetAll(la, n -> n);
> >     }
> > }
> > ```
>
> The benchmark code looks okay, but this is not the complete benchmarking
> story for parallel workloads.
>
> The problem with these one-off tests is that they may exercise a
> particular sweet spot on a particular machine. It's weird that people
> try to collect more data by running on different machines, when much
> more interesting degrees of freedom are available:
> a) C: the number of *client* threads that do ops;
> b) P: the amount of parallelism used by parallel algo;
> c) N: the size of the array: 10^k, where k = 0..7, is usually okay to
> exercise different cache footprints;
> d) Q: the cost of the setter op;
>
> This C/P/N/Q model surfaced during early JDK 8 Lambda development, and
> we saw most parallel Streams operations (and parallelSetAll is very
> similar to that) agree with these:
> a) N*Q (basically, the total amount of work) is critical to parallel
> performance. If N*Q is too small, the parallel algo may run slower than
> the sequential one.
> b) There are cases where the op is so contended that parallelism does
> not help at all, no matter how large N*Q is.
> c) When C is high, P is much less relevant (IOW, a lot of external
> parallelism makes internal parallelism redundant). Moreover, in some
> cases, the cost of the parallel decomposition makes C clients running a
> parallel algo slower than the same C clients running sequential code.
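> These knobs can be sketched outside JMH, too. Below is a minimal,
> hypothetical illustration of the Q knob (per-element cost simulated by
> a spin loop; the names SetAllDemo/costlyOp/fill are made up for this
> sketch), plus a sanity check that both variants compute the same array:

```java
import java.util.Arrays;

public class SetAllDemo {
    // Q: simulated per-element cost (a stand-in for "the cost of the
    // setter op"; the spin loop is a hypothetical knob, not a JMH API)
    static long costlyOp(int n, int q) {
        long x = n;
        for (int i = 0; i < q; i++) {
            x += (x << 1) ^ i;  // deterministic busy work
        }
        return x;
    }

    // Fill an array of size n with per-element cost q, either
    // sequentially or in parallel (parallelSetAll uses the
    // ForkJoinPool common pool under the hood)
    static long[] fill(int n, int q, boolean parallel) {
        long[] a = new long[n];
        if (parallel) {
            Arrays.parallelSetAll(a, i -> costlyOp(i, q));
        } else {
            Arrays.setAll(a, i -> costlyOp(i, q));
        }
        return a;
    }

    public static void main(String[] args) {
        // Both variants must agree, whatever N and Q are
        long[] seq = fill(10_000, 10, false);
        long[] par = fill(10_000, 10, true);
        System.out.println(Arrays.equals(seq, par)); // true
    }
}
```

> (Timing such a loop by hand falls into exactly the traps JMH exists to
> avoid, so this is for correctness intuition only, not measurement.)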
>
> Maurice Naftalin has a nice section on this in his "Mastering Lambdas:
> Java Programming in a Multicore World". Doug Lea also has good guidance
> on when to use parallel streams:
> http://gee.cs.oswego.edu/dl/html/StreamParallelGuidance.html
>
> So, if you don't want to explain and follow up on those bits while
> benchmarking parallelSetAll, I would suggest moving on to something
> less parallel :)
>
> But if you do want to go down this particular rabbit hole, let's say we
> juggle N:
>
> @Param({"1", "10", "100", "1000", "10000", "100000", "1000000",
>         "20000000"})
> int count;
>
> @Setup
> public void setup() {
>     la = new long[count];
> }
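>
> To also sweep P -- the parallelism of the ForkJoinPool common pool that
> parallelSetAll runs on -- one option (a sketch, untested here) is to
> pin it per-fork via a JVM flag:
>
> @Fork(jvmArgsAppend =
>     "-Djava.util.concurrent.ForkJoinPool.common.parallelism=2")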
>
> On my 4-core i7-4790K, JDK 8u101, Linux x86_64 (time/op, the lower the
> better):
>
> Benchmark       (count)  Mode  Cnt      Score     Error  Units
>
> parallelSetAll        1  avgt    5      0.035 ±   0.001  us/op
> parallelSetAll       10  avgt    5      7.219 ±   0.119  us/op
> parallelSetAll      100  avgt    5      4.656 ±   0.052  us/op
> parallelSetAll     1000  avgt    5      4.368 ±   0.112  us/op
> parallelSetAll    10000  avgt    5      9.109 ±   0.141  us/op
> parallelSetAll   100000  avgt    5     21.096 ±   0.243  us/op
> parallelSetAll  1000000  avgt    5    211.409 ±  49.143  us/op
> parallelSetAll 20000000  avgt    5  15069.037 ± 301.859  us/op
>
> setAll                1  avgt    5      0.001 ±   0.001  us/op
> setAll               10  avgt    5      0.005 ±   0.001  us/op
> setAll              100  avgt    5      0.031 ±   0.001  us/op
> setAll             1000  avgt    5      0.304 ±   0.001  us/op
> setAll            10000  avgt    5      3.167 ±   0.069  us/op
> setAll           100000  avgt    5     34.891 ±   0.067  us/op
> setAll          1000000  avgt    5    433.957 ±   1.957  us/op
> setAll         20000000  avgt    5  15043.885 ±  71.861  us/op
>
> Most results agree with the observations above: low N means low N*Q,
> which means less opportunity for parallelism; conversely, high N means
> high N*Q, which means more. For this case, break-even happens somewhere
> within N of [10K; 100K].
>
> But see how count=20_000_000 is the special snowflake here. My guess is
> that this happens because both tests bottleneck on LLC->memory
> bandwidth at a very large N -- the workload bashes the system with a
> never-ending barrage of writes! (-prof perfnorm corroborates this:
> parallel has lots of cache misses, very high CPI, etc.)
>
> This still does not explain the difference between Windows and Linux
> running on the same machine, though. Suspicion: the Linux scheduler is
> better at ramping up and scheduling parallel threads, so parallel algos
> run better there. It would be more enlightening to juggle C and P, and
> to introduce a time delay between ops to cool down the threads. This
> would help to poke at the scheduler's performance.
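>
> For the C knob, JMH has a direct control (a sketch): annotate the
> benchmark with @Threads, or pass -t on the command line:
>
> @Threads(4) // C: four client threads hammering the op concurrently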
>
> (I should probably turn this thread into another JMH Sample, eh...)
>
> Thanks,
> -Aleksey
>
More information about the jmh-dev mailing list