Need some help for a book intro to JMH
Aleksey Shipilev
aleksey.shipilev at gmail.com
Fri Aug 12 21:46:03 UTC 2016
Hi Bruce,
On 08/12/2016 08:33 PM, Bruce Eckel wrote:
> So my questions are:
> 1. Is Arrays.setAll() vs Arrays.parallelSetAll() just too tricky and subtle
> as an example, or are there JMH annotations I can add to fix this?
Yes, too tricky, see below.
> 2. Is there a better introductory example I should be using? (Ideally such
> an example would also show problems when using simple timing).
We have been trying (and humbly succeeding) to teach users the basic
methodology, what to expect from a bad benchmark, and how JMH can help
via the runnable JMH Samples:
http://hg.openjdk.java.net/code-tools/jmh/file/tip/jmh-samples/src/main/java/org/openjdk/jmh/samples/
I don't think there is a single good example that drives the point home.
> Here is the introductory section (so far), including test results:
I can proof-read this more carefully in a private conversation, if you
want.
> ```java
> // verifying/jmhtests/ParallelSetAll.java
> package verifying.jmhtests;
> import java.util.*;
> import org.openjdk.jmh.annotations.*;
>
> @State(Scope.Thread)
> public class ParallelSetAll {
> private long[] la;
> @Setup
> public void setup() {
> la = new long[20_000_000];
> }
> @Benchmark
> public void setAll() {
> Arrays.setAll(la, n -> n);
> }
> @Benchmark
> public void parallelSetAll() {
> Arrays.parallelSetAll(la, n -> n);
> }
> }
> ```
The benchmark code looks okay, but this is not the complete benchmarking
story for parallel workloads.
The problem with these one-off tests is that they may exercise a
particular sweet spot on a particular machine. It's weird that people
try to collect more data by running on different machines, when much
more interesting degrees of freedom are available:
a) C: the number of *client* threads that do ops;
b) P: the amount of parallelism used by parallel algo;
c) N: the size of the array: 10^(2*k), where k=1..7 is usually okay to
exercise different cache footprints;
d) Q: the cost of the setter op;
This C/P/N/Q model surfaced during early JDK 8 Lambda development, and
we saw most parallel Streams operations (and parallelSetAll is very
similar to that) agree with these:
a) N*Q (basically, the amount of work) is critical to parallel
performance. If there is less than N*Q work, parallel algo may work slower.
b) There are cases where op is so contended, there is no help from
parallelism at all, no matter how large N*Q is.
c) When C is high, P is much less relevant (IOW, a lot of external
parallelism makes internal parallelism redundant). Moreover, in some
cases, the cost the parallel decomposition makes C clients running a
parallel algo run slower than the same C clients running sequential code.
Maurice Naftalin did a nice section in his "Mastering Lambdas: Java
Programming in a Multicore World". Doug Lea has a nice guidance on when
to use parallel streams too:
http://gee.cs.oswego.edu/dl/html/StreamParallelGuidance.html.
So, if you don't want to explain and follow up on those bits while
benchmarking parallelSetAll, I would suggest to move on to something
less parallel :)
But if you do want to go this particular rabbit hole, let's say we juggle N:
@Param({"1", "100", "10000", "1000000", "20000000"})
int count;
@Setup
public void setup() {
la = new long[count];
}
On my 4-core i7-4790K, JDK 8u101, Linux x86_64 (time/op, the lower the
better):
Benchmark (count) Mode Cnt Score Error Units
parallelSetAll 1 avgt 5 0.035 ± 0.001 us/op
parallelSetAll 10 avgt 5 7.219 ± 0.119 us/op
parallelSetAll 100 avgt 5 4.656 ± 0.052 us/op
parallelSetAll 1000 avgt 5 4.368 ± 0.112 us/op
parallelSetAll 10000 avgt 5 9.109 ± 0.141 us/op
parallelSetAll 100000 avgt 5 21.096 ± 0.243 us/op
parallelSetAll 1000000 avgt 5 211.409 ± 49.143 us/op
parallelSetAll 20000000 avgt 5 15069.037 ± 301.859 us/op
setAll 1 avgt 5 0.001 ± 0.001 us/op
setAll 10 avgt 5 0.005 ± 0.001 us/op
setAll 100 avgt 5 0.031 ± 0.001 us/op
setAll 1000 avgt 5 0.304 ± 0.001 us/op
setAll 10000 avgt 5 3.167 ± 0.069 us/op
setAll 100000 avgt 5 34.891 ± 0.067 us/op
setAll 1000000 avgt 5 433.957 ± 1.957 us/op
setAll 20000000 avgt 5 15043.885 ± 71.861 us/op
Most tests agree with the observations from above: low N means low N*Q,
means less opportunity for parallel; and in reverse, high N means high
N*Q, means parallel opportunities. For this case, break-even happens
somewhere within N of [10K; 100K].
But see how count=20_000_000 is the special snowflake here. My guess
that happens because both tests really bottleneck on LLC->memory
bandwidth at a very large N -- the workload bashes the system down with
never-ending barrage of writes! (-prof perfnorm corroborates that:
parallel has lots of cache misses, very high CPI, etc.)
Still does not explain the difference between Windows/Linux running on
the same machine though. Suspicion: Linux scheduler is better at ramping
up and scheduling parallel threads, so parallel algos run better there.
It would be more enlightening to juggle C and P, and introduce the time
delay between ops to cool down threads. This would help to poke into
scheduler's performance.
(I should probably turn this thread into another JMH Sample, eh...)
Thanks,
-Aleksey
More information about the jmh-dev
mailing list