Hi Ben,

Thanks. I anticipated a performance hit, but not necessarily a 10x one. Without looking at the generated code for the benchmark method it is hard to be sure [*], but I believe the fence is interfering with loop unrolling and/or vectorization. The comparative differences between byte and int may be related to vectorization (for byte there may be less, or more limited, support for vectorization).

How about we now try another experiment: comment out the @DontInline on the fence method and re-run the benchmarks. From Peter’s observations and Vladimir’s analysis we should be able to remove that annotation, or even, contrary to what we initially expected when adding this feature, change it to @ForceInline!

Thanks,
Paul.

[*] If you are running on Linux you can use the excellent JMH perfasm feature to dump the hot parts of HotSpot's generated code.
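For readers following the thread, the kind of fence being discussed can be sketched as below. This is only an illustration, not the actual JDK change: `FenceSketch`, `Holder`, `newFilledHolder` and `sumWithFence` are hypothetical names, and a heap buffer stands in for the direct buffer. The only real API used is `java.lang.ref.Reference.reachabilityFence` (Java 9+), called after every access inside the loop, which is the shape of code the benchmarks exercise.

```java
import java.lang.ref.Reference;
import java.nio.ByteBuffer;

public class FenceSketch {
    // Hypothetical owner object; in the real case this would be the
    // object whose reachability keeps native memory alive.
    public static final class Holder {
        final ByteBuffer data = ByteBuffer.allocate(16);
    }

    // Fill the holder's buffer with the bytes 0..15 using absolute puts.
    public static Holder newFilledHolder() {
        Holder h = new Holder();
        for (int i = 0; i < 16; i++) {
            h.data.put(i, (byte) i);
        }
        return h;
    }

    // Sum the first `count` bytes, fencing on the owner after each access.
    // The try/finally keeps the owner strongly reachable until the access
    // has completed, even if the access throws.
    public static int sumWithFence(Holder owner, int count) {
        ByteBuffer data = owner.data;
        int sum = 0;
        for (int i = 0; i < count; i++) {
            try {
                sum += data.get(i);
            } finally {
                Reference.reachabilityFence(owner);
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumWithFence(newFilledHolder(), 16));
    }
}
```

If the fence call is not inlined, each loop iteration contains an opaque call, which is one plausible reason the JIT would decline to unroll or vectorize such a loop.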
On Feb 8, 2018, at 8:22 AM, Ben Walsh <ben_walsh@uk.ibm.com> wrote:
Hi Paul,
Following up with the requested loop and vectorization benchmarks ...
(Do the vectorization benchmark results imply that the HotSpot compiler has been unable to perform the vectorization optimisation because of the presence of the reachabilityFence?)
-----------------------------------------------------------------------------------------------------------------------
Loop Benchmarking
---- ------------
package org.sample;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

import java.nio.ByteBuffer;

@State(Scope.Benchmark)
public class ByteBufferBenchmark {

    @Param({"1", "10", "100", "1000", "10000"})
    public int L;

    @State(Scope.Benchmark)
    public static class ByteBufferContainer {

        ByteBuffer bb;

        @Setup(Level.Invocation)
        public void initByteBuffer() {
            bb = ByteBuffer.allocateDirect(10000);
        }

        ByteBuffer getByteBuffer() {
            return bb;
        }
    }

    @Benchmark
    public ByteBuffer benchmark_byte_buffer_put(ByteBufferContainer bbC) {

        ByteBuffer bb = bbC.getByteBuffer();

        for (int i = 0; i < L; i++) {
            bb.put((byte)i);
        }

        return bb;
    }

}
Without Changes
Benchmark                                        (L)   Mode  Cnt         Score        Error  Units
ByteBufferBenchmark.benchmark_byte_buffer_put      1  thrpt  200  29303145.752 ± 635979.750  ops/s
ByteBufferBenchmark.benchmark_byte_buffer_put     10  thrpt  200  24260859.017 ± 528891.303  ops/s
ByteBufferBenchmark.benchmark_byte_buffer_put    100  thrpt  200   8512366.637 ± 136615.070  ops/s
ByteBufferBenchmark.benchmark_byte_buffer_put   1000  thrpt  200   1323756.037 ±  21485.369  ops/s
ByteBufferBenchmark.benchmark_byte_buffer_put  10000  thrpt  200    145965.305 ±   1301.469  ops/s
With Changes
Benchmark                                        (L)   Mode  Cnt         Score        Error  Units    Impact
ByteBufferBenchmark.benchmark_byte_buffer_put      1  thrpt  200  28893540.122 ± 754554.747  ops/s   -1.398%
ByteBufferBenchmark.benchmark_byte_buffer_put     10  thrpt  200  15317696.355 ± 231621.608  ops/s  -36.863%
ByteBufferBenchmark.benchmark_byte_buffer_put    100  thrpt  200   2546599.578 ±  32136.873  ops/s  -70.084%
ByteBufferBenchmark.benchmark_byte_buffer_put   1000  thrpt  200    288832.514 ±   3854.522  ops/s  -78.181%
ByteBufferBenchmark.benchmark_byte_buffer_put  10000  thrpt  200     29747.386 ±    214.831  ops/s  -79.620%
-----------------------------------------------------------------------------------------------------------------------
Vectorization Benchmarking
------------- ------------
package org.sample;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

import java.nio.ByteBuffer;

@State(Scope.Benchmark)
public class ByteBufferBenchmark {

    @Param({"1", "10", "100", "1000", "10000"})
    public int L;

    @State(Scope.Benchmark)
    public static class ByteBufferContainer {

        ByteBuffer bb;

        @Setup(Level.Invocation)
        public void initByteBuffer() {
            bb = ByteBuffer.allocateDirect(4 * 10000);

            for (int i = 0; i < 10000; i++) {
                bb.putInt(i);
            }
        }

        ByteBuffer getByteBuffer() {
            return bb;
        }

    }

    @Benchmark
    public int benchmark_byte_buffer_put(ByteBufferContainer bbC) {

        ByteBuffer bb = bbC.getByteBuffer();

        bb.position(0);

        int sum = 0;

        for (int i = 0; i < L; i++) {
            sum += bb.getInt();
        }

        return sum;

    }

}
Without Changes
Benchmark                                        (L)   Mode  Cnt         Score        Error  Units
ByteBufferBenchmark.benchmark_byte_buffer_put      1  thrpt  200  29677205.748 ± 544721.142  ops/s
ByteBufferBenchmark.benchmark_byte_buffer_put     10  thrpt  200  18219951.454 ± 320724.793  ops/s
ByteBufferBenchmark.benchmark_byte_buffer_put    100  thrpt  200   7767650.826 ± 121798.910  ops/s
ByteBufferBenchmark.benchmark_byte_buffer_put   1000  thrpt  200   1646075.010 ±   9804.499  ops/s
ByteBufferBenchmark.benchmark_byte_buffer_put  10000  thrpt  200    183489.418 ±   1355.967  ops/s
With Changes
Benchmark                                        (L)   Mode  Cnt         Score        Error  Units    Impact
ByteBufferBenchmark.benchmark_byte_buffer_put      1  thrpt  200  15230086.695 ± 390174.190  ops/s  -48.681%
ByteBufferBenchmark.benchmark_byte_buffer_put     10  thrpt  200   8126310.728 ± 123661.342  ops/s  -55.399%
ByteBufferBenchmark.benchmark_byte_buffer_put    100  thrpt  200   1582699.233 ±   7278.744  ops/s  -79.624%
ByteBufferBenchmark.benchmark_byte_buffer_put   1000  thrpt  200    179726.465 ±    802.333  ops/s  -89.082%
ByteBufferBenchmark.benchmark_byte_buffer_put  10000  thrpt  200     18327.049 ±      9.506  ops/s  -90.012%
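The Impact column appears to be the relative change in throughput between the "Without Changes" and "With Changes" scores. A minimal sketch of that arithmetic follows, assuming that definition; `ImpactCalc`, `impactPercent` and `slowdownFactor` are illustrative names, not part of JMH. It also shows how an impact of about -90% corresponds to the roughly 10x slowdown mentioned in the reply.

```java
import java.util.Locale;

public class ImpactCalc {
    // Relative throughput change in percent; negative means slower.
    public static double impactPercent(double baseline, double changed) {
        return (changed - baseline) / baseline * 100.0;
    }

    // How many times slower the changed build is than the baseline.
    public static double slowdownFactor(double baseline, double changed) {
        return baseline / changed;
    }

    public static void main(String[] args) {
        // Scores (ops/s) for L = 10000 from the vectorization benchmark.
        double without = 183489.418;
        double with = 18327.049;
        System.out.printf(Locale.ROOT, "impact:   %.3f%%%n", impactPercent(without, with));
        System.out.printf(Locale.ROOT, "slowdown: %.2fx%n", slowdownFactor(without, with));
    }
}
```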
NB : For reference - for this and previous benchmarking results ...
"Without Changes" and "With Changes" - java -version ...
openjdk version "10-internal" 2018-03-20
OpenJDK Runtime Environment (build 10-internal+0-adhoc.walshbp.jdk)
OpenJDK 64-Bit Server VM (build 10-internal+0-adhoc.walshbp.jdk, mixed mode)
-----------------------------------------------------------------------------------------------------------------------
Regards,
Ben Walsh