[vectorIntrinsics] RFR: 8244490: [vector] Move Vector API micro benchmarks under test/micro
Marcus G K Williams
mgkwill at openjdk.java.net
Tue Jun 8 16:02:34 UTC 2021
On Thu, 6 May 2021 17:54:37 GMT, Paul Sandoz <psandoz at openjdk.org> wrote:
>> Vector API micro benchmarks are currently located under test/jdk/jdk/incubator/vector/benchmark, which leaves them as effectively "dead" code with no way to build and run them.
>>
>> The proper location for micro benchmarks is actually the test/micro/ directory.
>> It would be nice to move the Vector API benchmarks there so they can be built automatically as part of the 'test-image' make target.
>> Once they are built they can be run as
>> make run-test TEST=micro:BENCHMARK_TEST_NAME
>
> I looked at this in more detail; there are some immediate issues, and I think an architectural issue with the split.
>
> The more immediate issues are:
>
> 1. Some benchmark tests fail to build via `make`. Some are related to bad ASCII characters in comments, some due to compiler warnings, and others due to external dependencies in the `pom.xml` (namely `org.junit.jupiter:junit-jupiter-api`). I think all those are fixable. (See https://openjdk.java.net/groups/build/doc/testing.html for configuring, compiling, and executing JMH tests.)
> 2. The structure under `test/micro` is misleading and does not follow the directory-to-package-name convention. This is also fixable. I think we need to remove the maven project and merge the benchmarks in under `test/micro/org/openjdk/bench/jdk/incubator/vector`, placing the non-generated benchmark files directly under that directory (and fixing the ones that fail to compile). Generated benchmarks could be placed under, say, `operation`.
>
> Initially I thought it was fine that the generation scripts were split between unit tests and performance tests, but architecturally I now think it is better to keep one set of bash scripts for generated code, maintained under the `vector` module. These are complex and splitting them will cause divergence. These scripts can generate the `operation` performance tests under the `test/micro` directory at the appropriate location.
Thanks again for your review @PaulSandoz. I set this aside until now due to priority work for the JDK 17 freeze.
Updated per review:
1. Reverted the original draft refactor, keeping the perf. test generation code in its original location.
2. Changed the location of generated perf. tests to `test/micro/org/openjdk/bench/jdk/incubator/vector/operation`.
3. Moved non-generated tests and their respective directories to `test/micro/org/openjdk/bench/jdk/incubator/vector/`, i.e.:
`test/micro/org/openjdk/bench/jdk/incubator/vector/bigdata`
`test/micro/org/openjdk/bench/jdk/incubator/vector/crypto`
`test/micro/org/openjdk/bench/jdk/incubator/vector/utf8`
4. Fixed broken tests; all tests now build.
4a. Fixed comments that used invalid characters.
4b. Refactored to remove the JUnit dependency (see the sketch after this list).
4c. Removed `static` from some `@Benchmark` methods to fix the `static method should be qualified by type name, VectorDistance, instead of by an expression` compile warning (also sketched below).
5. Updated the package declaration of all moved and generated files to `org.openjdk.bench.jdk.incubator.vector.<dir name>`.
6. Removed pom.xml and other files that are no longer needed.
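To make 4b and 4c concrete, here is a minimal sketch of the two refactors; the class name, fields, and method bodies are illustrative, not the actual benchmark sources:

// Sketch only: illustrative names, not the actual benchmark code.
package org.openjdk.bench.jdk.incubator.vector.bigdata;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class RefactorSketch {

    float[] a = new float[1024];
    float[] b = new float[1024];

    // 4c: declared as an instance method rather than `public static float ...`.
    // The JMH-generated stub invokes benchmark methods on a state instance, so a
    // static @Benchmark method trips the `static method should be qualified by
    // type name ...` warning, and the build fails on compiler warnings.
    @Benchmark
    public float dotProduct() {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // 4b: a local check replaces org.junit.jupiter.api.Assertions, so the
    // benchmarks no longer need the junit-jupiter-api dependency from the
    // removed pom.xml.
    static void checkEquals(float expected, float actual, float eps) {
        if (Math.abs(expected - actual) > eps) {
            throw new RuntimeException("expected " + expected + " but got " + actual);
        }
    }
}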
I can now build and run all tests using:
`make test TEST="micro:org.openjdk.bench.jdk.incubator.vector" MICRO="FORK=2;WARMUP_ITER=5;" CONF=linux-x86_64-server-release`
# Run complete. Total time: 04:40:06
Benchmark (ARRAY_LENGTH) (dataSize) (maxBytes) (size) Mode Cnt Score Error Units
o.o.b.j.i.v.bigdata.BooleanArrayCheck.filterAll 1024 N/A N/A N/A thrpt 5 3884.266 ? 110.491 ops/ms
o.o.b.j.i.v.bigdata.BooleanArrayCheck.filterAll_vec 1024 N/A N/A N/A thrpt 5 16060.256 ? 5485.055 ops/ms
o.o.b.j.i.v.bigdata.ValueRangeCheckAndCastL2I.castL2I 1024 N/A N/A N/A thrpt 5 1470.185 ? 87.374 ops/ms
o.o.b.j.i.v.bigdata.ValueRangeCheckAndCastL2I.castL2I_vec 1024 N/A N/A N/A thrpt 5 1725.998 ? 763.099 ops/ms
o.o.b.j.i.v.bigdata.VectorDistance.cosinesimilOptimizedScalarDouble N/A N/A N/A N/A thrpt 5 3466.435 ? 61.524 ops/ms
o.o.b.j.i.v.bigdata.VectorDistance.cosinesimilOptimizedScalarFloat N/A N/A N/A N/A thrpt 5 2949.337 ? 167.843 ops/ms
o.o.b.j.i.v.bigdata.VectorDistance.cosinesimilOptimizedVectorDouble128 N/A N/A N/A N/A thrpt 5 3831.298 ? 670.538 ops/ms
o.o.b.j.i.v.bigdata.VectorDistance.cosinesimilOptimizedVectorDouble256 N/A N/A N/A N/A thrpt 5 6916.338 ? 1240.536 ops/ms
o.o.b.j.i.v.bigdata.VectorDistance.cosinesimilOptimizedVectorDoubleMax N/A N/A N/A N/A thrpt 5 6922.271 ? 1193.232 ops/ms
o.o.b.j.i.v.bigdata.VectorDistance.cosinesimilOptimizedVectorFloat128 N/A N/A N/A N/A thrpt 5 7667.293 ? 1627.940 ops/ms
o.o.b.j.i.v.bigdata.VectorDistance.cosinesimilOptimizedVectorFloat256 N/A N/A N/A N/A thrpt 5 12881.137 ? 2561.933 ops/ms
o.o.b.j.i.v.bigdata.VectorDistance.cosinesimilOptimizedVectorFloatMax N/A N/A N/A N/A thrpt 5 12870.951 ? 2536.147 ops/ms
o.o.b.j.i.v.bigdata.VectorDistance.cosinesimilScalarDouble N/A N/A N/A N/A thrpt 5 3357.433 ? 77.093 ops/ms
o.o.b.j.i.v.bigdata.VectorDistance.cosinesimilScalarFloat N/A N/A N/A N/A thrpt 5 2812.819 ? 321.380 ops/ms
o.o.b.j.i.v.bigdata.VectorDistance.cosinesimilVectorDouble128 N/A N/A N/A N/A thrpt 5 3834.837 ? 865.640 ops/ms
o.o.b.j.i.v.bigdata.VectorDistance.cosinesimilVectorDouble256 N/A N/A N/A N/A thrpt 5 6894.900 ? 1743.191 ops/ms
o.o.b.j.i.v.bigdata.VectorDistance.cosinesimilVectorDoubleMax N/A N/A N/A N/A thrpt 5 6906.721 ? 1718.390 ops/ms
o.o.b.j.i.v.bigdata.VectorDistance.cosinesimilVectorFloat128 N/A N/A N/A N/A thrpt 5 7443.532 ? 1897.726 ops/ms
o.o.b.j.i.v.bigdata.VectorDistance.cosinesimilVectorFloat256 N/A N/A N/A N/A thrpt 5 12136.630 ? 2997.062 ops/ms
o.o.b.j.i.v.bigdata.VectorDistance.cosinesimilVectorFloatMax N/A N/A N/A N/A thrpt 5 12155.037 ? 2943.135 ops/ms
o.o.b.j.i.v.bigdata.VectorDistance.l2SquaredScalar N/A N/A N/A N/A thrpt 5 3615.867 ? 188.036 ops/ms
o.o.b.j.i.v.bigdata.VectorDistance.l2SquaredVectorDouble128 N/A N/A N/A N/A thrpt 5 4036.363 ? 1023.672 ops/ms
o.o.b.j.i.v.bigdata.VectorDistance.l2SquaredVectorDouble256 N/A N/A N/A N/A thrpt 5 7766.613 ? 2090.510 ops/ms
o.o.b.j.i.v.bigdata.VectorDistance.l2SquaredVectorDoubleMax N/A N/A N/A N/A thrpt 5 7758.038 ? 2196.819 ops/ms
o.o.b.j.i.v.bigdata.VectorDistance.l2SquaredVectorFloat128 N/A N/A N/A N/A thrpt 5 8603.290 ? 2372.100 ops/ms
o.o.b.j.i.v.bigdata.VectorDistance.l2SquaredVectorFloat256 N/A N/A N/A N/A thrpt 5 17002.021 ? 5948.044 ops/ms
o.o.b.j.i.v.bigdata.VectorDistance.l2SquaredVectorFloatMax N/A N/A N/A N/A thrpt 5 17027.964 ? 5759.068 ops/ms
o.o.b.j.i.v.crypto.ChaChaBench.encrypt128 N/A 16384 N/A N/A thrpt 8 21730.541 ? 5043.050 ops/s
o.o.b.j.i.v.crypto.ChaChaBench.encrypt128 N/A 65536 N/A N/A thrpt 8 5496.309 ? 1245.802 ops/s
o.o.b.j.i.v.crypto.ChaChaBench.encrypt256 N/A 16384 N/A N/A thrpt 8 40332.908 ? 12166.064 ops/s
o.o.b.j.i.v.crypto.ChaChaBench.encrypt256 N/A 65536 N/A N/A thrpt 8 10304.809 ? 2968.075 ops/s
o.o.b.j.i.v.crypto.ChaChaBench.encrypt512 N/A 16384 N/A N/A thrpt 8 285.604 ? 82.992 ops/s
o.o.b.j.i.v.crypto.ChaChaBench.encrypt512 N/A 65536 N/A N/A thrpt 8 58.590 ? 16.202 ops/s
o.o.b.j.i.v.crypto.Poly1305Bench.auth128 N/A 16384 N/A N/A thrpt 8 201.353 ? 67.948 ops/s
o.o.b.j.i.v.crypto.Poly1305Bench.auth128 N/A 65536 N/A N/A thrpt 8 49.908 ? 15.856 ops/s
o.o.b.j.i.v.crypto.Poly1305Bench.auth256 N/A 16384 N/A N/A thrpt 8 364.051 ? 108.472 ops/s
o.o.b.j.i.v.crypto.Poly1305Bench.auth256 N/A 65536 N/A N/A thrpt 8 99.809 ? 33.255 ops/s
o.o.b.j.i.v.crypto.Poly1305Bench.auth512 N/A 16384 N/A N/A thrpt 8 472.031 ? 149.266 ops/s
o.o.b.j.i.v.crypto.Poly1305Bench.auth512 N/A 65536 N/A N/A thrpt 8 120.617 ? 35.399 ops/s
o.o.b.j.i.v.operation.Byte128Vector.ABS N/A N/A N/A 1024 thrpt 5 16498.980 ? 3687.013 ops/ms
o.o.b.j.i.v.operation.Byte128Vector.ABSMasked N/A N/A N/A 1024 thrpt 5 13420.584 ? 4553.559 ops/ms
o.o.b.j.i.v.operation.Byte128Vector.ADD N/A N/A N/A 1024 thrpt 5 14389.853 ? 2559.982 ops/ms
o.o.b.j.i.v.operation.Byte128Vector.ADDLanes N/A N/A N/A 1024 thrpt 5 5408.304 ? 1057.194 ops/ms
o.o.b.j.i.v.operation.Byte128Vector.ADDMasked N/A N/A N/A 1024 thrpt 5 12571.391 ? 3170.842 ops/ms
o.o.b.j.i.v.operation.Byte128Vector.ADDMaskedLanes N/A N/A N/A 1024 thrpt 5 4300.352 ? 911.484 ops/ms
o.o.b.j.i.v.operation.Byte128Vector.AND N/A N/A N/A 1024 thrpt 5 14197.257 ? 2830.585 ops/ms
o.o.b.j.i.v.operation.Byte128Vector.ANDLanes N/A N/A N/A 1024 thrpt 5 5437.373 ? 607.144 ops/ms
o.o.b.j.i.v.operation.Byte128Vector.ANDMasked N/A N/A N/A 1024 thrpt 5 12535.445 ? 3085.079 ops/ms
o.o.b.j.i.v.operation.Byte128Vector.ANDMaskedLanes N/A N/A N/A 1024 thrpt 5 4251.096 ? 1278.371 ops/ms
o.o.b.j.i.v.operation.Byte128Vector.AND_NOT N/A N/A N/A 1024 thrpt 5 11770.249 ? 2634.626 ops/ms
o.o.b.j.i.v.operation.Byte128Vector.AND_NOTMasked N/A N/A N/A 1024 thrpt 5 10638.213 ? 3156.234 ops/ms
o.o.b.j.i.v.operation.Byte128Vector.ASHR N/A N/A N/A 1024 thrpt 5 2698.630 ? 866.569 ops/ms
o.o.b.j.i.v.operation.Byte128Vector.ASHRMasked N/A N/A N/A 1024 thrpt 5 2438.774 ? 716.473 ops/ms
o.o.b.j.i.v.operation.Byte128Vector.ASHRMaskedShift N/A N/A N/A 1024 thrpt 5 4884.817 ? 1014.613 ops/ms
o.o.b.j.i.v.operation.Byte128Vector.ASHRShift N/A N/A N/A 1024 thrpt 5 6436.859 ? 990.331 ops/ms
o.o.b.j.i.v.operation.Byte128Vector.BITWISE_BLEND N/A N/A N/A 1024 thrpt 5 11348.187 ? 3386.170 ops/ms
o.o.b.j.i.v.operation.Byte128Vector.BITWISE_BLENDMasked N/A N/A N/A 1024 thrpt 5 9926.890 ? 4263.756 ops/ms
o.o.b.j.i.v.operation.Byte128Vector.DIV N/A N/A N/A 1024 thrpt 5 169.706 ? 165.121 ops/ms
o.o.b.j.i.v.operation.Byte128Vector.DIVMasked N/A N/A N/A 1024 thrpt 5 97.473 ? 77.981 ops/ms
o.o.b.j.i.v.operation.Byte128Vector.EQ N/A N/A N/A 1024 thrpt 5 1486.251 ? 2582.369 ops/ms
o.o.b.j.i.v.operation.Byte128Vector.FIRST_NONZERO N/A N/A N/A 1024 thrpt 5 491.076 ? 582.534 ops/ms
o.o.b.j.i.v.operation.Byte128Vector.FIRST_NONZEROMasked N/A N/A N/A 1024 thrpt 5 396.484 ? 535.547 ops/ms
o.o.b.j.i.v.operation.Byte128Vector.GE N/A N/A N/A 1024 thrpt 5 1455.821 ? 2516.670 ops/ms
o.o.b.j.i.v.operation.Byte128Vector.GT N/A N/A N/A 1024 thrpt 5 1496.344 ? 2613.716 ops/ms
o.o.b.j.i.v.operation.Byte128Vector.IS_DEFAULT N/A N/A N/A 1024 thrpt 5 1447.973 ? 2375.492 ops/ms
o.o.b.j.i.v.operation.Byte128Vector.IS_NEGATIVE N/A N/A N/A 1024 thrpt 5 1429.487 ? 2438.604 ops/ms
...
o.o.b.j.i.v.operation.ShortScalar.scatter256 N/A N/A N/A 1024 thrpt 5 1348.385 ? 140.870 ops/ms
o.o.b.j.i.v.operation.ShortScalar.scatter512 N/A N/A N/A 1024 thrpt 5 1444.297 ? 101.135 ops/ms
o.o.b.j.i.v.operation.ShortScalar.scatterBase0 N/A N/A N/A 1024 thrpt 5 1682.814 ? 62.974 ops/ms
o.o.b.j.i.v.operation.ShortScalar.zero N/A N/A N/A 1024 thrpt 5 24059.679 ? 2452.079 ops/ms
o.o.b.j.i.v.operation.SortVector.sortVectorI128 N/A N/A N/A 64 thrpt 5 124.008 ? 167.291 ops/ms
o.o.b.j.i.v.operation.SortVector.sortVectorI128 N/A N/A N/A 1024 thrpt 5 8.291 ? 11.679 ops/ms
o.o.b.j.i.v.operation.SortVector.sortVectorI128 N/A N/A N/A 65536 thrpt 5 0.132 ? 0.194 ops/ms
o.o.b.j.i.v.operation.SortVector.sortVectorI256 N/A N/A N/A 64 thrpt 5 104.170 ? 149.013 ops/ms
o.o.b.j.i.v.operation.SortVector.sortVectorI256 N/A N/A N/A 1024 thrpt 5 5.604 ? 9.367 ops/ms
o.o.b.j.i.v.operation.SortVector.sortVectorI256 N/A N/A N/A 65536 thrpt 5 0.094 ? 0.132 ops/ms
o.o.b.j.i.v.operation.SortVector.sortVectorI512 N/A N/A N/A 64 thrpt 5 12.467 ? 14.546 ops/ms
o.o.b.j.i.v.operation.SortVector.sortVectorI512 N/A N/A N/A 1024 thrpt 5 0.781 ? 0.915 ops/ms
o.o.b.j.i.v.operation.SortVector.sortVectorI512 N/A N/A N/A 65536 thrpt 5 0.012 ? 0.014 ops/ms
o.o.b.j.i.v.operation.SortVector.sortVectorI64 N/A N/A N/A 64 thrpt 5 41.097 ? 48.712 ops/ms
o.o.b.j.i.v.operation.SortVector.sortVectorI64 N/A N/A N/A 1024 thrpt 5 2.555 ? 3.112 ops/ms
o.o.b.j.i.v.operation.SortVector.sortVectorI64 N/A N/A N/A 65536 thrpt 5 0.041 ? 0.048 ops/ms
o.o.b.j.i.v.operation.SumOfUnsignedBytes.scalar N/A N/A N/A 64 thrpt 5 34764.547 ? 2124.872 ops/ms
o.o.b.j.i.v.operation.SumOfUnsignedBytes.scalar N/A N/A N/A 1024 thrpt 5 2604.449 ? 63.086 ops/ms
o.o.b.j.i.v.operation.SumOfUnsignedBytes.scalar N/A N/A N/A 4096 thrpt 5 653.640 ? 24.208 ops/ms
o.o.b.j.i.v.operation.SumOfUnsignedBytes.vectorInt N/A N/A N/A 64 thrpt 5 68950.388 ? 27430.053 ops/ms
o.o.b.j.i.v.operation.SumOfUnsignedBytes.vectorInt N/A N/A N/A 1024 thrpt 5 15432.548 ? 7166.145 ops/ms
o.o.b.j.i.v.operation.SumOfUnsignedBytes.vectorInt N/A N/A N/A 4096 thrpt 5 4068.423 ? 1865.103 ops/ms
o.o.b.j.i.v.operation.SumOfUnsignedBytes.vectorShort N/A N/A N/A 64 thrpt 5 52088.764 ? 56731.840 ops/ms
o.o.b.j.i.v.operation.SumOfUnsignedBytes.vectorShort N/A N/A N/A 1024 thrpt 5 4369.735 ? 7629.283 ops/ms
o.o.b.j.i.v.operation.SumOfUnsignedBytes.vectorShort N/A N/A N/A 4096 thrpt 5 2692.982 ? 3791.073 ops/ms
o.o.b.j.i.v.utf8.DecodeBench.decodeScalar N/A 32768 1 N/A thrpt 8 54369.096 ? 1063.656 ops/s
o.o.b.j.i.v.utf8.DecodeBench.decodeScalar N/A 8388608 1 N/A thrpt 8 151.256 ? 6.312 ops/s
o.o.b.j.i.v.utf8.DecodeBench.decodeVector N/A 32768 1 N/A thrpt 8 6452.803 ? 2950.714 ops/s
o.o.b.j.i.v.utf8.DecodeBench.decodeVector N/A 8388608 1 N/A thrpt 8 24.918 ? 12.696 ops/s
o.o.b.j.i.v.utf8.DecodeBench.decodeVectorASCII N/A 32768 1 N/A thrpt 8 261340.791 ? 102913.900 ops/s
o.o.b.j.i.v.utf8.DecodeBench.decodeVectorASCII N/A 8388608 1 N/A thrpt 8 293.033 ? 74.431 ops/s
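For iterating on a single benchmark (e.g. while investigating the DecodeBench failure below), the same `micro:` selector should also accept a fully qualified class name rather than a package prefix, so something like this ought to work:
`make test TEST="micro:org.openjdk.bench.jdk.incubator.vector.utf8.DecodeBench" MICRO="FORK=1;WARMUP_ITER=2" CONF=linux-x86_64-server-release`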
I am getting `java.lang.RuntimeException: Incorrect result` when running the DecodeBench.java test:
java.lang.RuntimeException: Incorrect result
at org.openjdk.bench.jdk.incubator.vector.utf8.DecodeBench.tearDownInvocation(DecodeBench.java:204)
at org.openjdk.bench.jdk.incubator.vector.utf8.jmh_generated.DecodeBench_decodeVectorASCII_jmhTest.decodeVectorASCII_thrpt_jmhStub(DecodeBench_decodeVectorASCII_jmhTest.java:127)
at org.openjdk.bench.jdk.incubator.vector.utf8.jmh_generated.DecodeBench_decodeVectorASCII_jmhTest.decodeVectorASCII_Throughput(DecodeBench_decodeVectorASCII_jmhTest.java:85)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:453)
at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:437)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:831)
I'm looking into it; I'm not sure whether this is a hardware-capability issue, a test-design issue, or a bug, but it should not be failing: https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/jdk/jdk/incubator/vector/benchmark/src/main/java/benchmark/utf8/DecodeBench.java#L201
@TearDown(Level.Invocation)
public void tearDownInvocation() {
    out = new String(dst.array());
    if (!in.equals(out)) {
        System.out.println("in = (" + in.length() + ") \"" + arrayToString(in.getBytes()) + "\"");
        System.out.println("out = (" + out.length() + ") \"" + arrayToString(out.getBytes()) + "\"");
        throw new RuntimeException("Incorrect result");
    }
}
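To see whether `out` merely differs in length (for example, trailing bytes from the backing array) or actually diverges in content, a small diagnostic helper along these lines could be dropped next to that check; this is hypothetical, not part of DecodeBench:

// Hypothetical debugging aid: report where the round-tripped string first
// diverges from the input, to distinguish a length mismatch from corrupted content.
static void reportFirstMismatch(String in, String out) {
    int n = Math.min(in.length(), out.length());
    for (int i = 0; i < n; i++) {
        if (in.charAt(i) != out.charAt(i)) {
            System.out.printf("first mismatch at index %d: in=%04x out=%04x%n",
                    i, (int) in.charAt(i), (int) out.charAt(i));
            return;
        }
    }
    System.out.printf("common prefix matches; lengths differ: in=%d out=%d%n",
            in.length(), out.length());
}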
Previously I was only running the Int64Vector perf test, which hid build issues in the other tests. Any further feedback is appreciated.
> Separate from moving the location, but related to mainline integration, I have concerns about the number of benchmarks generated. It may be that we have to curate a smaller default set, while enabling the ability for someone to generate more for local testing. We could consider that later on.
Do you have more thoughts here? Which tests should be included versus excluded?
-------------
PR: https://git.openjdk.java.net/panama-vector/pull/77