RFR: 8261553: Efficient mask generation using BMI2 BZHI instruction [v2]
Jatin Bhateja
jbhateja at openjdk.java.net
Fri Feb 12 06:01:37 UTC 2021
On Thu, 11 Feb 2021 14:28:01 GMT, Claes Redestad <redestad at openjdk.org> wrote:
> > Hi Claes, this could be run-to-run variation. In general we now have fewer instructions (one shift operation saved per mask computation) compared to the previous mask generation sequence, and thus it will always offer better execution latencies.
>
> Run-to-run variation would be easy to rule out by running more forks and more iterations to attain statistically significant results. While the instruction manuals suggest latency should be better for this instruction on all CPUs where it's supported, it would be good if there was some clear proof - such as a significant benchmark win - to motivate the added complexity.
BASELINE:
Result "org.openjdk.bench.java.lang.ArrayCopyUnalignedSrc.testLong":
61.037 ns/op
Secondary result "org.openjdk.bench.java.lang.ArrayCopyUnalignedSrc.testLong:·perf":
Perf stats:
--------------------------------------------------
19,739.21 msec task-clock # 0.389 CPUs utilized
646 context-switches # 0.033 K/sec
12 cpu-migrations # 0.001 K/sec
150 page-faults # 0.008 K/sec
74,59,83,59,139 cycles # 3.779 GHz (30.73%)
1,78,78,79,19,117 instructions # 2.40 insn per cycle (38.48%)
24,79,81,63,651 branches # 1256.289 M/sec (38.55%)
32,24,89,924 branch-misses # 1.30% of all branches (38.62%)
52,56,88,28,472 L1-dcache-loads # 2663.167 M/sec (38.65%)
39,00,969 L1-dcache-load-misses # 0.01% of all L1-dcache hits (38.57%)
3,74,131 LLC-loads # 0.019 M/sec (30.77%)
22,315 LLC-load-misses # 5.96% of all LL-cache hits (30.72%)
<not supported> L1-icache-loads
17,49,997 L1-icache-load-misses (30.72%)
52,91,41,70,636 dTLB-loads # 2680.663 M/sec (30.69%)
3,315 dTLB-load-misses # 0.00% of all dTLB cache hits (30.67%)
4,674 iTLB-loads # 0.237 K/sec (30.65%)
33,746 iTLB-load-misses # 721.99% of all iTLB cache hits (30.63%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
50.723759146 seconds time elapsed
51.447054000 seconds user
0.189949000 seconds sys
WITH OPT:
Result "org.openjdk.bench.java.lang.ArrayCopyUnalignedSrc.testLong":
74.356 ns/op
Secondary result "org.openjdk.bench.java.lang.ArrayCopyUnalignedSrc.testLong:·perf":
Perf stats:
--------------------------------------------------
19,741.09 msec task-clock # 0.389 CPUs utilized
641 context-switches # 0.032 K/sec
17 cpu-migrations # 0.001 K/sec
164 page-faults # 0.008 K/sec
74,40,40,48,513 cycles # 3.769 GHz (30.81%)
1,45,66,22,06,797 instructions # 1.96 insn per cycle (38.56%)
20,31,28,43,577 branches # 1028.963 M/sec (38.65%)
14,11,419 branch-misses # 0.01% of all branches (38.69%)
43,07,86,33,662 L1-dcache-loads # 2182.182 M/sec (38.72%)
37,06,744 L1-dcache-load-misses # 0.01% of all L1-dcache hits (38.56%)
1,34,292 LLC-loads # 0.007 M/sec (30.72%)
30,627 LLC-load-misses # 22.81% of all LL-cache hits (30.68%)
<not supported> L1-icache-loads
14,49,145 L1-icache-load-misses (30.65%)
43,44,86,27,516 dTLB-loads # 2200.924 M/sec (30.63%)
218 dTLB-load-misses # 0.00% of all dTLB cache hits (30.63%)
2,445 iTLB-loads # 0.124 K/sec (30.63%)
28,624 iTLB-load-misses # 1170.72% of all iTLB cache hits (30.63%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
50.716083931 seconds time elapsed
51.467300000 seconds user
0.200390000 seconds sys
JMH perf data for ArrayCopyUnalignedSrc.testLong with a copy length of 1200 shows a degradation in L1D accesses; it seems the benchmark got displaced from its sweet spot.
But there is a significant reduction in instruction count (about 1.79e11 down to 1.46e11 over the run), and cycles are almost comparable. We are saving one instruction per mask computation: the shlx/dec pair is replaced by a single bzhi.
OLD Sequence:
0x00007f7fc1030ead: movabs $0x1,%rax
0x00007f7fc1030eb7: shlx %r8,%rax,%rax
0x00007f7fc1030ebc: dec %rax
0x00007f7fc1030ebf: kmovq %rax,%k2
NEW Sequence:
0x00007f775d030d51: movabs $0xffffffffffffffff,%rax
0x00007f775d030d5b: bzhi %r8,%rax,%rax
0x00007f775d030d60: kmovq %rax,%k2
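To illustrate why the two sequences are interchangeable, here is a minimal Java sketch that models both computations in software. The class and method names (MaskSketch, maskShiftDec, bzhi, maskBzhi) are made up for illustration and are not part of the patch; the bzhi model follows the instruction's documented behaviour (bits at and above the index are cleared, and the source is passed through unchanged when the index is 64 or more). It checks that both forms agree for lengths 0 through 63.

```java
public class MaskSketch {
    // Model of the OLD sequence: movabs $1 ; shlx ; dec.
    // Like shlx on a 64-bit operand, Java's << on long uses only the low 6 bits of the count.
    static long maskShiftDec(int n) {
        return (1L << n) - 1;
    }

    // Software model of a 64-bit BZHI (hypothetical helper, not a JDK API):
    // clears all bits of src at positions >= index; if index >= 64, src is returned unchanged.
    static long bzhi(long src, int index) {
        int n = index & 0xFF; // BZHI reads only the low 8 bits of the index register
        return n >= 64 ? src : src & ((1L << n) - 1);
    }

    // Model of the NEW sequence: movabs $-1 ; bzhi.
    static long maskBzhi(int n) {
        return bzhi(-1L, n);
    }

    public static void main(String[] args) {
        for (int n = 0; n < 64; n++) {
            if (maskShiftDec(n) != maskBzhi(n)) {
                throw new AssertionError("mismatch at length " + n);
            }
        }
        // A length of 5 selects the low five lanes.
        System.out.println(Long.toHexString(maskBzhi(5))); // prints 1f
    }
}
```

The two models differ only at a length of exactly 64: the shift-and-decrement form wraps around to 0, while the bzhi form yields an all-ones mask.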
-------------
PR: https://git.openjdk.java.net/jdk/pull/2522