RFR: 8261553: Efficient mask generation using BMI2 BZHI instruction [v3]
Jatin Bhateja
jbhateja at openjdk.java.net
Mon Feb 15 18:55:03 UTC 2021
On Fri, 12 Feb 2021 05:59:01 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
>>> Hi Claes, this could be run-to-run variation. In general we now have fewer instructions (one shift operation saved per mask computation) compared to the previous mask generation sequence, so it should always offer better execution latencies.
>>
>> Run-to-run variation would be easy to rule out by running more forks and more iterations to attain statistically significant results. While the instruction manuals suggest latency should be better for this instruction on all CPUs where it's supported, it would be good if there was some clear proof - such as a significant benchmark win - to motivate the added complexity.
>
> BASELINE:
> Result "org.openjdk.bench.java.lang.ArrayCopyUnalignedSrc.testLong":
> 61.037 ns/op
>
> Secondary result "org.openjdk.bench.java.lang.ArrayCopyUnalignedSrc.testLong:·perf":
> Perf stats:
> --------------------------------------------------
>
> 19,739.21 msec task-clock # 0.389 CPUs utilized
> 646 context-switches # 0.033 K/sec
> 12 cpu-migrations # 0.001 K/sec
> 150 page-faults # 0.008 K/sec
> 74,59,83,59,139 cycles # 3.779 GHz (30.73%)
> 1,78,78,79,19,117 instructions # 2.40 insn per cycle (38.48%)
> 24,79,81,63,651 branches # 1256.289 M/sec (38.55%)
> 32,24,89,924 branch-misses # 1.30% of all branches (38.62%)
> 52,56,88,28,472 L1-dcache-loads # 2663.167 M/sec (38.65%)
> 39,00,969 L1-dcache-load-misses # 0.01% of all L1-dcache hits (38.57%)
> 3,74,131 LLC-loads # 0.019 M/sec (30.77%)
> 22,315 LLC-load-misses # 5.96% of all LL-cache hits (30.72%)
> <not supported> L1-icache-loads
> 17,49,997 L1-icache-load-misses (30.72%)
> 52,91,41,70,636 dTLB-loads # 2680.663 M/sec (30.69%)
> 3,315 dTLB-load-misses # 0.00% of all dTLB cache hits (30.67%)
> 4,674 iTLB-loads # 0.237 K/sec (30.65%)
> 33,746 iTLB-load-misses # 721.99% of all iTLB cache hits (30.63%)
> <not supported> L1-dcache-prefetches
> <not supported> L1-dcache-prefetch-misses
>
> 50.723759146 seconds time elapsed
>
> 51.447054000 seconds user
> 0.189949000 seconds sys
>
>
> WITH OPT:
> Result "org.openjdk.bench.java.lang.ArrayCopyUnalignedSrc.testLong":
> 74.356 ns/op
>
> Secondary result "org.openjdk.bench.java.lang.ArrayCopyUnalignedSrc.testLong:·perf":
> Perf stats:
> --------------------------------------------------
>
> 19,741.09 msec task-clock # 0.389 CPUs utilized
> 641 context-switches # 0.032 K/sec
> 17 cpu-migrations # 0.001 K/sec
> 164 page-faults # 0.008 K/sec
> 74,40,40,48,513 cycles # 3.769 GHz (30.81%)
> 1,45,66,22,06,797 instructions # 1.96 insn per cycle (38.56%)
> 20,31,28,43,577 branches # 1028.963 M/sec (38.65%)
> 14,11,419 branch-misses # 0.01% of all branches (38.69%)
> 43,07,86,33,662 L1-dcache-loads # 2182.182 M/sec (38.72%)
> 37,06,744 L1-dcache-load-misses # 0.01% of all L1-dcache hits (38.56%)
> 1,34,292 LLC-loads # 0.007 M/sec (30.72%)
> 30,627 LLC-load-misses # 22.81% of all LL-cache hits (30.68%)
> <not supported> L1-icache-loads
> 14,49,145 L1-icache-load-misses (30.65%)
> 43,44,86,27,516 dTLB-loads # 2200.924 M/sec (30.63%)
> 218 dTLB-load-misses # 0.00% of all dTLB cache hits (30.63%)
> 2,445 iTLB-loads # 0.124 K/sec (30.63%)
> 28,624 iTLB-load-misses # 1170.72% of all iTLB cache hits (30.63%)
> <not supported> L1-dcache-prefetches
> <not supported> L1-dcache-prefetch-misses
>
> 50.716083931 seconds time elapsed
>
> 51.467300000 seconds user
> 0.200390000 seconds sys
>
>
> JMH perf data for ArrayCopyUnalignedSrc.testLong with a copy length of 1200 shows a degradation in L1D accesses; it seems the benchmark got displaced from its sweet spot.
>
> However, there is a significant reduction in instruction count, and the cycle counts are almost comparable; we are saving one shift per mask computation.
>
> OLD Sequence:
> 0x00007f7fc1030ead: movabs $0x1,%rax
> 0x00007f7fc1030eb7: shlx %r8,%rax,%rax
> 0x00007f7fc1030ebc: dec %rax
> 0x00007f7fc1030ebf: kmovq %rax,%k2
> NEW Sequence:
> 0x00007f775d030d51: movabs $0xffffffffffffffff,%rax
> 0x00007f775d030d5b: bzhi %r8,%rax,%rax
> 0x00007f775d030d60: kmovq %rax,%k2
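For reference, the two quoted sequences compute the same mask; BZHI replaces the shlx/dec pair with a single instruction. A minimal C++ sketch of the equivalence using the compiler's BMI2 intrinsic (my own illustration, not HotSpot code; assumes a BMI2-capable CPU and, e.g., g++ -O2 -mbmi2):

#include <cassert>
#include <cstdint>
#include <immintrin.h>

// OLD sequence: movabs $1 / shlx / dec  ==  (1 << n) - 1
static uint64_t mask_old(unsigned n) {
    return (uint64_t{1} << n) - 1;          // valid for n in [0, 63]
}

// NEW sequence: movabs $-1 / bzhi  ==  keep only the low n bits of ~0
static uint64_t mask_new(unsigned n) {
    return _bzhi_u64(~uint64_t{0}, n);      // BZHI zeroes bits [n, 63]
}

int main() {
    for (unsigned n = 0; n < 64; n++)
        assert(mask_old(n) == mask_new(n)); // identical masks, one op fewer
    return 0;
}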
Further analysis of the perf degradation revealed that the new optimized instruction pattern disturbed the code alignment. This led to an increase in LSD (Loop Stream Detector) misses and also reduced UOP caching in the DSB (Decoded Stream Buffer).
Aligning the copy loops at a 32-byte boundary prevents any adverse impact on UOP caching.
The NOPs used for padding add to the instruction count and may thus overshadow the code-size gains from the new mask generation sequence in the copy stubs.
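As a rough illustration of the padding cost, the number of NOP bytes needed is just the distance from the loop entry to the next 32-byte boundary; a minimal sketch (hypothetical helper, not HotSpot's MacroAssembler code):

#include <cstdint>
#include <cstdio>

// NOP bytes needed to bring address `pc` up to an `align`-byte
// boundary; `align` must be a power of two.
static uint64_t nop_padding(uint64_t pc, uint64_t align) {
    return (align - (pc & (align - 1))) & (align - 1);
}

int main() {
    // e.g. a loop entry at ...0d5b needs 5 NOP bytes to reach ...0d60
    printf("%llu\n", (unsigned long long)nop_padding(0x7f775d030d5bULL, 32));
    return 0;
}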
Baseline:
ArrayCopyAligned.testLong Length : 1200 61 ns/op (approx)
1,93,44,43,11,622 cycles
4,59,57,99,78,727 instructions # 2.38 insn per cycle
1,83,68,75,68,255 idq.dsb_uops
2,08,32,43,71,906 lsd.uops
37,12,54,60,211 idq.mite_uops
With Opt:
ArrayCopyAligned.testLong Length : 1200 74 ns/op (approx)
1,93,51,25,94,766 cycles
3,75,11,57,91,917 instructions # 1.94 insn per cycle
48,67,58,25,566 idq.dsb_uops
19,46,13,236 lsd.uops
2,87,42,95,74,280 idq.mite_uops
With Opt + main loop alignment (NOP):
ArrayCopyAligned.testLong Length : 1200 61 ns/op (approx)
1,93,52,15,90,080 cycles
4,60,89,14,06,528 instructions # 2.38 insn per cycle
1,78,76,10,34,991 idq.dsb_uops
2,09,16,15,84,313 lsd.uops
46,25,31,92,101 idq.mite_uops
While computing the mask for the partial inlining of small copy calls (currently enabled for sub-word types with copy lengths less than 32/64 bytes), the new optimized sequence should always offer a lower instruction count and a shorter latency path (a sketch follows the numbers below).
Baseline:
ArrayCopyAligned.testByte Length : 20 avgt 2 2.635 ns/op
1,97,76,75,18,052 cycles
8,96,00,37,11,803 instructions # 4.53 insn per cycle
2,71,83,79,035 idq.dsb_uops
7,54,82,43,63,409 lsd.uops
3,92,55,74,395 idq.mite_uops
ArrayCopyAligned.testByte Length : 31 avgt 2 2.635 ns/op
1,97,79,16,56,787 cycles
8,96,13,15,69,780 instructions # 4.53 insn per cycle
2,69,07,11,691 idq.dsb_uops
7,54,95,63,77,683 lsd.uops
3,90,19,10,747 idq.mite_uops
With Opt:
ArrayCopyAligned.testByte Length : 20 avgt 2 2.635 ns/op
1,97,66,64,62,541 cycles
8,92,03,95,00,236 instructions # 4.51 insn per cycle
2,72,38,56,205 idq.dsb_uops
7,50,87,50,60,591 lsd.uops
3,89,15,02,954 idq.mite_uops
ArrayCopyAligned.testByte Length : 31 avgt 2 2.635 ns/op
1,97,54,21,61,110 cycles
8,91,46,64,23,754 instructions # 4.51 insn per cycle
2,78,12,19,544 idq.dsb_uops
7,50,35,88,95,843 lsd.uops
3,90,41,97,276 idq.mite_uops
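To make the partially inlined case concrete, here is a minimal C++ sketch of such a masked small copy at the intrinsics level (my own illustration, not the actual generated stub code; assumes AVX-512BW and BMI2, e.g. g++ -O2 -mavx512bw -mbmi2):

#include <cstdint>
#include <cstdio>
#include <immintrin.h>

// Copy n < 64 bytes with one masked load/store pair; the mask is the
// low n bits, produced exactly like the new bzhi + kmovq sequence.
static void masked_copy(void* dst, const void* src, unsigned n) {
    __mmask64 k = (__mmask64)_bzhi_u64(~uint64_t{0}, n);
    __m512i v = _mm512_maskz_loadu_epi8(k, src);  // loads only n bytes
    _mm512_mask_storeu_epi8(dst, k, v);           // stores only n bytes
}

int main() {
    char src[64], dst[64] = {};
    for (int i = 0; i < 64; i++) src[i] = (char)i;
    masked_copy(dst, src, 20);                    // like testByte Length : 20
    printf("%d %d\n", dst[19], dst[20]);          // "19 0": tail untouched
    return 0;
}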
Following are the links to updated JMH perf data:
http://cr.openjdk.java.net/~jbhateja/8261553/JMH_PERF_CLX_BASELINE.txt
http://cr.openjdk.java.net/~jbhateja/8261553/JMH_PERF_CLX_WITH_OPTS_LOOP_ALIGN.txt
In general, the gains are not significant for copy stubs, but the new sequence offers an optimal latency path for mask computation.
-------------
PR: https://git.openjdk.java.net/jdk/pull/2522