[aarch64-port-dev ] population count intrinsic performance
Alexeev, Alexander
Alexander.Alexeev at caviumnetworks.com
Wed Jun 10 14:06:03 UTC 2015
Hello
I've implemented a preliminary version of popCountI (the intrinsic for java.lang.Integer.bitCount).
For some reason, performance became worse than it was with the Hacker's Delight version of the algorithm. Benchmarking the assembly code on its own showed that the new version should, on the contrary, be more efficient (2 cycles shorter).
SIMD - 13 cycles
HD (baseline) - 15 cycles
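For reference, the Hacker's Delight style baseline can be sketched in Java as below. This is a minimal sketch, assuming the baseline is the usual parallel bit-summing routine used for Integer.bitCount; the class name HdBitCount is hypothetical and not part of the original post.

```java
// Hypothetical illustration class; not taken from the original post.
final class HdBitCount {
    // Parallel (SWAR) population count in the Hacker's Delight style.
    static int bitCount(int i) {
        i = i - ((i >>> 1) & 0x55555555);                 // per 2-bit field: number of set bits
        i = (i & 0x33333333) + ((i >>> 2) & 0x33333333);  // per 4-bit field
        i = (i + (i >>> 4)) & 0x0f0f0f0f;                 // per byte
        i = i + (i >>> 8);                                // fold bytes together
        i = i + (i >>> 16);
        return i & 0x3f;                                  // result is in 0..32
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, 0xFF, 0x80000000, -1};
        for (int v : samples) {
            System.out.println(Integer.toHexString(v) + " -> " + bitCount(v)
                    + " (Integer.bitCount: " + Integer.bitCount(v) + ")");
        }
    }
}
```

Each step is a shift, mask or add on a general-purpose register, so the whole routine stays in the integer pipeline.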
For evaluation in Java I used JMH:
          Benchmark                 Mode  Cnt   Score    Error  Units
SIMD      BitCount.bitCountInteger  avgt    5  16.008 ±  0.016  ns/op
Baseline  BitCount.bitCountInteger  avgt    5  11.131 ±  0.069  ns/op
So I wonder what could cause such a reversal. Could the reason lie in the JVM infrastructure and in how intrinsics are inlined compared with JIT-compiled code?
Any ideas are appreciated.
instruct popCountI(iRegINoSp dst, iRegIorL2I src) %{
  match(Set dst (PopCountI src));
  ins_cost(INSN_COST * 13);
  format %{ "popCountI TODO\n\t" %}
  ins_encode %{
    __ mov(vscratch1, __ T1D, 0, as_Register($src$$reg));  // general register -> 64-bit lane 0
    __ cnt(vscratch2, __ T8B, vscratch1);                   // set-bit count of each of the 8 bytes
    __ addv(vscratch1, __ T8B, vscratch2);                  // sum the 8 byte counts
    __ mov(as_Register($dst$$reg), vscratch1, __ T1D, 0);   // lane 0 -> general register
  %}
  ins_pipe(ialu_reg);
%}
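What the four-instruction sequence above computes can be modelled in plain Java as follows. This is only an illustrative sketch: the class SimdPopCountModel and its method are hypothetical, and the model zero-extends the int input, which is an assumption about the upper half of the source register that the post does not state.

```java
// Hypothetical scalar model of the mov / cnt / addv / mov sequence; not from the original post.
final class SimdPopCountModel {
    static int popCount(int src) {
        // mov to vector lane 0: modelled as a 64-bit value.
        // ASSUMPTION: the upper 32 bits are zero (zero-extension of the int).
        long lane = src & 0xFFFFFFFFL;

        // cnt (8 byte lanes): count the set bits of each byte independently.
        int[] perByte = new int[8];
        for (int b = 0; b < 8; b++) {
            int value = (int) ((lane >>> (8 * b)) & 0xFF);
            int count = 0;
            while (value != 0) {
                count += value & 1;
                value >>>= 1;
            }
            perByte[b] = count;
        }

        // addv: add all byte lanes into a single scalar.
        int sum = 0;
        for (int c : perByte) {
            sum += c;
        }

        // mov back to a general register.
        return sum;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, 0xFF, 0x80000000, -1};
        for (int v : samples) {
            System.out.println(Integer.toHexString(v) + " -> " + popCount(v));
        }
    }
}
```

The model makes the data flow explicit: one transfer into the vector register file, two vector operations, and one transfer back out.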
JMH benchmark (just one routine; the rest is as usual):
@Benchmark
public void bitCountInteger(final Blackhole bh) {
    bh.consume(java.lang.Integer.bitCount(0));
}
Thanks,
Alexander