[aarch64-port-dev ] population count intrinsic performance

Wed Jun 10 14:06:03 UTC 2015

Hello

I've implemented a preliminary version of popCountI (the intrinsic for java.lang.Integer.bitCount).
For some reason, performance became worse than it was before with the Hacker's Delight version of the algorithm. Benchmarking the assembly code in isolation showed that the new version should, on the contrary, be more efficient (2 cycles shorter):
SIMD - 13 cycles
HD  (baseline)  - 15 cycles
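
For reference, the Hacker's Delight baseline referred to above is, as far as I understand it, the usual branch-free bit-twiddling population count. Below is a minimal Java sketch of that technique. The class name HdBitCount is hypothetical, and the body is assumed to correspond to the pure-Java fallback for Integer.bitCount rather than being copied from the JDK sources.

```java
public class HdBitCount {
    // Branch-free population count in the style of Hacker's Delight.
    // Hypothetical helper for illustration only.
    static int bitCount(int i) {
        // Each 2-bit field now holds the number of set bits it contained.
        i = i - ((i >>> 1) & 0x55555555);
        // Add neighbouring 2-bit counts into 4-bit fields.
        i = (i & 0x33333333) + ((i >>> 2) & 0x33333333);
        // Add neighbouring 4-bit counts into 8-bit fields.
        i = (i + (i >>> 4)) & 0x0f0f0f0f;
        // Fold the four byte counts into the low byte.
        i = i + (i >>> 8);
        i = i + (i >>> 16);
        // The result is at most 32, so 6 bits are enough.
        return i & 0x3f;
    }
}
```

This is the sequence of shifts, masks and adds that the SIMD version is meant to replace.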

For the evaluation in Java I used JMH:

Variant   Benchmark                 Mode  Cnt   Score    Error  Units
SIMD      BitCount.bitCountInteger  avgt    5  16.008 ±  0.016  ns/op
Baseline  BitCount.bitCountInteger  avgt    5  11.131 ±  0.069  ns/op

So I wonder what could cause such a reversal. Could the reason lie in the JVM infrastructure, in how intrinsics are inlined compared with JIT-compiled code?
Any ideas are appreciated.

instruct popCountI(iRegINoSp dst,  iRegIorL2I src) %{
  match(Set dst (PopCountI src));
  ins_cost(INSN_COST * 13);

  format %{ "popCountI TODO\n\t" %}
  ins_encode %{
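      // Move the source GPR into the 64-bit lane 0 of the vector scratch register.
      __ mov(vscratch1, __ T1D, 0, as_Register($src$$reg));
      // Count the set bits in each of the 8 bytes.
      __ cnt(vscratch2, __ T8B, vscratch1);
      // Sum the 8 per-byte counts.
      __ addv(vscratch1, __ T8B, vscratch2);
      // Move the result back into the destination GPR.
      __ mov(as_Register($dst$$reg), vscratch1, __ T1D, 0);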
  %}

  ins_pipe(ialu_reg);
%}
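
To make the intent of the four instructions above explicit, here is a plain-Java model of the sequence. It is only a sketch: the class and method names are hypothetical, the 64-bit source register is modelled as a long, and the helpers only approximate what cnt and addv do on an 8B arrangement.

```java
public class SimdPopCountModel {
    // Models "cnt Vd.8B, Vn.8B": each byte lane is replaced by its own bit count.
    static long cnt8B(long v) {
        long result = 0;
        for (int lane = 0; lane < 8; lane++) {
            int b = (int) ((v >>> (8 * lane)) & 0xFF);
            int count = 0;
            for (int bit = 0; bit < 8; bit++) {
                count += (b >>> bit) & 1;
            }
            result |= ((long) count) << (8 * lane);
        }
        return result;
    }

    // Models "addv Bd, Vn.8B": the 8 byte lanes are summed into one byte.
    static int addv8B(long v) {
        int sum = 0;
        for (int lane = 0; lane < 8; lane++) {
            sum += (int) ((v >>> (8 * lane)) & 0xFF);
        }
        return sum & 0xFF;
    }

    // Models the whole mov / cnt / addv / mov sequence on a 64-bit register value.
    static int popCountViaSimd(long srcRegister) {
        return addv8B(cnt8B(srcRegister));
    }
}
```

In this model the result equals Integer.bitCount(x) only when the upper 32 bits of the modelled register are zero, because the 1D move transfers all 64 bits. Whether the real register can hold non-zero upper bits at this point is an assumption I have not verified.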

JMH benchmark (just the one routine; the rest is as usual):

    @Benchmark
    public void bitCountInteger(final Blackhole bh) {
        bh.consume(java.lang.Integer.bitCount(0));
    }

Thanks,
Alexander