[aarch64-port-dev ] population count intrinsic performance

Thu Jun 11 16:20:24 UTC 2015

On Thu, 2015-06-11 at 08:10 +0000, Alexeev, Alexander wrote:
> +
> +instruct popCountI(iRegINoSp dst,  iRegIorL2I src, vRegD tmp) %{
> +  match(Set dst (PopCountI src));
> +  effect(TEMP tmp);
> +  ins_cost(INSN_COST * 13);
> +
> +  format %{ "TODO popCountI\n\t" %}
> +  ins_encode %{
> +    __ mov($tmp$$FloatRegister, __ T1D, 0, as_Register($src$$reg));
> +    __ cnt($tmp$$FloatRegister, __ T8B, $tmp$$FloatRegister);
> +    __ addv($tmp$$FloatRegister, __ T8B, $tmp$$FloatRegister);
> +    __ mov(as_Register($dst$$reg), $tmp$$FloatRegister, __ T1D, 0);
> +  %}

I think there may be a problem with the way 'src' is used here. The mov
copies all 64 bits of src into tmp, and cnt/addv then count across all 8
bytes, so you are assuming that the top 32 bits of src are 0. However,
this may not be the case if, for example, src is the result of an elided
L2I conversion.
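
To make the concern concrete, here is a rough Java model of the arithmetic
(not of the actual code generation). It assumes the register still holds a
full 64-bit long whose l2i was elided. The class and method names are made
up for illustration; only Integer.bitCount is a real API.

```java
public class PopCountUpperBits {
    // Hypothetical model of the posted sequence: all 64 bits of the source
    // register are moved to the vector register, cnt counts the set bits in
    // each of the 8 bytes, and addv sums those 8 per-byte counts.
    static int popCountWholeRegister(long reg) {
        int sum = 0;
        for (int i = 0; i < 8; i++) {
            int b = (int) ((reg >>> (8 * i)) & 0xFF); // one byte lane
            sum += Integer.bitCount(b);               // per-byte count (cnt)
        }
        return sum;                                   // lane sum (addv)
    }

    // What PopCountI is expected to return: the count over the low 32 bits only.
    static int popCountLowWord(long reg) {
        return Integer.bitCount((int) reg);
    }

    public static void main(String[] args) {
        // Assumed register contents: a long whose upper word is all ones.
        long reg = 0xFFFF_FFFF_0000_0001L;
        System.out.println(popCountWholeRegister(reg)); // prints 33
        System.out.println(popCountLowWord(reg));       // prints 1
    }
}
```

The two results agree only when the upper 32 bits of the register are zero,
which is exactly the assumption in question.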

See the following comment in aarch64.ad:

// iRegIorL2I is used for src inputs in rules for 32 bit int (I)
// operations. it allows the src to be either an iRegI or a (ConvL2I
// iRegL). in the latter case the l2i normally planted for a ConvL2I
// can be elided because the 32-bit instruction will just employ the
// lower 32 bits anyway.

What I am not clear on is whether using iRegI here instead of
iRegIorL2I would guarantee that the top 32 bits are 0.

All the best,
Ed.