Bounds Check Elimination with Fast-Range
John Rose
john.r.rose at oracle.com
Tue Dec 3 06:06:39 UTC 2019
On Nov 18, 2019, at 12:17 PM, Florian Weimer <fweimer at redhat.com> wrote:
>
>> int bucket = fr(h * M); // M = 0x2357BD or something
>>
>> or maybe something fast and sloppy like:
>>
>> int bucket = fr(h + (h << 8));
Surely this one works, since fr is the final operation.
The shift/add is just a mixing step to precondition the input.
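For concreteness, the fr I mean is the usual fast-range reduction (Lemire's
fastrange trick), which maps a 32-bit hash onto [0, N) with a multiply and a
shift instead of a modulus. A rough Java sketch, with the table size N passed
explicitly (the signature is mine, just for illustration):

  static int fr(int h, int N) {
      // treat h as an unsigned 32-bit value and keep the high half of h * N,
      // which lands in [0, N)
      return (int) (((h & 0xFFFF_FFFFL) * N) >>> 32);
  }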
Just for the record, I’d like to keep brainstorming a bit more,
though surely this sort of thing is of limited interest. So,
just a little more from me on this.
If we had BITR I’d want to try something like fr(h - bitr(h)).
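In Java terms, Integer.reverse can stand in for BITR, so that experiment would
look roughly like this (reusing the fr sketch above; just an illustration, not
something I have measured):

  static int bucketWithBitr(int h, int N) {
      // bitr(h) reverses the bit order, so h - bitr(h) lets the low and
      // high halves of h perturb each other before the final reduction
      return fr(h - Integer.reverse(h), N);
  }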
But what I keep wishing for is a good one- or two-cycle instruction
that will mix all input bits into all the output bits, so that any
change in one input bit is likely to cause cascading changes in
many output bits, perhaps even 50% on average. A pair of AES
steps is a good example of this. I think AES would be superior to
multiply (for mixing) when used on 128 bit payloads or larger, so
it looks appealing (to me) for vectorizable hashing applications.
Though it is overkill on scalars, I think it points in promising
directions for scalars also.
>>
>> or even:
>>
>> int bucket = fr(h) ^ (h & (N-1));
> Does this really work? I don't think so.
Oops, you are right. Something like it might work, though.
The idea, on paper, is that h & (N-1) is less than N, for any N >= 1.
And if N-1 has a high enough pop-count, the information content is close to
50% of h (though maybe 50% of the bits are masked out). The xor of two
quasi-independent values both less than N is, well, less than 2^(ceil lg N),
not N, which is a bug. Oops. There are ways to quickly combine two values
less than N and reduce the result to less than N: You do a conditional
subtract of N if the sum is >= N.
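In code that reduction step is just the following (a sketch; it assumes both
inputs are already in [0, N) and that N is small enough, say N <= 2^30, that
the sum cannot overflow):

  static int addMod(int a, int b, int N) {
      int s = a + b;              // a, b in [0, N), so s is in [0, 2N)
      return s >= N ? s - N : s;  // conditional subtract keeps the result in [0, N)
  }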
So the tactical area I’m trying to explore here is to take two reduced
hashes developed in parallel, which depend on different bits of the
input, and combine them into a single stronger hash (not by ^).
Maybe (I can’t resist hacking at it some more):
int h1 = H1(h), h2 = H2(h);
int bucket = CCA(h1 - h2, N);
// where H1(h) := fr(h), H2(h) := (h & (N-1))
// where CCA(x, N) := x + ((x >> 31) & N) // Conditional Compensating Add
In this framework, a better H2 for favoring the low bits of h might be
H2(h) := ((1<<L)-1) & h, where L = (ceil lg N)-1. This arguably maximizes
the number of low bits of h that feed into the final bucket selection,
while fr (H1) arguably maximizes the number of influential high bits.
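Spelled out as a unit (again just a sketch, reusing the fr above; the method
name and the explicit N parameter are mine):

  static int bucket(int h, int N) {                        // N >= 2
      int k  = 32 - Integer.numberOfLeadingZeros(N - 1);   // ceil lg N
      int h1 = fr(h, N);                                   // H1: favors the high bits of h
      int h2 = h & ((1 << (k - 1)) - 1);                   // H2: low L = k-1 bits, always < N
      int x  = h1 - h2;                                    // somewhere in (-N, N)
      return x + ((x >> 31) & N);                          // CCA: add N back if x went negative
  }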
> I think this kind of perturbation is quite expensive. Arm's BITR should
> be helpful here.
Yes, BITR is a helpful building block. If I understand you correctly, it
needs to be combined with other operations, such as multiply, shift, xor, etc.,
and can overcome biases towards high bits or towards low bits that come
from the simple arithmetic definitions of the other mixing operations.
The hack with CCA(h1 - h2, N) seems competitive with a BITR-based
mixing step, since H2 can be very simple.
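For comparison, the kind of BITR-based mixing step I have in mind would be
something on the order of the following (purely illustrative; the constant is
the one from above, and fr would still be applied to the result):

  static int bitrMix(int h) {
      h *= 0x2357BD;                   // multiply pushes entropy toward the high bits
      return h ^ Integer.reverse(h);   // bit reversal folds the high bits back down
  }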
A scalar variant of two AES steps (with xor of a second register or constant
parameter at both stages) would be a better building block for strongly mixing bits.
Or some other shallow linear network with a layer of non-linear S-boxes.
> But even though this operation is commonly needed and
> easily implemented in hardware, it's rarely found in CPUs.
Yep; the whole cottage industry of building clever mixing functions
out of hand calculator functions could be sidelined if CPUs gave us good
cheap mixing primitives out of the box. The crypto literature is full of
them, and many are designed to be easy to implement in silicon.
— John
P.S. I mention AES because I’m familiar with that bit of crypto tech, and
also because I actually tried it out once on the smhasher quality benchmark.
No surprise in hindsight; it passes the quality tests with just two rounds.
Given that it is as cheap as multiplication, and handles twice as many
bits at a time, but requires two steps for full mixing, it would seem to
be competitive with multiplication as a mixing step. It has no built-in
biases towards high or low bits, so that’s an advantage over multiplication.
Why two rounds? The one-round version has flaws, as a hash function,
which are obvious on inspection of the simple structure of an AES round.
Not every output bit of a single round is data-dependent on every input bit,
but two rounds swirl them all together. Are back-to-back AES rounds
expensive? Maybe, although that’s how the instructions are designed to
be used, about 10 of them back to back to do real crypto.