Bounds Check Elimination with Fast-Range
August Nagro
augustnagro at gmail.com
Wed Dec 18 19:51:54 UTC 2019
John,
Apologies for the late response, school has been busy lately.
I do like the idea of growing by a 'size table' or similar, especially if
the growth rate decreases as the hashmap size increases. Since the
probability of fragmentation increases with the number of resizes, bumping
down the resize factor could be a good solution.
To remove the bounds-checking in C2, I've taken a look at the subop class.
I think I understand the change, but I'm wondering about the best way to
handle a negative hash value.
In fast-range, the hash and size must each fit in 32 unsigned bits, so that
the result of their multiplication fits in a long. So one way is to write
hash * (long) size, or similar. However, Java has no unsigned multiply, so
a negative hash is sign-extended to a negative long. One could use
Integer::toUnsignedLong, or just `hash & 0xffffffffL` directly. However,
this complicates the code shape, so I wonder if there is a better way. One
option might be to make a Math::fastRange(int hash, int size) method that
is a HotSpot intrinsic and directly uses the unsigned instructions. I would
appreciate some guidance on this.
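For reference, the masking variant could be sketched in pure Java as below
(this is only an illustration of the code shape under discussion, not a JDK
method; the class and method names are placeholders):

```java
// Sketch of a fast-range reduction that handles a negative hash by
// masking it to an unsigned value before the 32x32 -> 64-bit multiply.
public class FastRange {
    static int fastRange(int hash, int size) {
        // hash & 0xffffffffL == Integer.toUnsignedLong(hash): avoids
        // sign extension, so the product is the true unsigned product.
        long h = hash & 0xffffffffL;
        return (int) ((h * size) >>> 32); // high word: always in [0, size)
    }
}
```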
- August
On Tue, Dec 3, 2019 at 12:06 AM John Rose <john.r.rose at oracle.com> wrote:
> On Nov 18, 2019, at 12:17 PM, Florian Weimer <fweimer at redhat.com> wrote:
>
>
> int bucket = fr(h * M); // M = 0x2357BD or something
>
> or maybe something fast and sloppy like:
>
> int bucket = fr(h + (h << 8));
>
>
> Surely this one works, since fr is the final operation.
> The shift/add is just a mixing step to precondition the input.
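For concreteness, both preconditioning variants might look like this in
Java (fr here is an assumed fast-range helper and M the example constant
from the mail, not existing JDK code):

```java
// Preconditioning mixes applied before the final fast-range reduction,
// so that low input bits also influence the bucket index.
public class Premix {
    static final int M = 0x2357BD; // example multiplier from the mail

    static int fr(int h, int size) {
        // fast-range: high word of an unsigned 32x32 -> 64-bit multiply
        return (int) (((h & 0xffffffffL) * size) >>> 32);
    }

    static int bucketMul(int h, int size) {
        return fr(h * M, size); // multiply mixing step
    }

    static int bucketShiftAdd(int h, int size) {
        return fr(h + (h << 8), size); // "fast and sloppy" shift/add mix
    }
}
```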
>
> Just for the record, I’d like to keep brainstorming a bit more,
> though surely this sort of thing is of limited interest. So,
> just a little more from me on this.
>
> If we had BITR I’d want to try something like fr(h - bitr(h)).
>
> But what I keep wishing for is a good one- or two-cycle instruction
> that will mix all input bits into all the output bits, so that any
> change in one input bit is likely to cause cascading changes in
> many output bits, perhaps even 50% on average. A pair of AES
> steps is a good example of this. I think AES would be superior to
> multiply (for mixing) when used on 128 bit payloads or larger, so
> it looks appealing (to me) for vectorizable hashing applications.
> Though it is overkill on scalars, I think it points in promising
> directions for scalars also.
>
>
> or even:
>
> int bucket = fr(h) ^ (h & (N-1));
>
> Does this really work? I don't think so.
>
>
> Oops, you are right. Something like it might work, though.
>
> The idea, on paper, is that h & (N-1) is less than N, for any N >=1.
> And if N-1 has a high enough pop-count the information content is close to
> 50% of h (though maybe 50% of the bits are masked out). The xor of two
> quasi-independent values both less than N is, well, less than
> 2^(ceil lg N), not N, which is a bug. Oops. There are ways to quickly
> combine two values
> less than N and reduce the result to less than N: You do a conditional
> subtract of N if the sum is >= N.
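That reduction step can be sketched in a few lines of Java (an
illustrative helper, assuming both inputs are already in [0, N)):

```java
// Combine two values a, b in [0, N) by addition with a conditional
// subtract, keeping the result in [0, N).
public class AddMod {
    static int addMod(int a, int b, int N) {
        int s = a + b;             // s is in [0, 2N)
        return s >= N ? s - N : s; // conditional subtract of N
    }
}
```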
>
> So the tactical area I’m trying to explore here is to take two reduced
> hashes developed in parallel, which depend on different bits of the
> input, and combine them into a single stronger hash (not by ^).
>
> Maybe (I can’t resist hacking at it some more):
>
> int h1 = H1(h), h2 = H2(h);
> int bucket = CCA(h1 - h2, N);
> // where H1 := fr, H2(h) := (h & (N-1))
> // where CCA(x, N) := x + ((x >> 31) & N) // Conditional Compensating Add
>
> In this framework, a better H2 for favoring the low bits of h might be
> H2(h) := ((1<<L)-1) & h, where L = (ceil lg N)-1. This arguably maximizes
> the number of low bits of h that feed into the final bucket selection,
> while fr (H1) arguably maximizes the number of influential high bits.
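Put together, the combiner sketched above might read as follows in Java
(the class and method names are illustrative, not from any JDK API):

```java
// Two reduced hashes in [0, N), merged by subtraction plus the
// Conditional Compensating Add. x = h1 - h2 lies in (-N, N); when
// x < 0, (x >> 31) is all ones, so (x >> 31) & N adds back N, landing
// the result in [0, N).
public class Cca {
    static int bucket(int h, int N) {
        int h1 = (int) (((h & 0xffffffffL) * N) >>> 32); // H1 := fr(h)
        int h2 = h & (N - 1);                            // H2 := h & (N-1)
        int x = h1 - h2;
        return x + ((x >> 31) & N);                      // CCA(x, N)
    }
}
```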
>
> I think this kind of perturbation is quite expensive. Arm's BITR should
> be helpful here.
>
>
> Yes, BITR is a helpful building block. If I understand you correctly, it
> needs to be combined with other operations, such as multiply, shift, xor,
> etc.,
> and can overcome biases towards high bits or towards low bits that come
> from the simple arithmetic definitions of the other mixing operations.
>
> The hack with CCA(h1 - h2, N) seems competitive with a BITR-based
> mixing step, since H2 can be very simple.
>
> A scalar variant of two AES steps (with xor of a second register or
> constant
> parameter at both stages) would be a better building block for strongly
> mixing bits.
> Or some other shallow linear network with a layer of non-linear S-boxes.
>
> But even though this operation is commonly needed and
> easily implemented in hardware, it's rarely found in CPUs.
>
>
> Yep; the whole cottage industry of building clever mixing functions
> out of hand calculator functions could be sidelined if CPUs gave us good
> cheap mixing primitives out of the box. The crypto literature is full of
> them, and many are designed to be easy to implement in silicon.
>
> — John
>
> P.S. I mention AES because I’m familiar with that bit of crypto tech, and
> also because I actually tried it out once on the smhasher quality
> benchmark.
> No surprise in hindsight; it passes the quality tests with just two rounds.
> Given that it is as cheap as multiplication, and handles twice as many
> bits at a time, but requires two steps for full mixing, it would seem to
> be competitive with multiplication as a mixing step. It has no built-in
> biases towards high or low bits, so that’s an advantage over
> multiplication.
>
> Why two rounds? The one-round version has flaws, as a hash function,
> which are obvious on inspection of the simple structure of an AES round.
> Not every output bit is data-dependent on every input bit of one round,
> but two rounds swirl them all together. Are back-to-back AES rounds
> expensive? Maybe, although that’s how the instructions are designed to
> be used, about 10 of them back to back to do real crypto.
>
>
More information about the hotspot-compiler-dev
mailing list