Optimized version of CopiesList.hashCode()

Fri Nov 30 06:37:40 UTC 2018

Thanks, Tagir!

I was also thinking of how to calculate hashCode quickly but my direction
was wrong. I thought that we can use the formula of the sum of a geometric
progression: Sum(p^k, k = 0..n) = (1-p^n)/(1-p). Unfortunately, this
involves division which doesn't work with the overflow of integers. I
didn't know the trick p^(2n) = (p^n)^2.

Your formulas and implementation look correct.

I would probably rewrite the loop to make it a bit simpler:
for (int mask = n << (Integer.numberOfLeadingZeros(n) + 1); mask != 0; mask
<<= 1) {
    sum = sum * (pow + 1);
    pow *= pow;
    if ((mask & 0x8000_0000) != 0) {
        pow *= 31;
        sum = sum * 31 + 1;
    }
}

But that's just a matter of style.

Great job!

пт, 30 нояб. 2018 г. в 11:02, Tagir Valeev <amaembo at gmail.com>:

> Hello!
>
> If you are doing it fast, why not doing it really fast? If you
> deparenthesize and regroup terms, you'll got
> h(e, n) = p ^ n + e * f(n)
> Where h(e, n) is the hashCode of n elements with hashCode of single
> element = e; p = 31 and
> f(n) = Sum(p^k, k = 0..n-1)
>
> Using simple algebraic rules, you'll get:
> p ^ (2n) = (p^n)^2
> f(2n) = f(n) * (p^n + 1)
> p ^ (n+1) = (p^n)*p
> f(n+1) = f(n) * p + 1
>
> Thus the algorithm may look as follows:
>
> public int hashCode() {
>     int pow = 1; // -> 31^n
>     int sum = 0; // -> Sum(31^k, k = 0..n-1)
>     for (int i = Integer.numberOfLeadingZeros(n); i < Integer.SIZE; i++) {
>         sum = sum * (pow + 1);
>         pow *= pow;
>         if (((n << i) & 0x8000_0000) != 0) {
>             pow *= 31;
>             sum = sum * 31 + 1;
>         }
>     }
>     return pow + sum * (element == null ? 0 : element.hashCode());
> }
>
> It seems reasonable to peel off the n = 0 case, which helps to reduce
> number of multiplications for other cases (also for n = 0 we don't
> need to calculate element hashCode at all):
>
> public int hashCode() {
>     if (n == 0) return 1;
>     int pow = 31; // -> 31^n
>     int sum = 1; // -> Sum(31^k, k = 0..n-1)
>     for (int i = Integer.numberOfLeadingZeros(n) + 1; i < Integer.SIZE;
> i++) {
>         sum = sum * (pow + 1);
>         pow *= pow;
>         if (((n << i) & 0x8000_0000) != 0) {
>             pow *= 31;
>             sum = sum * 31 + 1;
>         }
>     }
>     return pow + sum * (element == null ? 0 : element.hashCode());
> }
>
> Assuming that element hashCode is simple (I used String "foo" as an
> element), I got the following results for different collection sizes:
>
> Benchmark         (size)         Score        Error  Units
> hashCodeFast           0         2,299 ±      0,017  ns/op
> hashCodeFast           1         2,731 ±      0,021  ns/op
> hashCodeFast           2         4,073 ±      0,077  ns/op
> hashCodeFast           3         4,315 ±      0,032  ns/op
> hashCodeFast           5         5,470 ±      0,074  ns/op
> hashCodeFast          10         6,904 ±      0,060  ns/op
> hashCodeFast          30         9,102 ±      0,173  ns/op
> hashCodeFast         100        10,093 ±      0,069  ns/op
> hashCodeFast        1000        14,129 ±      0,074  ns/op
> hashCodeFast       10000        17,028 ±      0,249  ns/op
> hashCodeFast      100000        20,795 ±      0,194  ns/op
> hashCodeFast     1000000        23,622 ±      0,264  ns/op
>
> Compared to Zheka's implementation:
>
> Benchmark         (size)         Score        Error  Units
> hashCodeZheka          0         2,584 ±      0,024  ns/op
> hashCodeZheka          1         2,868 ±      0,022  ns/op
> hashCodeZheka          2         3,730 ±      0,030  ns/op
> hashCodeZheka          3         4,323 ±      0,027  ns/op
> hashCodeZheka          5         5,285 ±      0,037  ns/op
> hashCodeZheka         10         8,254 ±      0,057  ns/op
> hashCodeZheka         30        24,793 ±      0,218  ns/op
> hashCodeZheka        100        89,017 ±      0,764  ns/op
> hashCodeZheka       1000       923,792 ±     28,194  ns/op
> hashCodeZheka      10000      9157,411 ±     98,902  ns/op
> hashCodeZheka     100000     91705,599 ±    689,299  ns/op
> hashCodeZheka    1000000    919723,545 ±  13092,935  ns/op
>
> So results are quite similar for one-digit counts, but we start
> winning from n = 10 and after that logarithmic algorithm really rocks.
>
> I can file an issue and create a webrev, but I still need a sponsor
> and review for such change. Martin, can you help with this?
>
> With best regards,
> Tagir Valeev.
> On Tue, Nov 27, 2018 at 5:49 PM Martin Buchholz <martinrb at google.com>
> wrote:
> >
> > I agree!
> >
> > (but don't have time ...)
> >
> > On Sun, Nov 25, 2018 at 9:01 PM, Zheka Kozlov <orionllmain at gmail.com>
> wrote:
> >
> > > Currently, CopiesList.hashCode() is inherited from AbstractList which:
> > >
> > >    - calls hashCode() for each element,
> > >    - creates a new Iterator every time.
> > >
> > > However, for Collections.nCopies():
> > >
> > >    - All elements are the same. So hashCode() can be called only once.
> > >    - An Iterator is unnecessary.
> > >
> > > So, I propose overridding hashCode() implementation for CopiesList:
> > >
> > > @Override
> > > public int hashCode() {
> > >     int hashCode = 1;
> > >     final int elementHashCode = (element == null) ? 0 :
> element.hashCode();
> > >     for (int i = 0; i < n; i++) {
> > >         hashCode = 31*hashCode + elementHashCode;
> > >     }
> > >     return hashCode;
> > > }
> > >
> > > Benchmark:
> > > List<List<String>> list = Collections.nCopies(10_000, new
> > > ArrayList<>(Collections.nCopies(1_000_000, "a")));
> > > long nano = System.nanoTime();
> > > System.out.println(list.hashCode());
> > > System.out.println((System.nanoTime() - nano) / 1_000_000);
> > >
> > > Result:
> > > Old version - ~12 seconds.
> > > New version - ~10 milliseconds.
> > >
>