Optimized version of CopiesList.hashCode()

Fri Nov 30 07:36:02 UTC 2018

>  If n == 1, then it would become `mask = n << 32`, and the loop would
run 32 times.
Forget my implementation. It is incorrect at all. In only works for odd
numbers :(

пт, 30 нояб. 2018 г. в 13:58, Ivan Gerasimov <ivan.gerasimov at oracle.com>:

> Hi Zheka and Tagir!
>
>
> On 11/29/18 10:37 PM, Zheka Kozlov wrote:
> > Thanks, Tagir!
> >
> > I was also thinking of how to calculate hashCode quickly but my direction
> > was wrong. I thought that we can use the formula of the sum of a
> geometric
> > progression: Sum(p^k, k = 0..n) = (1-p^n)/(1-p). Unfortunately, this
> > involves division which doesn't work with the overflow of integers. I
> > didn't know the trick p^(2n) = (p^n)^2.
> >
> > Your formulas and implementation look correct.
> >
> > I would probably rewrite the loop to make it a bit simpler:
> > for (int mask = n << (Integer.numberOfLeadingZeros(n) + 1); mask != 0;
> mask
> > <<= 1) {
>
> If n == 1, then it would become `mask = n << 32`, and the loop would run
> 32 times.
>
>
> The check
>      if (((n << i) & 0x8000_0000) != 0) {
> might be written as
>      if ((n << i) < 0 ) {
> to save one bit-wise operation and avoid using extra constant.
>
>
> With kind regards,
> Ivan
>
> >      sum = sum * (pow + 1);
> >      pow *= pow;
> >      if ((mask & 0x8000_0000) != 0) {
> >          pow *= 31;
> >          sum = sum * 31 + 1;
> >      }
> > }
> >
> > But that's just a matter of style.
> >
> > Great job!
> >
> > пт, 30 нояб. 2018 г. в 11:02, Tagir Valeev <amaembo at gmail.com>:
> >
> >> Hello!
> >>
> >> If you are doing it fast, why not doing it really fast? If you
> >> deparenthesize and regroup terms, you'll got
> >> h(e, n) = p ^ n + e * f(n)
> >> Where h(e, n) is the hashCode of n elements with hashCode of single
> >> element = e; p = 31 and
> >> f(n) = Sum(p^k, k = 0..n-1)
> >>
> >> Using simple algebraic rules, you'll get:
> >> p ^ (2n) = (p^n)^2
> >> f(2n) = f(n) * (p^n + 1)
> >> p ^ (n+1) = (p^n)*p
> >> f(n+1) = f(n) * p + 1
> >>
> >> Thus the algorithm may look as follows:
> >>
> >> public int hashCode() {
> >>      int pow = 1; // -> 31^n
> >>      int sum = 0; // -> Sum(31^k, k = 0..n-1)
> >>      for (int i = Integer.numberOfLeadingZeros(n); i < Integer.SIZE;
> i++) {
> >>          sum = sum * (pow + 1);
> >>          pow *= pow;
> >>          if (((n << i) & 0x8000_0000) != 0) {
> >>              pow *= 31;
> >>              sum = sum * 31 + 1;
> >>          }
> >>      }
> >>      return pow + sum * (element == null ? 0 : element.hashCode());
> >> }
> >>
> >> It seems reasonable to peel off the n = 0 case, which helps to reduce
> >> number of multiplications for other cases (also for n = 0 we don't
> >> need to calculate element hashCode at all):
> >>
> >> public int hashCode() {
> >>      if (n == 0) return 1;
> >>      int pow = 31; // -> 31^n
> >>      int sum = 1; // -> Sum(31^k, k = 0..n-1)
> >>      for (int i = Integer.numberOfLeadingZeros(n) + 1; i < Integer.SIZE;
> >> i++) {
> >>          sum = sum * (pow + 1);
> >>          pow *= pow;
> >>          if (((n << i) & 0x8000_0000) != 0) {
> >>              pow *= 31;
> >>              sum = sum * 31 + 1;
> >>          }
> >>      }
> >>      return pow + sum * (element == null ? 0 : element.hashCode());
> >> }
> >>
> >> Assuming that element hashCode is simple (I used String "foo" as an
> >> element), I got the following results for different collection sizes:
> >>
> >> Benchmark         (size)         Score        Error  Units
> >> hashCodeFast           0         2,299 ±      0,017  ns/op
> >> hashCodeFast           1         2,731 ±      0,021  ns/op
> >> hashCodeFast           2         4,073 ±      0,077  ns/op
> >> hashCodeFast           3         4,315 ±      0,032  ns/op
> >> hashCodeFast           5         5,470 ±      0,074  ns/op
> >> hashCodeFast          10         6,904 ±      0,060  ns/op
> >> hashCodeFast          30         9,102 ±      0,173  ns/op
> >> hashCodeFast         100        10,093 ±      0,069  ns/op
> >> hashCodeFast        1000        14,129 ±      0,074  ns/op
> >> hashCodeFast       10000        17,028 ±      0,249  ns/op
> >> hashCodeFast      100000        20,795 ±      0,194  ns/op
> >> hashCodeFast     1000000        23,622 ±      0,264  ns/op
> >>
> >> Compared to Zheka's implementation:
> >>
> >> Benchmark         (size)         Score        Error  Units
> >> hashCodeZheka          0         2,584 ±      0,024  ns/op
> >> hashCodeZheka          1         2,868 ±      0,022  ns/op
> >> hashCodeZheka          2         3,730 ±      0,030  ns/op
> >> hashCodeZheka          3         4,323 ±      0,027  ns/op
> >> hashCodeZheka          5         5,285 ±      0,037  ns/op
> >> hashCodeZheka         10         8,254 ±      0,057  ns/op
> >> hashCodeZheka         30        24,793 ±      0,218  ns/op
> >> hashCodeZheka        100        89,017 ±      0,764  ns/op
> >> hashCodeZheka       1000       923,792 ±     28,194  ns/op
> >> hashCodeZheka      10000      9157,411 ±     98,902  ns/op
> >> hashCodeZheka     100000     91705,599 ±    689,299  ns/op
> >> hashCodeZheka    1000000    919723,545 ±  13092,935  ns/op
> >>
> >> So results are quite similar for one-digit counts, but we start
> >> winning from n = 10 and after that logarithmic algorithm really rocks.
> >>
> >> I can file an issue and create a webrev, but I still need a sponsor
> >> and review for such change. Martin, can you help with this?
> >>
> >> With best regards,
> >> Tagir Valeev.
> >> On Tue, Nov 27, 2018 at 5:49 PM Martin Buchholz <martinrb at google.com>
> >> wrote:
> >>> I agree!
> >>>
> >>> (but don't have time ...)
> >>>
> >>> On Sun, Nov 25, 2018 at 9:01 PM, Zheka Kozlov <orionllmain at gmail.com>
> >> wrote:
> >>>> Currently, CopiesList.hashCode() is inherited from AbstractList which:
> >>>>
> >>>>     - calls hashCode() for each element,
> >>>>     - creates a new Iterator every time.
> >>>>
> >>>> However, for Collections.nCopies():
> >>>>
> >>>>     - All elements are the same. So hashCode() can be called only
> once.
> >>>>     - An Iterator is unnecessary.
> >>>>
> >>>> So, I propose overridding hashCode() implementation for CopiesList:
> >>>>
> >>>> @Override
> >>>> public int hashCode() {
> >>>>      int hashCode = 1;
> >>>>      final int elementHashCode = (element == null) ? 0 :
> >> element.hashCode();
> >>>>      for (int i = 0; i < n; i++) {
> >>>>          hashCode = 31*hashCode + elementHashCode;
> >>>>      }
> >>>>      return hashCode;
> >>>> }
> >>>>
> >>>> Benchmark:
> >>>> List<List<String>> list = Collections.nCopies(10_000, new
> >>>> ArrayList<>(Collections.nCopies(1_000_000, "a")));
> >>>> long nano = System.nanoTime();
> >>>> System.out.println(list.hashCode());
> >>>> System.out.println((System.nanoTime() - nano) / 1_000_000);
> >>>>
> >>>> Result:
> >>>> Old version - ~12 seconds.
> >>>> New version - ~10 milliseconds.
> >>>>
>
> --
> With kind regards,
> Ivan Gerasimov
>
>