RFR (XS) CR 8058643: (str) Re-examine hashCode implementation

Peter Levart peter.levart at gmail.com
Thu Sep 25 11:09:13 UTC 2014


On 09/25/2014 09:40 AM, Aleksey Shipilev wrote:
> Hi Peter,
>
> On 09/25/2014 02:46 AM, Peter Levart wrote:
>>>> http://cr.openjdk.java.net/~plevart/misc/StringHash/HashBench.java
> Interesting.
>
> I have to say it once again:
>   a) It is *an error* to use static finals as the benchmark input. They
> are perfectly constant-foldable in way too many cases. Break this habit,
> please.

Hi Aleksey,

The "constant" in this example is only a reference to the char[] array. 
It's content is not. In String, this is a final instance field, which 
behaves similarly inside an instance method (usually it is read only 
once per method invocation).
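
For illustration, a sketch of the setup (my own example, not the actual
HashBench.java code): the reference held by the static final field may be
treated as a compile-time constant, but reads of the array's elements
remain ordinary memory loads:

// sketch only, not the actual benchmark code
static final char[] CHARS = "some test string".toCharArray();

static int walk() {
    int s = 0;
    for (char c : CHARS) {   // CHARS itself is foldable; its element values are not
        s += c;
    }
    return s;
}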

>   b) Explicit Blackholes are not needed, and just returning "int" from
> @Benchmark method helps readability a lot. Please break this habit as
> well. Having readable and maintainable benchmarks is a key.

Ok, here's a modified benchmark:

http://cr.openjdk.java.net/~plevart/misc/StringHash/HashBench2.java

It behaves similarly.

Here are its results:

  * Results on JDK9, Linux, i7-2600K CPU, JMH args: -f 1 -wi 5 -i 8 -gc true
  *
  * Benchmark                     Mode  Samples         Score   Score error  Units
  * j.t.HashBench2._hashCode     thrpt        8   8308858.217    353019.084  ops/s
  * j.t.HashBench2.hashCode0     thrpt        8   8207337.729    217048.634  ops/s
  * j.t.HashBench2.hashCode1     thrpt        8  13359572.359    345736.675  ops/s
  * j.t.HashBench2.hashCode2     thrpt        8  15310621.202    238369.017  ops/s
  * j.t.HashBench2.hashCode3     thrpt        8  17637944.829    232155.847  ops/s
  * j.t.HashBench2.hashCode3i    thrpt        8  17724181.444    509913.288  ops/s
  * j.t.HashBench2.hashCode3x    thrpt        8   8344128.432    159508.813  ops/s
  * j.t.HashBench2.hashCode4     thrpt        8  16526850.489    969549.448  ops/s
  * j.t.HashBench2.hashCode5     thrpt        8  17567765.554    917934.885  ops/s
  * j.t.HashBench2.hashCode6     thrpt        8  17705074.332    419405.652  ops/s
  * j.t.HashBench2.hashCode7     thrpt        8  18805633.563    209181.299  ops/s
  * j.t.HashBench2.hashCode8     thrpt        8  18300123.201    376681.550  ops/s

It would be interesting to see how it behaves on different CPUs.

>
>>> This is really great!
>>>
>>> Couldn't this be a tweak via HotSpot, instead uglifying and bloating
>>> the Java and hence the byte code?
> +1
>
>> This is for HotSpot compiler guys to answer. Theoretically I think it is
>> possible. But it would have to be tailored to the very specific use case
>> and I don't know if such specific transformation would have wide-enough
>> applicability. If it would only speed-up String.hashCode and very
>> similar loops, it is less trouble to do that by hand in one or few
>> places...
> I would think this happens in user-specified hashCode() over arrays.
> IDEs would routinely inject the loop like that or delegate to
> Arrays.hashCode, that does the same loop.

Arrays.hashCode() can be "improved" this way too.
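
For reference, the loop in question has the familiar shape of an
IDE-generated hashCode() or Arrays.hashCode(int[]) body (the method name
below is mine): a single chain of dependent multiplications, one per
element, so they cannot overlap in the pipeline:

static int polynomialHash(int[] a) {
    int result = 1;
    for (int element : a) {
        result = 31 * result + element;   // each step waits for the previous one
    }
    return result;
}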

>
> In other words, I would like to see this fixed on compiler side. This
> seems to be the strength-reduction playing tricks with loop unrolling,
> I'll submit a bug shortly.

As I said, I don't think it has anything to do with loop unrolling. The 
transformation I applied in hashCode1, 2, 3, ... 8 produces code that 
executes 2, 3, 4, ... 9 independent multiplications in each chunk, which 
allows the CPU's pipeline to execute them in parallel. I had to unroll 
the loop manually just to achieve this transformation, but the manual 
unrolling does not bring the speed-up per se; the parallelization of the 
multiplications does. This can be seen from the score of the hashCode3x 
benchmark, which has the same loop structure as hashCode3 but performs 
the multiplications in a way where each one depends on the result of the 
previous one, which prevents the CPU from overlapping them.
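
Roughly, the idea looks like this (a sketch of a hashCode3-style variant,
not the actual HashBench2 code): the chars are processed four at a time,
and within each chunk the multiplications do not depend on each other.

// 31^2 = 961, 31^3 = 29791, 31^4 = 923521
static int hashCodeChunked(char[] value) {
    int h = 0;
    int i = 0;
    int limit = value.length & ~3;    // round down to a multiple of 4
    for (; i < limit; i += 4) {
        h = 923521 * h                // the only cross-iteration dependency
          +  29791 * value[i]         // these products depend neither on h
          +    961 * value[i + 1]     // nor on each other, so the CPU can
          +     31 * value[i + 2]     // compute them in parallel
          +          value[i + 3];
    }
    for (; i < value.length; i++) {   // remaining 0..3 chars
        h = 31 * h + value[i];
    }
    return h;
}

A hashCode3x-style variant would presumably unroll the body the same way
but keep computing h = 31 * h + value[i] step by step, so each
multiplication still waits for the previous one, which is why it scores
like the plain loop.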

This is not to say that such a transformation couldn't be done on the JIT 
side. I just have a feeling that it wouldn't be widely applicable, 
because it is very specific. It can only be used with integer arithmetic 
of a single, homogeneous width: since it changes the order in which 
operations are applied, the final result depends on the width at which 
overflow wraps. Floating-point arithmetic is equally unsuitable for 
transformations that change the order of operations. It only helps when 
a sequence of operations that depend on one another can be turned into a 
sequence of independent operations, and when those operations have a 
cost that matters (such as multiplication).
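
The integer case works precisely because int arithmetic wraps modulo
2^32, so distributing and regrouping the multiplications is exact. A
tiny self-contained check of the two-step regrouping (names are mine):

import java.util.concurrent.ThreadLocalRandom;

// Checks that 31*(31*h + c0) + c1 == 961*h + 31*c0 + c1 for arbitrary
// ints: with wrap-around arithmetic the regrouping is exact, which is
// what makes the chunked hash legal. The analogous regrouping on double
// would not be exact in general, because rounding depends on the order
// of operations.
public class ReassociationCheck {
    public static void main(String[] args) {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        for (int i = 0; i < 1_000_000; i++) {
            int h = rnd.nextInt(), c0 = rnd.nextInt(), c1 = rnd.nextInt();
            int chained = 31 * (31 * h + c0) + c1;   // dependent form
            int grouped = 961 * h + 31 * c0 + c1;    // reassociated form
            if (chained != grouped) {
                throw new AssertionError("mismatch");
            }
        }
        System.out.println("regrouping is exact for 32-bit ints");
    }
}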

Regards, Peter

>
> -Aleksey.
>
>



