RFR (XS) CR 8058643: (str) Re-examine hashCode implementation

Thu Sep 25 16:58:21 UTC 2014

On 09/25/2014 06:00 PM, Vitaly Davidovich wrote:
> Note that compiler does not use multiplication here.  This is strength 
> reduced into a bit shift followed by a subtraction (i.e. h * 31 = h * 
> 32 (i.e. sal 5) - h).

I've noticed that in the disassembly files provided by Aleksey. Good to 
know, so one is not bothered to use bit shifts in situations where 
multiplication / division look clearer in source code.

>   The point about allowing execution units to run in parallel (given 
> the break in data dependency) is very valid though.
>
> For kicks and giggles, I looked at what gcc, clang, and icc generate 
> for similar C code at O2 and O3, and did not observe them making any 
> transformations to break the data dependency.

So HotSpot can break the ice here ;-)

>
> While it would be cool if compiler could do this transformation on its 
> own, as you say Peter, the scope of where this transform is applicable 
> is somewhat narrow(?).  If speeding up the non-cached 
> String.hashCode() or Arrays.hashCode() methods is desired, perhaps 
> intrinsics could be implemented (utility for String.hashCode seems low 
> though).  In that case, one could then do this transformation manually 
> and also vectorize it for long inputs.

I see there are multiple options - each with it's own benefits and 
drawbacks. Hard to choose.

Regards, Peter

>
> On Thu, Sep 25, 2014 at 7:09 AM, Peter Levart <peter.levart at gmail.com 
> <mailto:peter.levart at gmail.com>> wrote:
>
>     On 09/25/2014 09:40 AM, Aleksey Shipilev wrote:
>
>         Hi Peter,
>
>         On 09/25/2014 02:46 AM, Peter Levart wrote:
>
>                     http://cr.openjdk.java.net/~plevart/misc/StringHash/HashBench.java
>                     <http://cr.openjdk.java.net/%7Eplevart/misc/StringHash/HashBench.java>
>
>         Interesting.
>
>         I have to say it once again:
>           a) It is *an error* to use static finals as the benchmark
>         input. They
>         are perfectly constant-foldable in way too many cases. Break
>         this habit,
>         please.
>
>
>     Hi Aleksey,
>
>     The "constant" in this example is only a reference to the char[]
>     array. It's content is not. In String, this is a final instance
>     field, which behaves similarly inside an instance method (usually
>     it is read only once per method invocation).
>
>           b) Explicit Blackholes are not needed, and just returning
>         "int" from
>         @Benchmark method helps readability a lot. Please break this
>         habit as
>         well. Having readable and maintainable benchmarks is a key.
>
>
>     Ok, here's a modified benchmark:
>
>     http://cr.openjdk.java.net/~plevart/misc/StringHash/HashBench2.java <http://cr.openjdk.java.net/%7Eplevart/misc/StringHash/HashBench2.java>
>
>     Which behaves similarly.
>
>     Here are it's results:
>
>      * Results on JDK9, Linux, i7-2600K CPU, JMH args: -f 1 -wi 5 -i 8
>     -gc true
>      *
>      * Benchmark                     Mode  Samples  Score  Score
>     error  Units
>      * j.t.HashBench2._hashCode     thrpt        8 8308858.217
>     <tel:8308858.217> 353019.084  ops/s
>      * j.t.HashBench2.hashCode0     thrpt        8   8207337.729
>     217048.634  ops/s
>      * j.t.HashBench2.hashCode1     thrpt        8  13359572.359
>     345736.675  ops/s
>      * j.t.HashBench2.hashCode2     thrpt        8  15310621.202
>     238369.017  ops/s
>      * j.t.HashBench2.hashCode3     thrpt        8 17637944.829
>     <tel:17637944.829> 232155.847  ops/s
>      * j.t.HashBench2.hashCode3i    thrpt        8 17724181.444
>     <tel:17724181.444> 509913.288  ops/s
>      * j.t.HashBench2.hashCode3x    thrpt        8   8344128.432
>     159508.813  ops/s
>      * j.t.HashBench2.hashCode4     thrpt        8  16526850.489
>     969549.448  ops/s
>      * j.t.HashBench2.hashCode5     thrpt        8  17567765.554
>     917934.885  ops/s
>      * j.t.HashBench2.hashCode6     thrpt        8 17705074.332
>     <tel:17705074.332> 419405.652  ops/s
>      * j.t.HashBench2.hashCode7     thrpt        8  18805633.563
>     209181.299  ops/s
>      * j.t.HashBench2.hashCode8     thrpt        8  18300123.201
>     376681.550  ops/s
>
>     It would be interesting to see how it behaves on different CPUs.
>
>
>                 This is really great!
>
>                 Couldn't this be a tweak via HotSpot, instead
>                 uglifying and bloating
>                 the Java and hence the byte code?
>
>         +1
>
>             This is for HotSpot compiler guys to answer. Theoretically
>             I think it is
>             possible. But it would have to be tailored to the very
>             specific use case
>             and I don't know if such specific transformation would
>             have wide-enough
>             applicability. If it would only speed-up String.hashCode
>             and very
>             similar loops, it is less trouble to do that by hand in
>             one or few
>             places...
>
>         I would think this happens in user-specified hashCode() over
>         arrays.
>         IDEs would routinely inject the loop like that or delegate to
>         Arrays.hashCode, that does the same loop.
>
>
>     Arrays.hasCode() can be "improved" this way too.
>
>
>         In other words, I would like to see this fixed on compiler
>         side. This
>         seems to be the strength-reduction playing tricks with loop
>         unrolling,
>         I'll submit a bug shortly.
>
>
>     As I said, I don't think it has anything to do with loop
>     unrolling. The transformation I applied in hashCode1,2,3, ... 8
>     produces code that executes 2, 3, 4, ... 9 independent
>     multiplications in each chunk, which allow CPU's pipeline to
>     execute them in parallel. I had to manually unroll the loop a bit
>     just to achieve this transformation. But my manual unrolling does
>     not bring the speed-up per se. The parallelization of
>     multiplication does. This can be seen by observing the score of
>     hashCode3x benchmark, which has the same loop structure as
>     hashCode3, but performs multiplications in a way where each of
>     them depends on the result of a previous one, which prevents the
>     CPU from parallelizing them.
>
>     This is not to say that such transformation couldn't be done on
>     the JIT side. I just have a feeling that such transformation won't
>     be widely used because it is very specific. It can only be used
>     within integer arithmetic of the homogeneous width (since it
>     changes the order of operations applied, the final result depends
>     on which width is used when overflow happens). Floating arithmetic
>     is equally unsiutable for such transformations that change order
>     of operations. It can only help when the sequence of operations
>     that are dependent on one another are changed into a sequence of
>     independent operations and those operations have a weight that
>     matters (such as multiplication).
>
>     Regards, Peter
>
>
>         -Aleksey.
>
>
>
>