Benefit from computing String Hash at compile time?

Fri Dec 18 09:38:41 PST 2009

I believe the String hashcode computation could be performed eagerly,
piggy-backing on the (already significant complex) copying and/or decoding
code used by its constructors. For every character, first to last,
produced/added to the this.value array, we also update the hashcode: h =
31*h + character => trivial enough to be a basically "free" addition to the
existing loops.

Some details:
- For large strings, we gain a lot because we don't have to visit all the
characters again (posibly when they are not anymore in the CPU cache) when
the hashcode is first computed (perhaps much later than construction).
- The current algorithms reuses the hashcode 0 to mean "not computed", so it
will recopute everything again for strings that just happen to produce 0
with the current formula. Eager computation avoids this risk, however small.
- Some constructors (remarkably for substring and cloning) rely on
Arrays.copyOfRange(), which implementation is more efficient than any Java
loop (I guess it's a HotSpot intrinsic with optimization for alignment
etc.). In that case, using an explicit loop so we can smuggle the hashcode
calculation inside it, will probably have a measurable disadvantage. But
this disadvantage is only for construction (and then only for large
strings); for strings that are ever hashed, the net saving will always be
still positive.
- Eager computation allows to declare hash as final, which may have some
performance benefic, e.g. for caching in registers.
- Eager computation allows hashCode() to be a trivial getter without any
branch.
- The hashcode function can be factored into a private static method, e.g.
int incHash(int currhash), so this tiny algorthm must not be repeated in a
dozen constructors; that method will be trivial to inline so there's no cost
either for compiled code.
- Admittedly, for interpreted code there are higher disadvantages in the
constructor; but then, I expect most String constructors to appear as the
first methods to be optimized by the JIT - they are just BURNING "hot".
- If we have eager computation, I think it's not worthy caching the hashcode
of literal Strings in the Constant Pool; this requires changing the CP spec
and the classfiles will be 4 bytes bigger for every String literal - a lot
of extra bytes considering how many Strings we typically have (including all
Strings from CP symbols). Still, javac could use the "well-known" hashcode
for special needs like strings-in-switch; other optimizations could be used
more aggressively (precomputed hashtables for huge static symbol tables...).
Let's face it, the String hashcode algorithm has changed in the early days,
but it will never change again.

A+
Osvaldo

2009/12/17 James Arlow <jimmyuniversal at yahoo.com>

> I'm not exactly up to date, but even reading entries from December, there
> are talks about computing the string hash at compile time.  This seems like
> a bad idea to me when looking towards future compatibility.
>
> For the sake of posterity, it would make more sense to store the strings as
> literals in the class file, and then compute the hash during the class
> loading process.  The amount of processing at run-time would be negligible,
> and it would eliminate the possibility of errors creeping up from an
> "improved" or non-standard hash function.
>
> While the "improved" case seems unlikely, it would prevent whole sections
> of code from breaking simply because a third party JVM introduced an
> accidental error into the hash process.
>
> I think everyone can agree its best to not risk cutting off open options
> unless there is a critical performance penalty that would be addressed by
> doing so, so for a function that is called once per switch option and only
> when the class is loaded, I think its safe to forget about compile time
> hashes altogether.
>
> If people are really worried about performance, then the best option would
> be to offer two ways to compile the class, one with string literals, for
> compatibility, and one with Java standard hash values, for performance.
>
>
>
>
>
>