HashMap - Hybrid approach

Alex Yakovlev alex14n at gmail.com
Sat Jun 13 23:47:52 UTC 2009


On Fri, Jun 12, 2009 at 6:23 PM, Doug Lea<dl at cs.oswego.edu> wrote:

> This was fun to figure out. Performance on some of our
> tests involving get() is sensitive to the serial
> correlation of slot indices for the keys used across
> successive calls.

Are there any tests that have get() order the same as put() order?
Data in my map is stored almost in insertion order :)

> As another quick experiment, I just tried a hacked
> version of IdentityHashMap carrying hashes
> array, and did see around a 20% improvement in some
> tests for get() versus version with index-dependence.
> So some variant of direct access still seems likely to
> be worth pursuing.

Sure. But checking stored hashcode always have limited time,
and user's equals method is unpredictable and can be slow.
We could accumulate nanoseconds statistics for equals calls
and choose either to check hashcode first or try direct key.equals,
don't know. In my observations it depends on locality of user's data,
if equals operates only local properties - direct equals is faster,
and even String with it's remove char[] is already slower than
checking saved hashcode first. Another way is to check couple
of keys contents with reflection, if all contents are just primitive
fields - it's save to call direct equals, if not - check hashcode first.
But user can access some static classes, have some heave
computations, etc.

> I'll try to put together a more serious version along
> the lines of my suggestions, although maybe not for
> a few days.

Waiting with a lot of interest.

Meanwhile I've tried an open map approach with my smart bits,
like marking 'end-of-list', putting first hash bin element always
on it's place, etc. In some tests it gives significant performance
improvement due to sequental access in all arrays - max 2 cache
misses in get. It certainly needs another bit distribution function
and load factor management. But on some data it gets piled up
very heavilly with awful performance.

Maybe this can solved with some bit flags, not only 'end of list' but
also choosing between +1 step and some large step - like
those suggested by Ismael (in python). Or even a bit flag
to look into some external overflow storage.

Now I've implemented a bit of this approach in my map -
when next element in hash table is empty it is used
by its neighbours for faster access. It gives some
improvement on data with bad hashcodes distribution
(when there are many second lookups).

But source code becomes more and more complex
and I'm afraid too much conditions and branches
could harm performance.


More information about the core-libs-dev mailing list