HashMap - Hybrid approach
Doug Lea
dl at cs.oswego.edu
Wed Jun 10 12:59:32 UTC 2009
Alex Yakovlev wrote:
> To have single-indirect access we need to read
> key/value objects with the first array read.
Backing up a little, there are five main forces
in hash table design. These are fresh in my mind because
I've been trying to put together CustomConcurrentHashMap,
the last planned JSR166y JDK7 component, that handles
weak/soft refs, custom equality comparisons, eviction, etc.
(Concurrency adds further forces omitted here.)
1. Memory stalls
- Cache misses due to randomization of indexing can easily
cost hundreds of cycles. This is made worse if there are
multiple levels of indirection each with poor locality.
(Multiple levels with good locality are much less of a
problem, which is why ConcurrentHashMap is pretty fast
despite adding one.) And this is made even worse when
there are too many computations before the processor can
even issue a prefetch, which is one reason why bit-spreading
functions require careful speed vs. uniformity tradeoff analysis.
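(As an aside, and only as an illustration of that tradeoff: the two
spreaders below are loosely modeled on functions various JDK HashMaps
have used; neither is part of the proposal here. The cheap one leaves
fewer cycles between hashCode and the index/prefetch; the heavier one
mixes bits more uniformly.)

    final class SpreaderSketch {
        // cheap: one shift/xor, in the spirit of later JDK spreaders
        static int cheapSpread(int h) {
            return h ^ (h >>> 16);
        }
        // heavier: more mixing, in the spirit of the JDK 6/7
        // HashMap.hash, but more work before the table index is known
        static int heavierSpread(int h) {
            h ^= (h >>> 20) ^ (h >>> 12);
            return h ^ (h >>> 7) ^ (h >>> 4);
        }
    }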
2. Method dispatching
- Both Object.equals and Object.hashCode are often
"megamorphic", i.e., cannot be inlined because they
are overridden in too many classes. In some usages,
this accounts for the main performance bottleneck.
So caching hashcodes, which usually also allows skipping
calls to equals for different objects, has a noticeable effect.
Also, it is sometimes possible to bias the structure of code
to encourage compilers to do more per-call inlining,
which helps a lot. Note that these effects can be difficult
to test. Test programs that include multiple key types
(like our DenseMapMicrobenchmark) are a little more
accurate indicators of megamorphism effects, but
even here, these kinds of microbenchmarks will often
cause compilers to do deeper inlining than they would in
most actual usages.
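To make the hash-caching point concrete (a sketch only, along the
lines of the existing chained node scan; the names are invented):

    final class NodeScanSketch {
        static final class Node<K,V> {
            final int hash;        // cached (spread) hashCode of key
            final K key;
            V value;
            Node<K,V> next;
            Node(int hash, K key, V value, Node<K,V> next) {
                this.hash = hash; this.key = key;
                this.value = value; this.next = next;
            }
        }

        // assumes a non-null key, as in HashMap's internal scans
        static <K,V> V scan(Node<K,V> first, int hash, Object key) {
            for (Node<K,V> e = first; e != null; e = e.next) {
                // plain int compare first; equals only on likely matches
                if (e.hash == hash && (key == e.key || key.equals(e.key)))
                    return e.value;
            }
            return null;
        }
    }

Because e.hash == hash is a plain int compare, most non-matching
entries are rejected without ever reaching the possibly megamorphic
equals call.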
3. Collision handling
- Collisions lead to some sort of slower scanning. Especially
given the highly variable quality of user-supplied
hashCodes, you'd like to localize the impact of collisions
to avoid per-map slowdowns. Linear probing and related techniques
don't fare very well on these grounds.
4. Biasing for expected operation mixes
- I once checked that collection (and in particular HashMap/Hashtable)
usage approximately obeys the usual 4:1 read vs write ratio
seen in just about every aspect of computing. In fact it seems
a bit more extreme -- something like 84% read (get() or iterate),
15% add (put() and variants), and 1% delete (remove, clear). So
making reads fast is clearly the highest priority.
5. Footprint
- Many Java frameworks (especially EE) can create hundreds of
thousands of collections, many of them very small, but
also some huge ones. This leads to second-order bloat
effects, like high TLB miss rates, and much slower GC. For
some details see papers by Gary Sevitsky and colleagues at
http://domino.research.ibm.com/comm/research_people.nsf/pages/sevitsky.pubs.html
This is a difficult issue with hash tables since the whole
idea is to preallocate space to allow random indexing. And
because resizing is not cheap, you don't want to be overly
incremental about it.
>
> And we cannot store collisions in pre-allocated
> arrays - to read it from the same array with keys/values
> it needs to be an object. Thus writing speed on
> my original map will always be faster.
>
Right. My thought here is to use an alternative structure
for collisions. So, given capacity C, you'd have:
elements[] -- 2*C of key/value
hashes[] -- 1*C of hash + bookkeeping bits
collisions -- a dynamic bitwise trie or somesuch
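Just to show the shape of a get on top of that layout (everything
below is invented for illustration -- the field names, the spreader,
the collision path -- not a worked-out design):

    class HybridSketch<K,V> {
        Object[] elements;   // 2*C slots: key at 2*i, value at 2*i+1
        int[] hashes;        // C slots: cached hash (bookkeeping bits ignored here)
        // collisions: some dynamic structure, e.g. a bitwise trie; elided

        static int spread(int h) { return h ^ (h >>> 16); }

        @SuppressWarnings("unchecked")
        V get(Object key) {
            int h = spread(key.hashCode());
            int i = h & (hashes.length - 1);
            // the two loads below index independent arrays, so both
            // can be issued (and prefetched) without waiting on each other
            int hv = hashes[i];
            Object k = elements[2 * i];
            if (hv == h && (k == key || key.equals(k)))
                return (V) elements[2 * i + 1];
            return getFromCollisions(h, key);   // collision path, elided
        }

        V getFromCollisions(int h, Object key) { return null; }  // placeholder
    }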
Compared to current HashMap...
+ It allows parallel prefetches on get of both
elements[2*h] and hashes[h], so it should reduce memory stalls.
- It triples the unloaded footprint. You can get most of
this back in default-constructor usages by not allocating
arrays, but instead using only the collisions tree until size
reaches, say, 4, and then using a larger (say 64) initial capacity
once you know you aren't a tiny map (see the sketch after this list).
= Collisions may or may not be a little more expensive to resolve
- Iterators are likely slower and definitely a lot messier
- The trie nodes needed here would require 2 or 3 more
pointer fields than do list-based nodes, so the incremental
footprint grows faster.
- Resizing is slower and more complicated since the collisions
may or may not be moved to the main table.
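To make the small-map dodge above concrete (again only a sketch; the
helper names and stubs are made up, and the thresholds 4 and 64 are
just the numbers mentioned above):

    class LazyAllocSketch<K,V> {
        Object[] elements;   // null until the map is known not to be tiny
        int[] hashes;
        int size;

        void put(K key, V value) {
            int hc = key.hashCode();
            int h = hc ^ (hc >>> 16);
            if (elements == null) {
                if (size < 4) {                  // still tiny: collisions only
                    putIntoCollisions(h, key, value);
                    size++;
                    return;
                }
                allocateArrays(64);              // not tiny after all
                moveCollisionsIntoArrays();
            }
            putIntoArrays(h, key, value);
            size++;
        }

        // placeholders for the elided machinery
        void putIntoCollisions(int h, K key, V value) { }
        void allocateArrays(int capacity) {
            elements = new Object[2 * capacity];
            hashes = new int[capacity];
        }
        void moveCollisionsIntoArrays() { }
        void putIntoArrays(int h, K key, V value) { }
    }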
Without fleshing this out, I'm not sure whether this hits the right
tradeoffs.
-Doug