Faster HashMap implementation
Doug Lea
dl at cs.oswego.edu
Wed Jul 1 12:27:50 UTC 2009
While I'm posting in-progress stuff, here are some other notes on hash
maps:
First, some observations about usage (repeating some
already posted):
Via some dated but probably still valid dynamic measurements:
Ops: Approx 84% read, 15% add/modify, 1% remove
Sizes: Peak at 0, 1, 2, then mostly flat distribution across all others
Static guesses from frequencies of code forms, mainly via google code search
Key types
50-60% Strings
10-20% Numbers -- mainly Integer, then Long, then others
10-20% Identity-based (no override of equals/hashCode)
10-30% everything else
Reads might be around 70% get, 30% traverse?
get: At least 80% successful (no null checks on get)
Iterators: Probably less than 20% Entry, others about half key vs value
Puts: typically add, not modify
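As a rough illustration of those proportions, here is a hypothetical workload driver (not MapMicrobenchmark itself) that exercises a map with approximately the observed mix of ~84% reads, ~15% adds/modifies, and ~1% removes:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Hypothetical driver using roughly the observed op mix from the notes
// above (~84% read, ~15% add/modify, ~1% remove). The percentages come
// from the measurements; the driver itself is just a sketch.
public class OpMixDriver {
    public static void main(String[] args) {
        Map<Integer, Integer> map = new HashMap<Integer, Integer>();
        Random rnd = new Random(42);
        int reads = 0, writes = 0, removes = 0;
        for (int i = 0; i < 100000; i++) {
            int k = rnd.nextInt(1000);
            int op = rnd.nextInt(100);
            if (op < 84)      { map.get(k);    reads++;   } // ~84% read
            else if (op < 99) { map.put(k, i); writes++;  } // ~15% add/modify
            else              { map.remove(k); removes++; } // ~1% remove
        }
        System.out.println(reads > writes && writes > removes); // prints true
    }
}
```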
The tables below (from MapMicrobenchmark) show that cache misses
completely dominate performance for large maps. You'd like to both
push out the sizes for which they come into play, and lessen the
number of indirections (thus misses) even when they do. HashMapV2
mostly does the latter. Of course, for cache misses to dominate, you
also need good key hashing and collision strategies to cope with
possibly poor hashCode()'s.
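For context on the "cope with possibly poor hashCode()'s" point: jdk6's java.util.HashMap applies a supplemental bit-spreading hash before masking into its power-of-two table, so keys whose hashCodes differ only in high bits don't all collide. A sketch of that idea (not the HashMapV2 code):

```java
// Sketch of a supplemental bit-spreading hash in the style of jdk6's
// java.util.HashMap.hash(int), which defends a power-of-two table
// against hashCode() values that vary only in their high bits.
public class HashSpread {
    static int spread(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    static int indexFor(int h, int length) {
        return h & (length - 1); // length must be a power of two
    }

    public static void main(String[] args) {
        int len = 16;
        // Without spreading, keys differing only in high bits share bucket 0...
        System.out.println(indexFor(0x10000, len) == indexFor(0x20000, len)); // prints true
        // ...but the spread step separates them.
        System.out.println(
            indexFor(spread(0x10000), len) == indexFor(spread(0x20000), len)); // prints false
    }
}
```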
Times are approximately nanosecs per element/op. Here are runs on an
Intel x86 linux 2.6.24-23 with jdk6 -server.
Absolute times vary on other platforms but show the same patterns (*).
HashMap:
Type/Size:   36  144  576 2304 9216 36864 147456 589824
BigDecimal   40   37   43   47   57   174    267    317
BigInteger   39   34   38   39   48   187    247    288
String       37   34   38   42   54   149    227    253
Double       41   36   39   44   49    80    219    250
Float        40   38   41   44   48    95    216    267
Long         34   29   33   34   43    74    194    242
Integer      38   33   36   39   46    70    199    233
Object       36   33   35   39   47    64    215    275
average      38   34   37   41   49   111    223    265
HashMapV2:
Type/Size:   36  144  576 2304 9216 36864 147456 589824
BigDecimal   34   34   40   44   58   146    255    286
BigInteger   28   25   29   34   44   151    220    243
String       34   29   34   40   59   136    208    238
Double       37   32   32   44   46    55    180    192
Float        34   37   38   42   42    67    165    246
Long         26   21   22   29   32    40    123    147
Integer      29   25   30   37   46    60    137    157
Object       34   31   33   41   49    64    196    250
average      32   29   32   38   47    89    185    219
Here are some more details listed across operation categories (via an
updated MapCheck), using English words (Strings) from a big dictionary
as keys. Times are scaled differently than above; these runs are from
AMD-x86/solaris, also typical of other platforms:
                         HashMap : V2
Access Present         :     213 : 200
Add Absent             :     265 : 245
Modify Present         :     231 : 199
Remove Present         :     111 : 101
Search Absent          :     119 :  93
Traverse access entry  :     173 : 137
Traverse access key/val:     147 :  87
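My reading of the last two rows is that they distinguish the two common traversal idioms: iterating Map.Entry objects versus iterating keys and values separately. A sketch of both (the labels are my assumption about what MapCheck measures):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the two traversal styles the rows above appear to compare:
// entrySet iteration vs separate keySet/values iteration.
public class TraversalStyles {
    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<String, Integer>();
        map.put("alpha", 1);
        map.put("beta", 2);

        // "Traverse access entry": one pass touching key and value per Entry.
        long entrySum = 0;
        for (Map.Entry<String, Integer> e : map.entrySet())
            entrySum += e.getKey().length() + e.getValue();

        // "Traverse access key/val": separate key and value traversals.
        long keyLen = 0, valSum = 0;
        for (String k : map.keySet()) keyLen += k.length();
        for (int v : map.values())    valSum += v;

        System.out.println(entrySum == keyLen + valSum); // prints true
    }
}
```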
(*) At least on Solaris jdk6u14, adding the -XX:+AggressiveOpts option
also seems to enable the specialized Integer version of HashMap, which
improves performance on some tests but worsens on others. On all
platforms, the -XX:+AggressiveOpts switch seems to lead to more
run-to-run variation.
Here's a synopsis of MapMicrobenchmark pasted from its javadoc:
* A micro-benchmark with key types and operation mixes roughly
* corresponding to some real programs.
*
* The main results are a table of approximate nanoseconds per
* element-operation (averaged across get, put etc) for each type,
* across a range of map sizes.
*
* The program includes a bunch of microbenchmarking safeguards that
* might underestimate typical performance. For example, by using many
* different key types and exercising them in warmups it disables most
* dynamic type specialization. Some test classes, like Float and
* BigDecimal are included not because they are commonly used as keys,
* but because they can be problematic for some map implementations.
*
* By default, it creates and inserts in order dense numerical keys
* and searches for keys in scrambled order. Use "r" as second arg to
* instead use random numerical values, and "s" as third arg to search
* in insertion order.
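The default regime described above can be sketched as follows (this is my own illustration, not code from MapMicrobenchmark): insert dense keys 0..n-1 in order, then probe in a scrambled order so lookups don't benefit from insertion locality.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Sketch of the default key regime: dense numerical keys inserted in
// order, then searched in scrambled order.
public class DenseKeyProbe {
    public static void main(String[] args) {
        int n = 1000;
        Map<Integer, Integer> map = new HashMap<Integer, Integer>();
        List<Integer> keys = new ArrayList<Integer>();
        for (int i = 0; i < n; i++) {          // dense keys, insertion order
            map.put(i, i);
            keys.add(i);
        }
        Collections.shuffle(keys, new Random(17)); // scrambled search order
        long sum = 0;
        for (int k : keys)
            sum += map.get(k);                 // every probe succeeds
        System.out.println(sum == (long) n * (n - 1) / 2); // prints true
    }
}
```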