Possible HashMap update

Wed Jul 3 22:55:22 UTC 2013

A few weeks ago, in the course of doing a (hopefully-last) reworking of
JDK8 ConcurrentHashMap (CHM) internals based on experience with
preview versions, I noticed that the balanced tree scheme now in place
as a fallback in the case of usages with many keys all with the same
hashCode could also be profitably used in other cases: CHM (like
HashMap) previously performed some bit-scrambling of user hashCodes to
make it less likely that user hashCodes that were non-identical but
highly-non-random will collide in the same bins (thus with O(n) vs the
expected O(1) performance).  This bit-scrambling is an
anti-optimization in many usages that do not contain the kinds of
systematic hashCode bias that most hurt performance.  While it is
reasonably fast, it still places computation where you do not
want it: in front of the likely cache-miss to access an entry. Plus,
bit-scrambling is only statistically likely to help.  It is still
possible to encounter hashCodes that systematically collide. It would
be nice to provide O(log N) guarantees for these cases as well.

So, CHM now omits bit-scrambling (well, almost -- just a single xor to
avoid losing impact of top bits), and instead rolls over into using
trees for bins that contain a statistically unlikely number of keys,
and rolling back into simple bin form if/when they don't due to
deletions or resizings. ("Statistically unlikely" is a surprisingly
low number. More than 8 nodes are expected only about once per ten
million under perfectly random keys; this is also often (depending on
cost of equals() methods, additional dispatch overhead, etc), around
the break-even point for using trees vs lists).  The updated tree
mechanics now provide O(log N) worst-case performance in either of two
cases: (colliding non-identical-hashCodes), and (identical hashCodes
but mutually Comparable keys). And does so while also speeding up by a
little those typical usages that do not encounter systematic
collisions.  Also, there is no sense at all in using any form of
hashSeed in this scheme. It basically behaves the same whether or not
you use one.

The overall impact appears to be a net win.  Non-heavily-colliding
cases are a bit faster. Some colliding cases are a little slower and
use more memory (tree-based nodes are bigger than plain ones), but
have much better worst (and typical) case bounds unless people use
crazily-bad keys that all have same hashCodes but are never equal or
comparable to others, in which case treeification is wasted effort
(but only by a constant factor of about 2 in both time and space).

This seemed to be enough of an improvement in terms of both worst-case
and expected performance to try applying it to TreeBin-enabled
HashMap. So I gave it a try. To enable bidirectional tree<->plain node
conversion, I used the subclassing strategy mentioned during
discussion of the initial balanced tree version. This in turn requires
a completely different within-package internal API to allow
LinkedHashMap to continue to subclass HashMap.  A couple of internal
constructions exist solely to allow the same source code to be
identical in HashMap and ConcurrenHashMap (but not the same compiled
code, since the types they operate on are and must be unrelated in the
two classes, and because they live in different packages, there's no
nice way to express their commonalities without exposing.)

This note serves as a pre-proposal request for comment, as well a
request for anyone interested in trying it out, especially in any
weird/unusual use cases, to report experiences before seriously
considering it as a replacement. To simplify logistics, I checked in the
code in a workspace directory in jsr166:

http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/dl/java/util/HashMap.java?view=log

http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/dl/java/util/LinkedHashMap.java?view=log