RFR: 8186958: Need method to create pre-sized HashMap [v7]

Wed Mar 30 00:16:41 UTC 2022

On Thu, 24 Mar 2022 17:43:31 GMT, XenoAmess <duke at openjdk.java.net> wrote:

>> 8186958: Need method to create pre-sized HashMap
>
> XenoAmess has updated the pull request incrementally with two additional commits since the last revision:
> 
>  - update jmh
>  - refine javadoc; refine implement when expectedSize < 0

OK, finally got some time to look at this. Here's a rewrite of the spec words, at least for HashMap::newHashMap. If this settles down, I'll write the CSR for this and LHM and WHM.

    /**
     * Creates a new, empty HashMap suitable for the expected number of mappings.
     * The returned map uses the default load factor of 0.75, and its initial capacity is
     * generally large enough so that the expected number of mappings can be added
     * without resizing the map.
     *
     * @param numMappings the expected number of mappings
     * @param <K>         the type of keys maintained by this map
     * @param <V>         the type of mapped values
     * @return the newly created map
     * @throws IllegalArgumentException if numMappings is negative
     * @since 19
     */

The original wording was taken from CHM, which generally is a reasonable thing to do. Unfortunately, it's wrong. :-) "Table size" isn't accurate; HashMap uses "capacity" as the term for the number of buckets (== length of the internal table array). "Size" means "number of mappings" so its use of "table size" confuses these two concepts. The CHM wording also uses "elements" which applies to linear collections; the things inside a map are usually referred to as "mappings" or "entries". (I guess we should fix up CHM at some point too.)

While "expectedSize" isn't inaccurate, it's not tied to the main text, so I've renamed it to numMappings.

There are a couple other javadoc style rules operating here. The first sentence is generally a sentence fragment that is short and descriptive, as it will be pulled into a summary table. (It's often written as if it were a sentence that begins "This method..." but those words are elided.) Details are in the rest of the first paragraph. The text for `@param`, `@return`, and `@throws` are generally also sentence fragments, and they have no initial capital letter nor trailing period.

--

On performance and benchmarking: this is a distraction from the main point of this effort, which is to add an API that allows callers a correct and convenient way to create a properly-sized HashMap. Any work spent on optimizing performance is almost certainly wasted.

First, the benchmark: it's entirely unclear what this is measuring. It's performing the operation 2^31 times, but it sends the result to a black hole so that the JIT doesn't eliminate the computation. One of the actual results is 0.170 ops/sec. This includes both the operation and the black hole, so we don't actually have any idea what that result represents. _Maybe_ it's possible to infer some idea of relative performance of the different operations by comparing the results. It's certainly plausible that the integer code is faster than the float or double code. But the benchmark doesn't tell us how long the actual computation takes.

Second, how significant is the cost of the computation? I'll assert that it's insignificant. The table length is computed once at HashMap creation time, and it's amortized over the addition of all the initial entries and subsequent queries and updates to the HashMap. Any of the computations (whether integer or floating point) are a handful of nanoseconds. This will be swamped by the first hashCode computation that causes a cache miss.

Third: I'll stipulate that the integer computation is probably a few ns faster than the floating-point computation. But the computation is entirely non-obvious. To make up for it being non-obvious, there's a big comment that explains it all. It's not worth doing something that increases performance by an insignificant amount that also requires a big explanation.

Finally, note that most of the callers are already doing a floating-point computation to compute the desired capacity, and it doesn't seem to be a problem.

Sorry, you probably spent a bunch of time on this already, but trying to optimize the performance here just isn't worthwhile. Let's please just stick with our good old `(int) Math.ceil(numMappings / 0.75)`.

--

There should be regression tests added for the three new methods. I haven't thought much about this. It might be possible to reuse some of the infrastructure in the WhiteBoxResizeTest we worked on previously.

--

I think it would be good to include updates to some of the use sites in this PR. It's often useful to put new APIs into practice. One usually learns something from the effort. Even though this is a really simple API, looking at use sites can illuminating, to see how the code reads. This might affect the method name, for example.

You don't need to update all of the use sites in the JDK, but it would be good to choose a reasonable sample. Maybe the ones from a single package, or a handful (like java.lang or java.util). Maybe include Class::enumConstantDirectory. If this use site is updated, then maybe it will allow us to get rid of that ConstantDirectoryOptimalCapacity test that we problem-listed in the previous PR.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7928