RFR: 8324573: HashMap::putAll should resize to sum of both map sizes

Wed Jan 24 21:56:25 UTC 2024

On Wed, 24 Jan 2024 20:54:43 GMT, Joshua Cao <duke at openjdk.org> wrote:

> > The current benchmark and the change don't really cover the case where many keys exist in _both_ maps. Could you add a benchmark for that?
> 
> I added a benchmark that assumes the worse case. Please see the top post. Yes, this change is not always beneficial. I'd like to think the original benchmark, which uses random keys, is closer to real life scenarios. There may be some duplicate keys, but most keys are unique. But this is just my intuition, I do not have data to back this up.

Thanks, I don't have any data on in either, I just think it's worth to consider such cases. But the regression is probably acceptable here.

> One solution would be to loop through both maps and count number of unique keys. My guess is that this would mostly add unnecessary computations. We can compromise with something like `size() / 2 + m.size()`. My guess is the current aggressive approach is the best most of the time, but I don't have strong opinions and am happy to take suggestions.

Something like that might make sense, but I doubt it will be possible to find a sweet spot that works *always*. Doing more complex calculation likely doesn't help.

> > Also `int s = size() + m.size();` can overflow now, leading to different behavior. It might make sense to just cap the value?
> 
> Is it necessary? We can go with @simonis 's suggestion. I think if we move forward with this, it would make sense to change all instances of `size` to long type, which would then suggest we should change the `Map::size()` api to return a long. The reason I am not concerned about overflow is that the Map API implies that we should only care about maps with at most 2^32 items.

With the current implementation it is necessary, especially when considering duplicated keys. The current implementation seems to just skip inserting anything if `s <= 0`, so an overflow would cause different behavior now. Maybe it makes sense to add a early return at the top of the method if `m` is empty and remove the `if (s > 0)`?

Using `double` would make sense because the value is used in double contexts afterwards anyway.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/17544#issuecomment-1908977135