[External] : Re: Object to ObjectMonitor mapping using a HashTable to avoid displaced headers
Axel Boldt-Christmas
axel.boldt-christmas at oracle.com
Thu Feb 15 06:53:11 UTC 2024
Hi Roman,
> How bad is that? Have you ever observed this to be an actual problem? (It seems kind-of the reverse of what is currently done with the INFLATING protocol and doesn’t seem very much worse than that. Correct me if I’m wrong here.)
No, I have never observed it being an issue. The window after `is_being_async_deflated()` has transitioned the markWord is so small that I do not believe hitting it has ever been observed.
However, before we added resizing to the HashTable there were runs where internal buckets had linked lists with more than 100 entries. Noticeable time was then spent in insert, so I would guess that similar times would be seen for remove (walking the links was costly). So the window until a thread could inflate again would be significant. This is more of a time-to-safepoint (TTS) problem: any safepoint would still need to wait for the deflation thread to finish removing the ObjectMonitor from the hash table, but polling for safepoints earlier in the inflating threads means that the worst-case time would be MAX(deflation, inflation + enter) instead of deflation + inflation + enter.
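To make the ordering concrete, here is a rough standalone sketch (not actual HotSpot code, all names and types are made up for illustration) of what I mean by polling earlier: an inflating thread that runs into a monitor which is being async-deflated checks for a pending safepoint before retrying, so deflation and re-inflation overlap instead of being serialized.

```c++
#include <atomic>

struct MonitorStub {
  std::atomic<bool> being_deflated{false};
  std::atomic<bool> locked{false};
};

static std::atomic<bool> safepoint_pending{false};

static void poll_for_safepoint() {
  // Stand-in for the real safepoint check; a real thread would block here
  // until the safepoint operation has completed.
  while (safepoint_pending.load(std::memory_order_acquire)) { /* wait */ }
}

static MonitorStub* lookup_or_inflate(MonitorStub* obj) {
  // Stand-in for the hash-table lookup / insert of an ObjectMonitor for obj.
  return obj;
}

void inflate_and_enter(MonitorStub* obj) {
  for (;;) {
    MonitorStub* m = lookup_or_inflate(obj);
    if (m->being_deflated.load(std::memory_order_acquire)) {
      // The monitor is on its way out of the table. Rather than waiting for
      // the (potentially slow) hash-table removal to finish, give a pending
      // safepoint a chance to proceed first, then retry the inflation.
      poll_for_safepoint();
      continue;
    }
    bool expected = false;
    if (m->locked.compare_exchange_strong(expected, true)) {
      return;  // entered the (re)inflated monitor
    }
  }
}
```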
> Uhhh, ok. Wait a second, aren’t threads that want to inflate a monitor already at a safepoint?
They are just Java threads running in the VM. Or do you mean something else?
> My gut feeling would have been no more than 4, perhaps even less, maybe just 1 (which would make the code much simpler). Rationale is that, how many OMs can you have that *are hot* at the same time on a given thread. If a single thread needs to deal with several nested monitors, then surely only the innermost 1-2 monitors would be hot?
In some of the larger benchmark runs I did last year, a cache size of 3 outperformed a size of 1 (some larger improvements and no significant regressions), but maybe 2 is correct. A lot of the benchmarks run last year happened while many things were in flux. As mentioned in the mail, a more targeted evaluation of just the cache size should be done now that fewer things are in flux.
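For intuition, this is roughly the shape of structure we are sizing (an illustrative sketch only, not the actual OMWorld cache): a small fixed-size per-thread object-to-monitor cache where a hit is moved to the front, so the innermost hot monitors stay cheapest and the size constant decides how many hot monitors a thread can track.

```c++
#include <cstddef>
#include <utility>

struct ObjectMonitorStub;  // opaque stand-in for the real ObjectMonitor

template <size_t N = 3>    // e.g. the cache size discussed above
class MonitorCache {
  std::pair<const void*, ObjectMonitorStub*> _entries[N] = {};

 public:
  ObjectMonitorStub* lookup(const void* obj) {
    for (size_t i = 0; i < N; i++) {
      if (_entries[i].first == obj) {
        // Move the hit to the front so the hottest monitors stay cheapest.
        std::pair<const void*, ObjectMonitorStub*> hit = _entries[i];
        for (size_t j = i; j > 0; j--) {
          _entries[j] = _entries[j - 1];
        }
        _entries[0] = hit;
        return hit.second;
      }
    }
    return nullptr;  // miss: caller falls back to the shared hash table
  }

  void insert(const void* obj, ObjectMonitorStub* m) {
    // Shift everything down one slot, dropping the coldest entry.
    for (size_t j = N - 1; j > 0; j--) {
      _entries[j] = _entries[j - 1];
    }
    _entries[0] = {obj, m};
  }
};
```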
> I am not sure that observed cache-misses is the most relevant metric, here. What’s relevant is how deep does the cache have to be to make a performance difference. Or maybe I am over-optimistic about how complex locking schemes can be ;-)
I agree. These metrics are more for when performance regressions are observed, to see the actual behaviour of specific workloads. It seems like there are some wild locking schemes out there, especially when you run benchmark suites covering over 25 years of software development.
> Right. I would think that we do want to be able to shrink the table. I don’t think that we want to stick with a large table forever, only because we had an OM spike at some time (e.g. start-up). But I have no data yet to support that.
Yes. It is worth thinking about good schemes for deciding when to shrink the table.
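As a strawman only (nothing measured or decided, thresholds and names invented for the sketch), the simplest scheme would be a load-factor check after a deflation cycle, so a one-off inflation spike at start-up does not pin a large table for the lifetime of the VM:

```c++
#include <cstddef>

struct TableStats {
  size_t entries;      // live ObjectMonitors currently in the table
  size_t buckets;      // current number of buckets
  size_t min_buckets;  // floor we never shrink below
};

bool should_shrink(const TableStats& s) {
  // Shrink when the table is mostly empty and still larger than the floor.
  return s.buckets > s.min_buckets && s.entries < s.buckets / 8;
}
```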
> Is there a way to do it concurrently?
I am talking about the thread state. The resizing is concurrent. (With this being done in ThreadBlockInVM, it actually resizes straight through safepoints.) I need to do more research / ask more knowledgeable people about this. The String and Symbol tables do this from the service thread while in VM, but create their own grow loop which polls for safepoints per bucket. We have treated the ConcurrentHashTable very much as a black box. It may be time to get some of the runtime engineers who have more experience with this particular implementation involved in the discussion: how to handle resizing, what to expect w.r.t. performance, and gotchas with this specific use case.
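For reference, the per-bucket pattern I mean looks roughly like this (a self-contained sketch, not the real ConcurrentHashTable or String/Symbol table API; names and structure are assumptions): the resize runs concurrently with mutators, only one bucket is processed at a time, and the loop yields to safepoints between buckets so a large resize never becomes a TTS issue.

```c++
#include <atomic>
#include <cstddef>

struct GrowTask {
  size_t next_bucket = 0;
  size_t num_buckets = 0;
  std::atomic<bool>* safepoint_pending = nullptr;

  bool do_one_bucket() {
    // Stand-in for rehashing the entries of one bucket into the new table,
    // holding only that bucket's lock.
    return ++next_bucket < num_buckets;
  }

  void check_for_safepoint() {
    // Stand-in for the real safepoint poll; a real thread would block here
    // while a safepoint operation is in progress.
    while (safepoint_pending->load(std::memory_order_acquire)) { /* wait */ }
  }
};

void grow_table(GrowTask& task) {
  // Yield to safepoints between buckets instead of after the whole resize.
  while (task.do_one_bucket()) {
    task.check_for_safepoint();
  }
}
```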
> Are there any notable interactions in the inflation protocol when we have to manipulate both the lock-bits *and* the hash-bits? I don’t think there would be, but maybe you can confirm?
There will be multiple CASes, as we first install the hash and then transition the header to monitor. But the idea is that the only reason this happens is “real” lock contention, so an extra CAS is the least of our problems from a performance perspective.
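Simplified, the sequence is the following (a standalone illustration only, not actual markWord code; the bit layout and constants are made up for the sketch): one CAS installs the identity hash bits, a second CAS flips the lock bits to the monitor state while preserving the hash.

```c++
#include <atomic>
#include <cstdint>

constexpr uint64_t LOCK_BITS_MASK = 0x3;   // low bits: lock state (assumed)
constexpr uint64_t MONITOR_STATE  = 0x2;   // "has monitor" pattern (assumed)
constexpr uint64_t HASH_SHIFT     = 8;     // assumed hash field position
constexpr uint64_t HASH_MASK      = 0x7fffffffull << HASH_SHIFT;

// First CAS: install the hash bits if not already present, keeping the
// lock bits untouched.
bool install_hash(std::atomic<uint64_t>& header, uint64_t hash) {
  uint64_t old_h = header.load(std::memory_order_relaxed);
  if ((old_h & HASH_MASK) != 0) return true;  // hash already installed
  uint64_t new_h = old_h | ((hash << HASH_SHIFT) & HASH_MASK);
  return header.compare_exchange_strong(old_h, new_h) ||
         (header.load() & HASH_MASK) != 0;    // lost the race to another hasher
}

// Second CAS: transition the lock bits to the monitor state, preserving
// the hash installed above. Caller retries on failure.
bool transition_to_monitor(std::atomic<uint64_t>& header) {
  uint64_t old_h = header.load(std::memory_order_relaxed);
  uint64_t new_h = (old_h & ~LOCK_BITS_MASK) | MONITOR_STATE;
  return header.compare_exchange_strong(old_h, new_h);
}
```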
> This sounds useful! I assume this is only done in the runtime, and fast-paths call into the runtime on first CAS-failure?
Correct.
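That is, the shape is roughly this (a tiny sketch with invented constants, not the emitted code): a single CAS attempt in the fast path, and any failure goes straight to the runtime, where the cache lookup and inflation logic above live.

```c++
#include <atomic>
#include <cstdint>

// Stand-in for the runtime entry point; in reality this is where the retry,
// per-thread cache lookup and inflation logic live.
static void enter_slow_path(std::atomic<uint64_t>& /*header*/) {}

void fast_lock(std::atomic<uint64_t>& header) {
  uint64_t expected = 0x1;  // assumed "unlocked" header pattern
  uint64_t locked   = 0x0;  // assumed "fast-locked" header pattern
  if (!header.compare_exchange_strong(expected, locked)) {
    // No in-line retry loop: contention or an unexpected header state goes
    // straight to the runtime.
    enter_slow_path(header);
  }
}
```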
> For this, could you, once the recursive-LW stuff is all pushed, update/merge latest jdk into your jdk-based branch?
I will.
Speaking of which, I was going to ask you how we should handle development (especially if more people get involved).
For now I will work on changes based on the OMWorld Lilliput branch and replicate (cherry-pick) them where applicable into my private mainline-based branch. Until I am an official committer in the Lilliput project, I believe StefanK and I can figure out the technicalities around pushing to the branch.
So essentially: work on Lilliput/OMWorld, keep a private branch based on the latest mainline merged into Lilliput/master (with the OMWorld work replicated), and keep merging Lilliput/master into Lilliput/OMWorld.
> Ugh. Any idea why that is so?
Only speculation. The compiled vs. not-compiled issue seems like a more binary thing where we might be (maybe for good reasons) more restrictive than necessary; this is probably the easier of the two to get an understanding of. The C1 vs. C2 issue is probably caused by many factors, and I have nothing but educated guesses here. Something for engineers more versed in the compilers to tackle.
> I would suggest to break up your efforts into tasks that are absolutely essential, and tasks/ideas that can be done after the first big thing has landed and stabilized. This would follow pretty much the structure of this document - focus on primary stuff first, and do most of the rest later. It’s just so hard to handle so many moving parts at the same time.
Sounds good. After recursive lightweight locking is integrated into mainline and merged into Lilliput (and OMWorld), there should be a lot less code motion around the affected code. I will create sub-tasks (and figure out their dependencies) as you suggest.
Thanks for the response.
Sincerely,
Axel