[master] RFR: ObjectMonitor Storage

Thomas Stuefe stuefe at openjdk.java.net
Tue Apr 12 16:47:25 UTC 2022


On Tue, 12 Apr 2022 14:18:52 GMT, Roman Kennke <rkennke at openjdk.org> wrote:

> > We worked very hard to get rid of TSM. I would cry a bit if this monster gets brought back from the dead. I also don't know if this compression is sound. I have seen SpecJBB2015 create millions of ObjectMonitors. So it doesn't sound too far-fetched to assume that a larger app, for example one with a TB-sized heap, would run out of address space for ObjectMonitors, even if only a reasonable percentage of objects need a monitor. I would very much prefer a solution that instead removes the ObjectMonitor pointer from the markWord completely, in favour of publishing the mapping with a (possibly optimized) hash table instead. Then you only need the hashCode in the markWord, TSM can stay dead, and it will scale better to large apps. Lookup time might be longer, but that sounds like the better price to pay, I think.
> 
> Hmm, yeah. However, this is not as simple as it seems; that global lookup table needs to be MT-safe. Probably something based on concurrentHashTable.hpp, at least for a first cut (probably structurally similar to what I did in #11 for GC forwarding). Do you have any particular optimizations in mind for this use case?


Hi @fisk,  @rkennke,

I thought this through yesterday, but I am really snowed under at the moment, so sorry for not responding more quickly.

I think it's not so easy. 

Let's say we add a hash table. We have concurrent adds and lookups. Lookups can happen in JIT-compiled code, so we need a decoding mechanism in the MacroAssembler, like we have for narrow Klass pointer decoding - but instead of a simple shift+add, this would be a hash table lookup. Depending on the hash table this can be more or less complex (e.g. a closed-hashing table may be somewhat simpler to code, but consumes more memory), but either way it is more complex than shift+add.
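
For illustration, a rough sketch of the two decode paths; all names and the table layout are made up for this example, this is not actual HotSpot code:

    #include <cstddef>
    #include <cstdint>

    // Narrow-Klass-style decoding: base + (narrow << shift). Trivial to emit
    // in the MacroAssembler.
    static char*  g_base;    // start of the reserved range (hypothetical)
    static int    g_shift;   // log2 of the allocation alignment

    inline void* decode_shift_add(uint32_t narrow) {
      return g_base + (static_cast<uintptr_t>(narrow) << g_shift);
    }

    // Hash-table decoding: even a simple closed-hashing (open-addressing)
    // table needs a hash, a compare and a reprobe loop - all of which would
    // have to be emitted in compiled code and kept consistent with
    // concurrent inserts and removals.
    struct Entry { uintptr_t obj; void* monitor; };
    static Entry* g_table;   // power-of-two sized, hypothetical
    static size_t g_mask;    // table capacity - 1

    inline void* decode_hash_lookup(uintptr_t obj) {
      size_t idx = (obj >> 3) & g_mask;   // some hash of the object address
      while (g_table[idx].obj != obj) {   // linear reprobe on collision
        idx = (idx + 1) & g_mask;
      }
      return g_table[idx].monitor;
    }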

On top of that, the concurrency aspect makes it really difficult. Maybe I'm just not seeing it.

Today in mainline, concurrency is handled silently by the libc. We have concurrent malloc() calls, and the libc handles this for us at a certain hidden expense (e.g. we pay RSS in the form of malloc arenas, or we get silent contention).

My OM store handles concurrency by bulk-allocating OMs, so that a new allocation usually just takes from a thread-local freelist. But those bulk-allocated OMs are still inactive and have no association with an object. The association is established simply by writing the OM pointer into the mark word and the oop into the OM.
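
Very roughly, and with all names invented for illustration (this is not the actual patch code), the allocation and association steps look like this:

    // Sketch only: names and layout are made up, the real patch differs.
    struct ObjectMonitor {
      ObjectMonitor* next_free = nullptr;  // freelist link while inactive
      void*          object    = nullptr;  // the associated oop, once in use
    };

    struct OMThreadLocalCache {
      ObjectMonitor* free_list = nullptr;  // thread-local, so no locking here

      ObjectMonitor* allocate() {
        if (free_list == nullptr) {
          refill();                        // the only synchronized step
        }
        ObjectMonitor* om = free_list;
        free_list = om->next_free;
        return om;                         // still inactive, no oop attached
      }

      void refill() {
        // Stand-in for bulk-carving a block of inactive OMs out of the OM
        // store; a plain heap batch keeps this sketch self-contained.
        for (int i = 0; i < 16; i++) {
          ObjectMonitor* om = new ObjectMonitor();
          om->next_free = free_list;
          free_list = om;
        }
      }
    };

    // Association is then just two plain stores, with no central table:
    //   object's mark word  <- pointer to om
    //   om->object          <- oop of the object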

A central hash map, by contrast, would have to be updated for every changed OM<->oop association - concurrently with lookups, and from compiled code.

I honestly think that such a solution would be more complex than a linear lookup table like my proposed OM store. Note that if you don't like the memory management aspect of it, we could still have just a linear pointer table and keep the OMs allocated in C-heap. That would reduce some complexity at the cost of one more indirection and more memory (both the added libc overhead and the pointer table itself). The advantage would be that we would not have to pre-reserve a large memory range, so less virtual memory usage.
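
To illustrate that fallback variant (again just a sketch with made-up names): the narrow OM "pointer" becomes an index into a flat pointer table, and the OMs themselves stay in C-heap.

    #include <cstdint>

    // Sketch only: a flat table of OM*, the narrow value is just an index.
    class ObjectMonitor;                 // opaque here
    static ObjectMonitor** g_om_table;   // grows as needed, entries never move

    inline ObjectMonitor* decode_om(uint32_t narrow_om) {
      return g_om_table[narrow_om];      // one extra load compared to shift+add
    }

Decoding stays a single indexed load; all the interesting work (growing the table, handing out indices) is confined to the allocation path.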

About the "limit" aspect - lets say we have 32-bit narrow OM pointers, then we can have 4 billion OMs. Do you really see that as a limit? Arguably, at that point, we would have a lot of other problems as well. We would spend 1TB alone for OMs (with each OM being 256 bytes, if we align them to cache line size).

I'm still thinking. I'm sorry for causing pain, and I like reducing complexity much more than inflating it.

In my defense, I was hoping my solution (AddressStableArray and friends) could be reused in other places, e.g. as backing memory for Thread (or maybe even Klass, if we ever manage to make it homogeneously sized).

Cheers, Thomas

-------------

PR: https://git.openjdk.java.net/lilliput/pull/39

