RFR: 8315884: New Object to ObjectMonitor mapping [v9]

Tue Jul 16 12:46:55 UTC 2024

On Mon, 15 Jul 2024 00:50:30 GMT, Axel Boldt-Christmas <aboldtch at openjdk.org> wrote:

>> When inflating a monitor the `ObjectMonitor*` is written directly over the `markWord` and any overwritten data is displaced into a displaced `markWord`. This is problematic for concurrent GCs which needs extra care or looser semantics to use this displaced data. In Lilliput this data also contains the klass forcing this to be something that the GC has to take into account everywhere.
>> 
>> This patch introduces an alternative solution where locking only uses the lock bits of the `markWord` and inflation does not override and displace the `markWord`. This is done by keeping associations between objects and `ObjectMonitor*` in an external hash table. Different caching techniques are used to speedup lookups from compiled code.
>> 
>> A diagnostic VM option is introduced called `UseObjectMonitorTable`. It is only supported in combination with the LM_LIGHTWEIGHT locking mode (the default). 
>> 
>> This patch has been evaluated to be performance neutral when `UseObjectMonitorTable` is turned off (the default). 
>> 
>> Below is a more detailed explanation of this change and how `LM_LIGHTWEIGHT` and `UseObjectMonitorTable` works.
>> 
>> # Cleanups
>> 
>> Cleaned up displaced header usage for:
>>   * BasicLock
>>     * Contains some Zero changes
>>     * Renames one exported JVMCI field
>>   * ObjectMonitor
>>     * Updates comments and tests consistencies
>> 
>> # Refactoring
>> 
>> `ObjectMonitor::enter` has been refactored an a `ObjectMonitorContentionMark` witness object has been introduced to the signatures. Which signals that the contentions reference counter is being held. More details are given below in the section about deflation.
>> 
>> The initial purpose of this was to allow `UseObjectMonitorTable` to interact more seamlessly with the `ObjectMonitor::enter` code. 
>> 
>> _There is even more `ObjectMonitor` refactoring which can be done here to create a more understandable and enforceable API. There are a handful of invariants / assumptions which are not always explicitly asserted which could be trivially abstracted and verified by the type system by using similar witness objects._
>> 
>> # LightweightSynchronizer
>> 
>> Working on adapting and incorporating the following section as a comment in the source code
>> 
>> ## Fast Locking
>> 
>>   CAS on locking bits in markWord. 
>>   0b00 (Fast Locked) <--> 0b01 (Unlocked)
>> 
>>   When locking and 0b00 (Fast Locked) is observed, it may be beneficial to avoid inflating by spinning a bit.
>> 
>>   If 0b10 (Inflated) is observed or there is to...
>
> Axel Boldt-Christmas has updated the pull request incrementally with 10 additional commits since the last revision:
> 
>  - Remove try_read
>  - Add explicit to single parameter constructors
>  - Remove superfluous access specifier
>  - Remove unused include
>  - Update assert message OMCache::set_monitor
>  - Fix indentation
>  - Remove outdated comment LightweightSynchronizer::exit
>  - Remove logStream include
>  - Remove strange comment
>  - Fix javaThread include

Another review pass by me. It looks to me like the cache lookup can be improved, see comments below.

src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 323:

> 321:       ldr(t1, Address(t3_t));
> 322:       cmp(obj, t1);
> 323:       br(Assembler::EQ, monitor_found);

I think the loop could be optimized a bit, if we start with the (cache_address) - 1 in t3, then increment t3 at the start of the loop, and let the success-case fall-through and only branch back to loop-start or to failure-path. Something like:

bind(loop);
increment(t3_t, in_bytes(OMCache::oop_to_oop_difference()));
ldr(t1, Address(t3_t));
cbnz(t1, loop);
cmp(obj, t1);
br(Assembler::NE, loop);
// Success

Advantage would be that we have no forward-branch in the fast/expected case. CPU static branch prediction tends to not like that. I'm not sure if if makes a difference, though. Also, if you do that, then the unrolled loop also needs corresponding adjustment.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 674:

> 672: 
> 673:       // Search for obj in cache.
> 674:       bind(loop);

Same loop transformation would be possible here.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 776:

> 774:     movl(top, Address(thread, JavaThread::lock_stack_top_offset()));
> 775: 
> 776:     if (!UseObjectMonitorTable) {

Why is the mark loaded here in the !UOMT case, but later in the +UOMT case?

-------------

PR Review: https://git.openjdk.org/jdk/pull/20067#pullrequestreview-2179942149
PR Review Comment: https://git.openjdk.org/jdk/pull/20067#discussion_r1679210139
PR Review Comment: https://git.openjdk.org/jdk/pull/20067#discussion_r1679313050
PR Review Comment: https://git.openjdk.org/jdk/pull/20067#discussion_r1679315158