RFR: 8322743: assert(held_monitor_count() == jni_monitor_count()) failed [v2]

Fri Jan 19 23:03:33 UTC 2024

On Fri, 19 Jan 2024 14:42:22 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   Address review comments
>
> Ok, so I did some rudimentary study of EA. And now this PR makes much more sense ;)
> 
> Let me summarize my understanding of the issue:
> An object gets allocated in interpreter, and we lock on it in the interpreter.
> OSR is triggered, the object is passed in as OSR parameter, we hold the lock.
> The OSR control flow now looks like this:
> 
> 
> StartOSR:
>   LoadP -> load the object created in interpreter
>   we have not_global_escape(LoadP) == false
>   so this is correctly marked as escaping
>   now the osr path injects into the middle of the loop
> 
> Loop:
>   Phi -> merge interpreter obj and that from this compiled code
>   we have not_global_escape(Phi) == false
>   ...
>   Unlock(Phi)
>   ...
>   check some condition, maybe return
>   ...
>   obj = CheckCastPP( Allocate(i.e. new Object()) )
>   we have not_global_escape(obj) == true
>   this is correct, the object will never escape
>   Lock(obj)
>   ...
>   goto Loop
> 
> 
> So if I understand this correctly, the marking in/with the ConnectionGrap is correct:
> - The object passed in through OSR is marked as escaping.
> - The object created locally is marked as non-escaping.
> - The loop-phi that merges the two must therefore also be possibly escaping.
> 
> The question is then with the condition of Lock removal:
> Can we remove the lock, just because its object is marked as non-escaping?
> At first glance: obviously, because nobody else could ever have the object, and so nobody can ever lock/unlock it.
> 
> In the example, if we look at the Unlock node, we cannot remove it (at least at first):
> its object is possibly escaping, because the Phi is not marked non-escaping.
> But we can remove the Lock, since its object is non-escaping.
> This is where the trouble starts.
> 
> I think it is exactly for this reason, that @vnkozlov thinks one cannot just look at the object of the individual Lock/Unlock node, but one has to look at all Lock/Unlock nodes of a BoxLock, and see if all objects are non-escaping.
> 
> @vnkozlov please correct me if I got something wrong ;)
> 
> I was trying to see what the meaning of the BoxLockNode is, but I did not find any useful documentation. Can you help me out here? Your patch assumes that all "relevant" Lock/Unlock nodes share the same BoxLockNode. Why is that the case?

Thank you, @eme64 for review and for "diving" into the issue to understand it. Your conclusion is correct.

First, when not-escaped object merged by Phi node with escaped one we only mark such object "Not Scalar Replaceable" `NSR`:

JavaObject(5) NoEscape(NoEscape) NSR [ [ 155 160 215 213 101 99 ]]   143  Allocate 

We can **not** eliminate it but we can still do some optimizations for it, like CMP nodes optimization and Locks elimination.
Unfortunately in this case it share `Unlock` node with escaped object so we can't eliminate `Unlock` and related `Lock`.
It is bug that we eliminated `Unlock` based only on knowledge that `Lock` can be eliminated.

There is "balanced monitors" rule: on any code path number of executed Locks and Unlocks for locked object should match. Even when an object is "local" and no other threads can see it as in this case. You either eliminate all, keep all or prove that you can eliminate some but keep them balance (as we do for `Lock Coarsening`).
This bug breaks this rule.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/17331#issuecomment-1901265695