RFR: 8229012: When single stepping, the debug agent can cause the thread to remain in interpreter mode after single stepping completes [v4]

Tue Feb 18 20:49:55 UTC 2025

On Tue, 18 Feb 2025 19:42:07 GMT, Chris Plummer <cjplummer at openjdk.org> wrote:

>> That's a good question. I hadn't noticed the extra check for step->pending, and it looks like that was also in place for the previous version that relied on calling ClearFramePop on each outstanding NotifyFramePop. I think it is just bit rot as the code initially evolved. If it was possibly for the flag to be cleared by some other thread (I actually think it might be, but need to look closer), this extra check is not of much help since it could be cleared after the check. Let me look into this a bit more.
>
> There are multiple threads that can end up calling clearStep(). However, all callers of clearStep() grab the eventHandler_lock() first, with one somewhat obscure exception:
> 
> debugInit_reset
> threadControl_reset
> removeVThreads
> clearThread
> stepControl_clearRequest
> clearStep
> 
> This is done for a debugger reset after a connection has been dropped. Probably not at much risk of a race in clearStep, but I think is best to add eventHandler locking here also. After doing this there will be no race conditions to worry about.
> 
> I'm pretty sure this extra check for "pending" is just bit rot. I know I wanted an if block around the new code to make it easy to disable so I could easily benchmark and test with and without "the fix". I think at one point the code looked a lot different (I think there was something outside of the if), but that was long ago.

I looked a bit more...

There are multiple threads that can end up calling clearStep(), so we need to be careful about races. There are two main paths into clearStep.
-The first is through stepControl_endStep. It always grabs the handler_lock and stepControl_lock.
-The 2nd is through through stepControl_clearRequest, which grabs no locks itself, but there are some already grabbed when it is called. It should really grab the stepControl_lock, but it can't because that lock is suppose to be grabbed before threadControl_lock, which has already been grabbed.  The proper order is handler_lock -> stepControl_lock -> threadControl_lock. So we need to fix locking for stepControl_clearRequest somehow.

There are 3 paths into stepControl_clearRequest().

1. The first goes through clearThread via removeThread via removeResumed.  This path always holds the handler_lock and threadControl_lock.
2. The 2nd goes through clearThread via removeThread via threadControl_onEventHandlerExit. This also holds the handler_lock and threadControl_onEventHandlerExit.
3. The 3rd goes through clearThread via removeVThreads via threadControl_reset. This path always holds the handler_lock, but not threadControl_lock

Holding the threadControl_lock here does not help us since stepControl_endStep does not grab that lock. Holding the eventHandler_lock does help since stepControl_endStep does grab it, but the 3rd path above does not grab the eventHandler_lock. I think the simplest fix here is to just make it do so. But that's using the eventHandler_lock to make stepControl code thread safe. Not exactly clean. So that means instead requiring the caller of stepControl_endStep to do all the needed locking. Basically add locking of stepControl_lock somewhere. Adding locking of stepControl_lock in clearThread is too late since we already hold the threadLock. It really needs to be pushed up the call chain of each of the above 3 paths to the point just before grabing the threadLock. For example for the 2nd call path above we have:

    if (ei == EI_THREAD_END) {
        eventHandler_lock(); /* for proper lock order - see removeThread() call below */
        debugMonitorEnter(threadLock);

So we need to add stepControl_lock() in between those too lines. We need to do similar for the 1st and 3rd paths also.

I think this is best held off for another PR. I can file a CR for it now with these notes. It's not a new problem, although might be slightly more at risk of going awry since we are doing a lot more in clearStep() now, but so far non of my testing has turned up any issues, so I think a proper fix can wait. It also relates to the ranked locking support project, which I hope to start back up soon.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23182#discussion_r1960569052