RFR: 8358343: [leyden] Drop notify_all in CompilationPolicyUtils::Queue::pop
Found this when reading premain-vs-mainline webrev. Mainline does not have `notify_all` in this method: https://github.com/openjdk/jdk/blob/c382da579884c28f2765b2c6ba68c0ad4fdcb2ce... But if you remove `notify_all()` in `premain`, then tests start to deadlock, see bug for a sample. The culprit is `CompilationPolicy::flush_replay_training_at_init`, which is only present in premain. I fixed it by using timed waits, which obviates the need for extra notifications. We only enter this method with `-XX:+AOTVerifyTrainingData`, so we don't care much about its performance. This is IMO better than doing a questionable `notify_all` followed by `wait` in load-bearing code. Additional testing: - [x] Linux x86_64 server fastdebug, `runtime/cds` (5x, no timeouts yet; still running more iterations) ------------- Commit messages: - Fix Changes: https://git.openjdk.org/leyden/pull/74/files Webrev: https://webrevs.openjdk.org/?repo=leyden&pr=74&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8358343 Stats: 5 lines in 2 files changed: 3 ins; 1 del; 1 mod Patch: https://git.openjdk.org/leyden/pull/74.diff Fetch: git fetch https://git.openjdk.org/leyden.git pull/74/head:pull/74 PR: https://git.openjdk.org/leyden/pull/74
On Tue, 3 Jun 2025 17:59:51 GMT, Aleksey Shipilev <shade@openjdk.org> wrote:
Found this when reading premain-vs-mainline webrev. Mainline does not have `notify_all` in this method: https://github.com/openjdk/jdk/blob/c382da579884c28f2765b2c6ba68c0ad4fdcb2ce...
But if you remove `notify_all()` in `premain`, then tests start to deadlock, see bug for a sample. The culprit is `CompilationPolicy::flush_replay_training_at_init`, which is only present in premain. I fixed it by using timed waits, which obviates the need for extra notifications. We only enter this method with `-XX:+AOTVerifyTrainingData`, so we don't care much about its performance. This is IMO better than doing a questionable `notify_all` followed by `wait` in load-bearing code.
Additional testing: - [x] Linux x86_64 server fastdebug, `runtime/cds` (5x, no timeouts yet; still running more iterations)
@veresov @iwanowww ^ ------------- PR Comment: https://git.openjdk.org/leyden/pull/74#issuecomment-2936964207
On Tue, 3 Jun 2025 17:59:51 GMT, Aleksey Shipilev <shade@openjdk.org> wrote:
Found this when reading premain-vs-mainline webrev. Mainline does not have `notify_all` in this method: https://github.com/openjdk/jdk/blob/c382da579884c28f2765b2c6ba68c0ad4fdcb2ce...
But if you remove `notify_all()` in `premain`, then tests start to deadlock, see bug for a sample. The culprit is `CompilationPolicy::flush_replay_training_at_init`, which is only present in premain. I fixed it by using timed waits, which obviates the need for extra notifications. We only enter this method with `-XX:+AOTVerifyTrainingData`, so we don't care much about its performance. This is IMO better than doing a questionable `notify_all` followed by `wait` in load-bearing code.
Additional testing: - [x] Linux x86_64 server fastdebug, `runtime/cds` (5x, no timeouts yet; still running more iterations)
Is there a reason why can't we just do the processing work in the thread calling the flush instead of waiting for the replay thread? That is, why not make it be like this: void CompilationPolicy::flush_replay_training_at_init(TRAPS) { InstaceKlass* ik; do { ik = _training_replay_queue.try_pop(TrainingReplayQueue_lock, THREAD); if (ik != nullptr) { replay_training_at_init_impl(ik, THREAD); } } while (ik != nullptr); } ------------- PR Comment: https://git.openjdk.org/leyden/pull/74#issuecomment-2947960746
On Fri, 6 Jun 2025 03:22:05 GMT, Igor Veresov <iveresov@openjdk.org> wrote:
Is there a reason why can't we just do the processing work in the thread calling the flush instead of waiting for the replay thread? That is, why not make it be like this:
You tell me :) I guess one upside of current code is to leave draining/processing in one place/thread, and thus never run into false positives/negatives due to diagnostic code (this hunk, gated by `-XX:+AOTVerifyTrainingData`) doing something that production code does not. So I mildly prefer it in current form. I can change it, if you want. ------------- PR Comment: https://git.openjdk.org/leyden/pull/74#issuecomment-2948283345
Found this when reading premain-vs-mainline webrev. Mainline does not have `notify_all` in this method: https://github.com/openjdk/jdk/blob/c382da579884c28f2765b2c6ba68c0ad4fdcb2ce...
But if you remove `notify_all()` in `premain`, then tests start to deadlock, see bug for a sample. The culprit is `CompilationPolicy::flush_replay_training_at_init`, which is only present in premain. I fixed it by using timed waits, which obviates the need for extra notifications. We only enter this method with `-XX:+AOTVerifyTrainingData`, so we don't care much about its performance. This is IMO better than doing a questionable `notify_all` followed by `wait` in load-bearing code.
Additional testing: - [x] Linux x86_64 server fastdebug, `runtime/cds` (5x, no timeouts yet; still running more iterations)
Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: - Merge branch 'premain' into JDK-8358343-leyden-training-notify-all - Fix ------------- Changes: - all: https://git.openjdk.org/leyden/pull/74/files - new: https://git.openjdk.org/leyden/pull/74/files/1f98bbe5..8af7ae69 Webrevs: - full: https://webrevs.openjdk.org/?repo=leyden&pr=74&range=01 - incr: https://webrevs.openjdk.org/?repo=leyden&pr=74&range=00-01 Stats: 95 lines in 26 files changed: 40 ins; 44 del; 11 mod Patch: https://git.openjdk.org/leyden/pull/74.diff Fetch: git fetch https://git.openjdk.org/leyden.git pull/74/head:pull/74 PR: https://git.openjdk.org/leyden/pull/74
On Fri, 6 Jun 2025 07:34:46 GMT, Aleksey Shipilev <shade@openjdk.org> wrote:
Found this when reading premain-vs-mainline webrev. Mainline does not have `notify_all` in this method: https://github.com/openjdk/jdk/blob/c382da579884c28f2765b2c6ba68c0ad4fdcb2ce...
But if you remove `notify_all()` in `premain`, then tests start to deadlock, see bug for a sample. The culprit is `CompilationPolicy::flush_replay_training_at_init`, which is only present in premain. I fixed it by using timed waits, which obviates the need for extra notifications. We only enter this method with `-XX:+AOTVerifyTrainingData`, so we don't care much about its performance. This is IMO better than doing a questionable `notify_all` followed by `wait` in load-bearing code.
Additional testing: - [x] Linux x86_64 server fastdebug, `runtime/cds` (5x, no timeouts yet; still running more iterations)
Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision:
- Merge branch 'premain' into JDK-8358343-leyden-training-notify-all - Fix
Actually the current approach (even with the spin-wait) and my solution too are not really correct. The fact that the queue is empty doesn't mean that every last item has been processed. The last item may have been popped, but still is being worked on. So however you look at it we should set up some kind of a handshake to make sure the replay thread is done processing, not just done popping. ------------- PR Comment: https://git.openjdk.org/leyden/pull/74#issuecomment-2948506858
On Fri, 6 Jun 2025 07:34:46 GMT, Aleksey Shipilev <shade@openjdk.org> wrote:
Found this when reading premain-vs-mainline webrev. Mainline does not have `notify_all` in this method: https://github.com/openjdk/jdk/blob/c382da579884c28f2765b2c6ba68c0ad4fdcb2ce...
But if you remove `notify_all()` in `premain`, then tests start to deadlock, see bug for a sample. The culprit is `CompilationPolicy::flush_replay_training_at_init`, which is only present in premain. I fixed it by using timed waits, which obviates the need for extra notifications. We only enter this method with `-XX:+AOTVerifyTrainingData`, so we don't care much about its performance. This is IMO better than doing a questionable `notify_all` followed by `wait` in load-bearing code.
Additional testing: - [x] Linux x86_64 server fastdebug, `runtime/cds` (5x, no timeouts yet; still running more iterations)
Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision:
- Merge branch 'premain' into JDK-8358343-leyden-training-notify-all - Fix
Dang, too late. I think this was fixed in `premain` with: https://github.com/openjdk/leyden/commit/ca21af49a1de66b623d5a7481877766e915... ------------- PR Comment: https://git.openjdk.org/leyden/pull/74#issuecomment-3248479147
On Tue, 3 Jun 2025 17:59:51 GMT, Aleksey Shipilev <shade@openjdk.org> wrote:
Found this when reading premain-vs-mainline webrev. Mainline does not have `notify_all` in this method: https://github.com/openjdk/jdk/blob/c382da579884c28f2765b2c6ba68c0ad4fdcb2ce...
But if you remove `notify_all()` in `premain`, then tests start to deadlock, see bug for a sample. The culprit is `CompilationPolicy::flush_replay_training_at_init`, which is only present in premain. I fixed it by using timed waits, which obviates the need for extra notifications. We only enter this method with `-XX:+AOTVerifyTrainingData`, so we don't care much about its performance. This is IMO better than doing a questionable `notify_all` followed by `wait` in load-bearing code.
Additional testing: - [x] Linux x86_64 server fastdebug, `runtime/cds` (5x, no timeouts yet; still running more iterations)
I don't remember. :) I'm just not a big fan of spin-waits even with sleeps inside... ------------- PR Comment: https://git.openjdk.org/leyden/pull/74#issuecomment-2948362494
On Tue, 3 Jun 2025 17:59:51 GMT, Aleksey Shipilev <shade@openjdk.org> wrote:
Found this when reading premain-vs-mainline webrev. Mainline does not have `notify_all` in this method: https://github.com/openjdk/jdk/blob/c382da579884c28f2765b2c6ba68c0ad4fdcb2ce...
But if you remove `notify_all()` in `premain`, then tests start to deadlock, see bug for a sample. The culprit is `CompilationPolicy::flush_replay_training_at_init`, which is only present in premain. I fixed it by using timed waits, which obviates the need for extra notifications. We only enter this method with `-XX:+AOTVerifyTrainingData`, so we don't care much about its performance. This is IMO better than doing a questionable `notify_all` followed by `wait` in load-bearing code.
Additional testing: - [x] Linux x86_64 server fastdebug, `runtime/cds` (5x, no timeouts yet; still running more iterations)
It'd be also a potentially nicely reusable method. Provided that it works of course... ------------- PR Comment: https://git.openjdk.org/leyden/pull/74#issuecomment-2948368134
Found this when reading premain-vs-mainline webrev. Mainline does not have `notify_all` in this method: https://github.com/openjdk/jdk/blob/c382da579884c28f2765b2c6ba68c0ad4fdcb2ce...
But if you remove `notify_all()` in `premain`, then tests start to deadlock, see bug for a sample. The culprit is `CompilationPolicy::flush_replay_training_at_init`, which is only present in premain. I fixed it by using timed waits, which obviates the need for extra notifications. We only enter this method with `-XX:+AOTVerifyTrainingData`, so we don't care much about its performance. This is IMO better than doing a questionable `notify_all` followed by `wait` in load-bearing code.
Additional testing: - [x] Linux x86_64 server fastdebug, `runtime/cds` (5x, no timeouts yet; still running more iterations)
Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: - Coordinate with replay thread shutdown - Merge branch 'premain' into JDK-8358343-leyden-training-notify-all - Merge branch 'premain' into JDK-8358343-leyden-training-notify-all - Fix ------------- Changes: - all: https://git.openjdk.org/leyden/pull/74/files - new: https://git.openjdk.org/leyden/pull/74/files/8af7ae69..f8c4debb Webrevs: - full: https://webrevs.openjdk.org/?repo=leyden&pr=74&range=02 - incr: https://webrevs.openjdk.org/?repo=leyden&pr=74&range=01-02 Stats: 182204 lines in 3878 files changed: 112334 ins; 45443 del; 24427 mod Patch: https://git.openjdk.org/leyden/pull/74.diff Fetch: git fetch https://git.openjdk.org/leyden.git pull/74/head:pull/74 PR: https://git.openjdk.org/leyden/pull/74
On Tue, 3 Jun 2025 17:59:51 GMT, Aleksey Shipilev <shade@openjdk.org> wrote:
Found this when reading premain-vs-mainline webrev. Mainline does not have `notify_all` in this method: https://github.com/openjdk/jdk/blob/c382da579884c28f2765b2c6ba68c0ad4fdcb2ce...
But if you remove `notify_all()` in `premain`, then tests start to deadlock, see bug for a sample. The culprit is `CompilationPolicy::flush_replay_training_at_init`, which is only present in premain. I fixed it by using timed waits, which obviates the need for extra notifications. We only enter this method with `-XX:+AOTVerifyTrainingData`, so we don't care much about its performance. This is IMO better than doing a questionable `notify_all` followed by `wait` in load-bearing code.
Additional testing: - [x] Linux x86_64 server fastdebug, `runtime/cds` (5x, no timeouts yet; still running more iterations)
This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/leyden/pull/74
participants (2)
-
Aleksey Shipilev
-
Igor Veresov