RFR: 1699: CommitCommentsWorkItem does excessive unnecessary work

Erik Joelsson erikj at openjdk.org
Tue Dec 20 21:19:20 UTC 2022

The `CommitCommentsWorkItem` gets scheduled once for each repository by `PullRequestBot::getPeriodicItems`. It's responsible for querying the repository for new comments on commits, and if any are found, spawn `CommitCommandWorkItem` for them. It's unfortunately not very straight forward to query forges for new commit comments. For GitHub we have a rather complicated GraphQL that does the trick, but for GitLab, we need to jump through some rather nasty hoops.

The main issue is that the data we get from the "events" query from GitLab contains notes posted to commits, but without the commit hash. We only get the commit "title" to identify which commit was commented on. To solve this, we have a rather elaborate logic that first builds a complete map from title to set of hashes for every commit in the repo. This is done preemptively on a local clone of the repo. Using this map and some clever comparisons, we can figure out which commit each comment belongs to.

I don't have data on how long it takes to build this map for a big repository, like the JDK, but it's likely not trivially short. In addition to this, before using any local clone of a repository, we always run the git fsck check (see [SKARA-1598](https://bugs.openjdk.org/browse/SKARA-1598)), which can take around 20s for the JDK repo. The local clone is unfortunately also used to check if the commits for the found comments belong to a valid branch. However, both of these operations are read only, so we shouldn't really need a clone for this. We should be fine just using the seed repository directly from the `HostedRepositoryPool`.

This patch tries to make this situation better in several ways.

1. Use the seed repository instead of a clone in `CommitCommentsWorkItem`. (Removes ~20s fsck time every time this is scheduled for a jdk repository, which is in the order of once a minute times the numer of JDK repo clones that Skara operates on)
2. Refactored `HostedRepository::recentCommitComments` to take a `ReadOnlyRepository` instead of the pre calculated map. `GitLabRepository` is then responsible for using this repository to build the map itself.
3. `HostedRepository::recentCommitComments` now takes an `updatedAfter` arg to limit the number of comments that get returned. In `GitLabRepository` this was already limited to last 4 days, but in `GitHubRepository`, we just got the last 100 comments, regardless of age (due to a limitation in GraphQL). We can still filter the returned comments on updatedAt so that we get less comments to process in many cases. This should make a big impact on bot restarts. 
4. `CommitCommentsWorkItem` has a configuration option to ignore comments from certain users (typically meant for bot users, e.g. the notifier who posts a comment on every commit). This configuration wasn't working due to a bug in the factory. (Together with 3, should bring down the number of scheduled `CommitCommandWorkItem` on bot restart from 100 per repo to a handful and often 0.)
5. `GitLabRepository` keeps a cached copy of the commitTitleToHash map. It will also only build or update the map if any new notes are actually found. (This won't have a huge impact, but should speed things up a little bit more)


Commit messages:
 - Cache titleToCommit map between calls
 - SKARA-1699

Changes: https://git.openjdk.org/skara/pull/1446/files
 Webrev: https://webrevs.openjdk.org/?repo=skara&pr=1446&range=00
  Issue: https://bugs.openjdk.org/browse/SKARA-1699
  Stats: 117 lines in 7 files changed: 77 ins; 21 del; 19 mod
  Patch: https://git.openjdk.org/skara/pull/1446.diff
  Fetch: git fetch https://git.openjdk.org/skara pull/1446/head:pull/1446

PR: https://git.openjdk.org/skara/pull/1446

More information about the skara-dev mailing list