Integrated: 1598: GitRepository.isHealthy takes a long time in ArchiveWorkItem

Erik Joelsson erikj at
Wed Dec 14 17:06:44 UTC 2022

On Mon, 12 Dec 2022 23:52:50 GMT, Erik Joelsson <erikj at> wrote:

> In the mlbridge bot, the `ArchiveWorkItem` is spending an inordinate amount of time running `git fsck --connectivity-only`. This is part of a rather sane strategy of verifying a repository before using it. This is done when a repository is materialized using the `HostedRepositoryPool`, if a previous local copy is found and reused, it's first checked using `Repository::isHealthy`, which runs the fsck. In the `ArchiveWorkItem` case, the repository in question is the archive repository, where all sent emails are stored for reference (and to prevent resending duplicate messages). The repository is materialized first in every `ArchiveWorkItem::run` call as it needs to read the current archive for the PR. Especially on bot restart, this triggers a lot of extra work.
> This repository is growing quite fast as it contains every PR for every repository we have Skara running against on GitHub. Having every ArchiveWorkItem verifying all this contents just doesn't scale well, when each work item will only ever read and write to a single file in the repository.
> To eliminate this inefficiency, I'm changing `ArchiveWorkItem` to read and write the single file over REST API instead. This does introduce another level of inefficiency, where the complete file must be transferred to and from the server each time, instead of just diffs, but I think given the size of the repo, and the rate at which it is growing (not to mention the fsck checks), this is the better tradeoff. 
> The patch adds a new method `HostedRepository::writeFileContents` (that pairs with the existing `::fileContents`). I've added manual tests that verifies these two methods together for both GitHub and GitLab. This prompted rewriting of `GitHubRepository::fileContents` to use "raw" mode for retrieving file contents. This in turn made me discover that we could remove all the preview feature `Accept` headers, as all those GitHub APIs are stable by now. (There is currently a bug in `RestRequest` causing only the last header with the same key to be used, so only the last of the list of preview headers was ever actually in effect. This bug was also preventing me from using raw mode, as that requires another `Accept` header. Given this, and that the mockingbird API is verified to be stable now, I'm confident in removing all of them. I will file a separate issue for the bug in `RestRequest`.)
> To use "raw" mode, I needed to use `RestRequest::executeUnparsed` and discovered that it wasn't behaving quite the same as the normal `::execute`. I changed it to throw `UncheckedRestException` like its sibling. I inspected all uses of the method and made adjustments where needed.
> `TestHostedRepository::fileContents` didn't return the raw contents of files. Specifically it didn't preserve trailing newlines. To fix this, I changed it to no longer convert to lines and then joining them again.

This pull request has now been integrated.

Changeset: affcd47d
Author:    Erik Joelsson <erikj at>
Stats:     242 lines in 10 files changed: 210 ins; 11 del; 21 mod

1598: GitRepository.isHealthy takes a long time in ArchiveWorkItem

Reviewed-by: ihse



More information about the skara-dev mailing list