RFR: 1598: GitRepository.isHealthy takes a long time in ArchiveWorkItem [v2]

Erik Joelsson erikj at openjdk.org
Wed Dec 14 14:26:38 UTC 2022


> In the mlbridge bot, the `ArchiveWorkItem` is spending an inordinate amount of time running `git fsck --connectivity-only`. This is part of a rather sane strategy of verifying a repository before using it. This is done when a repository is materialized using the `HostedRepositoryPool`, if a previous local copy is found and reused, it's first checked using `Repository::isHealthy`, which runs the fsck. In the `ArchiveWorkItem` case, the repository in question is the archive repository, where all sent emails are stored for reference (and to prevent resending duplicate messages). The repository is materialized first in every `ArchiveWorkItem::run` call as it needs to read the current archive for the PR. Especially on bot restart, this triggers a lot of extra work.
> 
> This repository is growing quite fast as it contains every PR for every repository we have Skara running against on GitHub. Having every ArchiveWorkItem verifying all this contents just doesn't scale well, when each work item will only ever read and write to a single file in the repository.
> 
> To eliminate this inefficiency, I'm changing `ArchiveWorkItem` to read and write the single file over REST API instead. This does introduce another level of inefficiency, where the complete file must be transferred to and from the server each time, instead of just diffs, but I think given the size of the repo, and the rate at which it is growing (not to mention the fsck checks), this is the better tradeoff. 
> 
> The patch adds a new method `HostedRepository::writeFileContents` (that pairs with the existing `::fileContents`). I've added manual tests that verifies these two methods together for both GitHub and GitLab. This prompted rewriting of `GitHubRepository::fileContents` to use "raw" mode for retrieving file contents. This in turn made me discover that we could remove all the preview feature `Accept` headers, as all those GitHub APIs are stable by now. (There is currently a bug in `RestRequest` causing only the last header with the same key to be used, so only the last of the list of preview headers was ever actually in effect. This bug was also preventing me from using raw mode, as that requires another `Accept` header. Given this, and that the mockingbird API is verified to be stable now, I'm confident in removing all of them. I will file a separate issue for the bug in `RestRequest`.)
> 
> To use "raw" mode, I needed to use `RestRequest::executeUnparsed` and discovered that it wasn't behaving quite the same as the normal `::execute`. I changed it to throw `UncheckedRestException` like its sibling. I inspected all uses of the method and made adjustments where needed.
> 
> `TestHostedRepository::fileContents` didn't return the raw contents of files. Specifically it didn't preserve trailing newlines. To fix this, I changed it to no longer convert to lines and then joining them again.

Erik Joelsson has updated the pull request incrementally with one additional commit since the last revision:

  Update forge/src/main/java/org/openjdk/skara/forge/github/GitHubRepository.java
  
  Co-authored-by: Magnus Ihse Bursie <mag at icus.se>

-------------

Changes:
  - all: https://git.openjdk.org/skara/pull/1442/files
  - new: https://git.openjdk.org/skara/pull/1442/files/de5bf32b..eaab258f

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=skara&pr=1442&range=01
 - incr: https://webrevs.openjdk.org/?repo=skara&pr=1442&range=00-01

  Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
  Patch: https://git.openjdk.org/skara/pull/1442.diff
  Fetch: git fetch https://git.openjdk.org/skara pull/1442/head:pull/1442

PR: https://git.openjdk.org/skara/pull/1442


More information about the skara-dev mailing list