RFR: 1598: GitRepository.isHealthy takes a long time in ArchiveWorkItem

Wed Dec 14 12:13:49 UTC 2022

On Mon, 12 Dec 2022 23:52:50 GMT, Erik Joelsson <erikj at openjdk.org> wrote:

> In the mlbridge bot, the `ArchiveWorkItem` is spending an inordinate amount of time running `git fsck --connectivity-only`. This is part of a rather sane strategy of verifying a repository before using it. This is done when a repository is materialized using the `HostedRepositoryPool`, if a previous local copy is found and reused, it's first checked using `Repository::isHealthy`, which runs the fsck. In the `ArchiveWorkItem` case, the repository in question is the archive repository, where all sent emails are stored for reference (and to prevent resending duplicate messages). The repository is materialized first in every `ArchiveWorkItem::run` call as it needs to read the current archive for the PR. Especially on bot restart, this triggers a lot of extra work.
> 
> This repository is growing quite fast as it contains every PR for every repository we have Skara running against on GitHub. Having every ArchiveWorkItem verifying all this contents just doesn't scale well, when each work item will only ever read and write to a single file in the repository.
> 
> To eliminate this inefficiency, I'm changing `ArchiveWorkItem` to read and write the single file over REST API instead. This does introduce another level of inefficiency, where the complete file must be transferred to and from the server each time, instead of just diffs, but I think given the size of the repo, and the rate at which it is growing (not to mention the fsck checks), this is the better tradeoff. 
> 
> The patch adds a new method `HostedRepository::writeFileContents` (that pairs with the existing `::fileContents`). I've added manual tests that verifies these two methods together for both GitHub and GitLab. This prompted rewriting of `GitHubRepository::fileContents` to use "raw" mode for retrieving file contents. This in turn made me discover that we could remove all the preview feature `Accept` headers, as all those GitHub APIs are stable by now. (There is currently a bug in `RestRequest` causing only the last header with the same key to be used, so only the last of the list of preview headers was ever actually in effect. This bug was also preventing me from using raw mode, as that requires another `Accept` header. Given this, and that the mockingbird API is verified to be stable now, I'm confident in removing all of them. I will file a separate issue for the bug in `RestRequest`.)
> 
> To use "raw" mode, I needed to use `RestRequest::executeUnparsed` and discovered that it wasn't behaving quite the same as the normal `::execute`. I changed it to throw `UncheckedRestException` like its sibling. I inspected all uses of the method and made adjustments where needed.
> 
> `TestHostedRepository::fileContents` didn't return the raw contents of files. Specifically it didn't preserve trailing newlines. To fix this, I changed it to no longer convert to lines and then joining them again.

bots/mlbridge/src/main/java/org/openjdk/skara/bots/mlbridge/ArchiveWorkItem.java line 252:

> 250:         var sentMails = new ArrayList<Email>();
> 251:         var archiveContents = bot.archiveRepo().fileContents(mboxFile(), bot.archiveRef());
> 252:         archiveContents.ifPresent(s -> sentMails.addAll(Mbox.splitMbox(s, bot.emailAddress())));

So this adds the old contents to sentMails, and if there is none, we just ignore adding it? I needed some time to figure out why it was safe doing nothing if archiveContents was empty. Perhaps a comment about this?

forge/src/main/java/org/openjdk/skara/forge/HostedRepository.java line 100:

> 98:      * @param authorEmail Email of author and committer for commit
> 99:      */
> 100:     void writeFileContents(String content, String filename, Branch branch, String message, String authorName, String authorEmail);

Maybe not a big deal, but I thought the symmetry with `fileContents` was a bit lost when `String filename` was not the first argument to this method.

forge/src/main/java/org/openjdk/skara/forge/github/GitHubRepository.java line 272:

> 270:             return Optional.of(content);
> 271:         } catch (UncheckedRestException e) {
> 272:             // The onError handler is not used with executeParsed, so have to

Suggestion:

            // The onError handler is not used with executeUnparsed, so have to

forge/src/main/java/org/openjdk/skara/forge/github/GitHubRepository.java line 292:

> 290:                         .put("name", authorName)
> 291:                         .put("email", authorEmail))
> 292:                 .put("content", new String(Base64.getEncoder().encode(content.getBytes(StandardCharsets.UTF_8)), StandardCharsets.UTF_8));

Are we talking about 10's of MB here as well? We don't risk running into limits?

-------------

PR: https://git.openjdk.org/skara/pull/1442