RFR: 1680: NotifyBot::getPeriodicItems takes too long

Wed Nov 23 18:57:00 UTC 2022

On Wed, 23 Nov 2022 14:00:48 GMT, Magnus Ihse Bursie <ihse at openjdk.org> wrote:

>> When starting up the notify bot, the first round of `getPeriodicItems` was observed to make >2500 calls to github for repository branches (taking a total of 18 minutes). This call originates from `NotifyBot::isOfInterest`, which looks for pre-integration branches matching the PR. There is one call for every PR being considered, plus pagination. The JDK repository has so many open PRs at the moment that this call needs 8 pages. ~250 PRs x 8 pages is 2000 calls just for that repo. On top of this, there is also another check in `isOfInterest` that looks for a "ready comment". This check needs to parse all comments of a PR, which there can also be quite a few of, making it deceptively expensive.
>> 
>> The method `Bot::getPeriodicItems` is expected to a be a quick method that gets called at a set interval to spawn new `WorkItem` instances. The `BotRunner` calls all the `getPeriodicItems` serially, so if they start taking significant time, it adds up pretty quickly. This is especially true in the very first round after a bot restart, where a lot of PRs need to be evaluated. The intent of having this "quick" discarding of PRs in `getPeriodicItems` was to avoid the overhead of spawning WorkItems for PRs that could easily be identified as not needing it, but in this case, it seems to have backfired.
>> 
>> To fix this, I'm moving the `isOfInterest` method away from `NotifyBot` and into `PullRequestWorkItem::run` and making it the first thing to be called. By doing this, we add the overhead of always spawning a `WorkItem`, but we move the heavy processing into the WorkItems, who execute with high concurrency, away from the serially executed `getPeriodicItem`. Looking at metrics data, it's pretty clear that the initial `getPeriodicItems` isn't able to saturate the thread pool with enough WorkItems today due to all these expensive checks, so this change alone should improve latency quite significantly for the first round.
>> 
>> To further reduce unnecessary work, I'm rearranging `isOfInterest` a bit, and only checking branches if the labels aren't found on the PR, and the prbranch notifier is actually configured on the repo. I'm also changing how we check for the existence of the pre integration branch. Instead of fetching all branches (which may result in 8+ pages), I'm rewriting the `HostedRepository::branchHash` method to return an `Optional<Hash>` that will be empty on 404. That makes it usable for checking if a branch exists.
>
> forge/src/main/java/org/openjdk/skara/forge/github/GitHubRepository.java line 300:
> 
>> 298:                 .onError(r -> r.statusCode() == 404 ? Optional.of(JSON.object().put("NOT_FOUND", true)) : Optional.empty())
>> 299:                 .execute();
>> 300:         if (branch.contains("NOT_FOUND")) {
> 
> I'm trying to figure out the logic here. So for a 404, you return a special "NOT_FOUND" object, which you detect after the request is done, to return Optional.empty().
> 
> But for any other errors..? This `var` syntax might be convenient when working in an IDE, but it sure makes code reviewing difficult. I assume `branch` is a JSON object, and that the results from onError are unwrapped before returning it from the request. But how will this work with the return statement below? 
> 
> branch.get("commit").get("sha") seems like it will fail with an exception...

This is an established pattern in Skara that I Just copied. The `onError` lambda returns an optional. If that optional is empty, then the default error behavior happens, which is to throw exception. So by only returning an object on 404, then that's the only result code we are changing the behavior on.

I'm not super satisfied with returning a JSON object with a special string in to signal something, it feels very scripty to me, but it's what we have.

-------------

PR: https://git.openjdk.org/skara/pull/1422