RFR: 1606: PullRequestPoller always processes the last updated MR for GitLab [v4]

Fri Oct 7 17:55:49 UTC 2022

On Fri, 7 Oct 2022 09:19:16 GMT, Erik Helin <ehelin at openjdk.org> wrote:

> First of all, I think your changes are fine, the reasoning and the code checks out. I have a bunch of suggestions, but those you can choose whether you want to pursue or not, they do not have to be made within the scope of this PR. Smaller things to consider:

I also have a lot of ideas for how to further optimize. I'm trying to work iteratively on this so that each change can be easily proven correct (to make reviews easier and to hopefully avoid some mistakes). I also want to evaluate the metrics I have added between changes to make sure we attack the right things. 

> * in `GitLabMergeRequest`, since you are already caching the `comparisonSnapshot`, should we instead cache the `notes` JSON result? There are several methods in `GitLabMergeRequest` that makes use of the `notes` JSON, then they would be able to utilise this cached result as well (saving a number of requests). This thought can of course be extended to other JSON objects that `GitLabMergeRequest` fetches as well.

Maybe, but I'm not sure if every fetch of a GitLabMergeRequest will eventually result in fetching notes. I thin I have been leaning towards a lazy fetch of notes instead. That would still be two queries, or rather an extra query per MR, but still a lot less than we have today.

> * very minor nit, but I would have named `comparisonSnapshot` just `snapshot`, but that is up to you 😄

I picked the name to make it clear that the intended purpose of this snapshot object is only for comparisons. Different implementations will include different sets of data, so no guarantees are given on anything besides that. Not sure I succeeded with that name though, so may change as suggested.

> * more of a question than a comment: the snapshot functionality will mean that we won't even run the bots if a MR hasn't changed, but what is the main cost of running the bots on a MR that hasn't changed? My guess would be that the expensive part are all the requests that the bots have to make in order to figure out that they don't have to do anything, but if we cache more aggressively, would that cost go away? Anyways, just a thought

There are plenty of other things we do in most WorkItems. We often instantiate a local workspace based on the PR source branch. This is quite expensive (5-10s in some repos). We instantiate some kind of Census instance, also in the order of seconds. In the mailinglist bridge, instantiating the marks workspace takes over 1 minute(!) but that is something I plan to address separately.

> Now for a bigger experiment 🧑‍🔬 One annoying thing is that the GitLab REST API doesn't return the comments for a MR when fetching a single MR (unlike GitHub which does). So I played around a bit with the GitLab GraphQL API and think I managed to achieve this:

Playing around with GraphQL could certainly help reducing the number of queries needed. I haven't tried that yet.

-------------

PR: https://git.openjdk.org/skara/pull/1380