RFR: 1153: Some emails still not posted as comments in PR

Thu Sep 2 21:49:53 UTC 2021

After fixing SKARA-1148, there were still some emails that weren't being posted as comments in PRs. The underlying cause this time was emails that Skara thinks it has sent, but they never were, so they weren't present in the mail archive. Emails sent in response to such missing emails have bad In-Reply-To headers as they point to emails that aren't in the archive. When the Mbox parser tries to piece together conversations, it fails to find the correct parents in these cases.

The reason we need to piece conversations together is to find the original RFR email, which contains the pull request link. This is how each email is associated with the correct pull request.

The fix for this is to also look at the References header as a backup. The References header contains the whole ancestor chain of email IDs, so using that, we can connect conversations regardless of missing emails in between, especially since we only really care about the first email in the conversation. Looking at the pipermail thread view, this is what it must be doing as it would otherwise fail miserably. I also tweaked the loops here to hopefully do a bit less work, which I think is worth it due to the pretty large number of emails being parsed here.

In addition to this small change to Mbox, I'm also tinkering with some other things that are related. These were things I stumbled over and needed to properly debug and test the issue. 

As I noted in SKARA-1148 already, the MailingListArchiveReaderBot is doing a lot of redundant work. I blame this on SKARA-843, where this bot was changed from one global instance to running one instance for each configured repository. There were two problems introduced with this change.

1. Each instance still has all the configured repositories, so every found conversation is evaluated against every repository in every instance. This creates an unnecessary N^2 complexity in number of configured repositories.
2. Each instance has its own MailingListReader, configured for exactly the set of mailing lists used for the repository. We have a lot of repositories that share the exact same mailing list config. This means that each of these instances will read all the archives for themselves, with no sharing of this data. The reader already does caching, so after the first time around, it's much faster. Still, the first time around we read jdk-updates-dev a pretty large number of times.

My solution for 1 is to only have one single repository in each MailingListArchiveReaderBot. This looks like a simple oversight in the previous patch. 

For 2, I make sure to only create one MailingListReader for each unique set of mailing lists. This will not remove all the redundant reads, but it will bring them down significantly. I also think it's the functionally correct solution as we will then only consider email threads on the relevant mailing lists for each repository, so less unnecessary (and potentially bad) cross evaluation between mailing lists and repositories that aren't configured as related.

To get 2 to actually work correctly, I needed to tweak the logic that protects us from running the wrong WorkItems concurrently. To be able to share MailingListReader between multiple WorkItems, we have WorkItem::concurrentWith. Unfortunately, this method was also used to detect duplicate WorkItems in the pending pool of the scheduler. This is correct in most cases, but not here. An ArchiveReaderWorkItem for jdk11u and jdk15u shares MailingListReader, but queueing one should not replace the other. To solve this I added another method WorkItem::replaces for this particular check (with a default method to preserve the current behavior for all other WorkItems).

Finally I also changed ArchiveReaderWorkItem::toString to print the repository name rather than the list of mailing lists. This made a lot more sense to me while debugging as we now create an instance per repository rather than based on mailinglists.

-------------

Commit messages:
 - SKARA-1153

Changes: https://git.openjdk.java.net/skara/pull/1214/files
 Webrev: https://webrevs.openjdk.java.net/?repo=skara&pr=1214&range=00
  Issue: https://bugs.openjdk.java.net/browse/SKARA-1153
  Stats: 156 lines in 8 files changed: 100 ins; 18 del; 38 mod
  Patch: https://git.openjdk.java.net/skara/pull/1214.diff
  Fetch: git fetch https://git.openjdk.java.net/skara pull/1214/head:pull/1214

PR: https://git.openjdk.java.net/skara/pull/1214