RFR: 1042: Watchdog causing multiple restarts for mlbridge

Erik Joelsson erikj at openjdk.java.net
Fri May 14 20:29:04 UTC 2021


When certain bots start with a fresh scratch area, they currently end up in a restart loop. All the threads immediately get busy cloning repos, which starves the watchdog of pings for longer than the hard-coded 10 minutes. This patch changes the watchdog to use the "watchdog" configuration setting as the restart timeout instead. That value is currently used for a log warning, which is also driven by the watchdog, so to still allow separate values I've introduced a new option, "watchdog_warn", which can optionally be set for just the warning.
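
To make the two thresholds concrete, here is a minimal Java sketch of the decision described above. It is not the code from the patch: the class name WatchdogSketch, the Action enum, and both method names are hypothetical, and the fallback of "watchdog_warn" to the "watchdog" value when unset is an assumption. The durations in main are illustrative only.

```java
import java.time.Duration;
import java.util.Optional;

// Illustrative sketch only; all names here are hypothetical, not Skara's API.
public class WatchdogSketch {
    public enum Action { NONE, WARN, RESTART }

    // Assumption: when "watchdog_warn" is not configured, the warning
    // reuses the "watchdog" (restart) timeout.
    public static Duration resolveWarnTimeout(Duration watchdog, Optional<Duration> watchdogWarn) {
        return watchdogWarn.orElse(watchdog);
    }

    // Decide what to do given the time elapsed since the last watchdog ping.
    public static Action evaluate(Duration sinceLastPing, Duration warnTimeout, Duration restartTimeout) {
        if (sinceLastPing.compareTo(restartTimeout) > 0) {
            return Action.RESTART;
        }
        if (sinceLastPing.compareTo(warnTimeout) > 0) {
            return Action.WARN;
        }
        return Action.NONE;
    }

    public static void main(String[] args) {
        // Illustrative values: restart after 1 hour, warn after 10 minutes.
        Duration restart = Duration.ofHours(1);
        Duration warn = resolveWarnTimeout(restart, Optional.of(Duration.ofMinutes(10)));

        // Threads busy cloning repos for 25 minutes without a ping.
        Duration cloning = Duration.ofMinutes(25);
        System.out.println(evaluate(cloning, warn, restart));                // WARN
        // With a 10 minute restart timeout, the same gap forces a restart.
        System.out.println(evaluate(cloning, warn, Duration.ofMinutes(10))); // RESTART
    }
}
```

With a configurable restart timeout, a long initial clone only triggers the warning, while the restart is reserved for a much longer silence.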

In addition, I added a bit more logging to make it easier to follow in logstash when watchdog pings occur or when a new instance of a bot runner is started. Failures to start due to configuration errors are now also reported through proper logging.

-------------

Commit messages:
 - Only read watchdogWarnTimeout from config once
 - Add separate watchdog_warn config setting to fix test
 - Use watchdog timeout for restart instead of hardcoded 10m

Changes: https://git.openjdk.java.net/skara/pull/1157/files
 Webrev: https://webrevs.openjdk.java.net/?repo=skara&pr=1157&range=00
  Issue: https://bugs.openjdk.java.net/browse/SKARA-1042
  Stats: 23 lines in 5 files changed: 15 ins; 0 del; 8 mod
  Patch: https://git.openjdk.java.net/skara/pull/1157.diff
  Fetch: git fetch https://git.openjdk.java.net/skara pull/1157/head:pull/1157

PR: https://git.openjdk.java.net/skara/pull/1157

