RFR: 7903894: Agents can vanish without a trace

Sun Aug 24 04:44:25 UTC 2025

Can I please get a review of this change which proposes to address the issue noted in https://bugs.openjdk.org/browse/CODETOOLS-7903894?

One part of jtreg's implementation involves pooling agents. Each agent is a construct which communicates with an agent server, over a socket, to execute a test's actions (compile, main, junit, etc.). Each agent server is a Java process and ultimately runs the test's actions. Once a test action completes, the agent is pooled so that subsequent actions can be run against that agent, when relevant.

While an agent is pooled, some activity in the agent server (typically a daemon thread that's running in that Java process) can cause the JVM to crash and the agent server process to exit. Later, when jtreg picks an eligible agent from the pool, it can end up picking an agent whose agent server process has already died. When that agent is chosen and instructed to run the test's action (through an instruction over the socket), the communication fails, typically with one of a variety of communication-related error messages, including the one noted in the linked issue: "Error. Agent communication error: java.net.SocketException: Connection reset by peer". This communication failure causes the test action to fail.

The commit in this PR addresses this issue by checking whether the agent server process is alive both when placing the agent in the pool and when fetching it out of the pool to allocate it for a test action. With the changes in this PR, if the agent server process isn't alive, the agent is neither placed in the pool nor used for the test action. This prevents the communication exceptions that have been reported due to agent server processes crashing and jtreg subsequently using those agents for test execution.
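To illustrate the idea (this is not the code in the PR), here is a minimal sketch of a pool that checks liveness both on save and on get. The names `PooledAgent`, `LivenessCheckedPool` and `FakeAgent` are hypothetical and do not exist in jtreg; a real agent would presumably delegate its liveness check to `Process.isAlive()` on the agent server process.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical stand-in for an agent; only the liveness check matters here. */
interface PooledAgent {
    /** A real agent would presumably delegate to Process.isAlive(). */
    boolean isAlive();

    /** Releases whatever resources the agent holds (socket, process handle). */
    void close();
}

/** Sketch of a pool that refuses dead agents on the way in and on the way out. */
final class LivenessCheckedPool<A extends PooledAgent> {
    private final Deque<A> idle = new ArrayDeque<>();

    /** Returns true if the agent was pooled, false if it was dead and discarded. */
    synchronized boolean save(A agent) {
        if (!agent.isAlive()) {
            agent.close();
            return false;
        }
        idle.addLast(agent);
        return true;
    }

    /** Returns a pooled agent that is alive at the time of the check, or null. */
    synchronized A get() {
        A agent;
        while ((agent = idle.pollFirst()) != null) {
            if (agent.isAlive()) {
                return agent;
            }
            agent.close(); // the agent server died while the agent was pooled
        }
        return null;
    }
}

/** Demo-only agent whose liveness can be flipped by hand. */
final class FakeAgent implements PooledAgent {
    volatile boolean alive = true;
    volatile boolean closed = false;

    @Override public boolean isAlive() { return alive; }
    @Override public void close() { closed = true; }
}
```

In this sketch a caller that gets `null` back from `get()` would be expected to start a fresh agent, just as it would when the pool is empty.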

Do note that the sequence of picking an agent for test action execution and the agent server process crashing is inherently racy, i.e. the agent server process could crash immediately after the agent has been chosen for test action execution. The commit in this PR does not address that inherent race (and it's not the goal of this change). The current change, however, should be able to prevent several of the failures that happen due to using a dead agent.
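The remaining window can be shown with a small, deterministic simulation. Everything in it is hypothetical (`RacyAgent`, `CheckThenActDemo`, and the simulated error text are illustration only, not jtreg code); the point is merely that a liveness check is a snapshot, so a crash that lands after the check still surfaces as a communication error.

```java
import java.io.IOException;

/** Hypothetical agent stand-in: execute() fails once the backing process is gone. */
final class RacyAgent {
    private volatile boolean alive = true;

    boolean isAlive() { return alive; }

    /** Simulates the agent server process crashing. */
    void crash() { alive = false; }

    /** Simulates sending a test action over the socket. */
    String execute(String action) throws IOException {
        if (!alive) {
            throw new IOException("Connection reset by peer (simulated)");
        }
        return "ran " + action;
    }
}

final class CheckThenActDemo {
    /** Checks liveness, then forces a crash inside the window before the action runs. */
    static String runWithCrashInWindow(RacyAgent agent) {
        if (!agent.isAlive()) {
            return "skipped dead agent";
        }
        // The window: the check passed, but the process can still die before use.
        agent.crash();
        try {
            return agent.execute("main");
        } catch (IOException e) {
            return "communication error: " + e.getMessage();
        }
    }
}
```

An agent that is already dead before the check is skipped, which is the case this change targets; an agent that dies inside the window still fails the action.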

I initially experimented with a self test to reproduce this issue and verify this fix. It wasn't straightforward given the nature of this issue, but I did manage to write one. However, that test can fail intermittently (due to timing issues), so I decided not to introduce it. I have, however, manually run some tests which reproduce the original issue and verified that the fix addresses the problem noted in the linked issue.

Existing self tests continue to pass. While at it, I also removed a couple of unused internal methods from the `Agent` class.

-------------

Commit messages:
 - 7903894: do not reuse a agent whose Process has died

Changes: https://git.openjdk.org/jtreg/pull/276/files
  Webrev: https://webrevs.openjdk.org/?repo=jtreg&pr=276&range=00
  Issue: https://bugs.openjdk.org/browse/CODETOOLS-7903894
  Stats: 42 lines in 1 file changed: 17 ins; 15 del; 10 mod
  Patch: https://git.openjdk.org/jtreg/pull/276.diff
  Fetch: git fetch https://git.openjdk.org/jtreg.git pull/276/head:pull/276

PR: https://git.openjdk.org/jtreg/pull/276