Integrated: 7903894: Agents can vanish without a trace

Tue Aug 26 09:30:58 UTC 2025

On Sun, 24 Aug 2025 04:40:33 GMT, Jaikiran Pai <jpai at openjdk.org> wrote:

> Can I please get a review of this change which proposes to address the issue noted in https://bugs.openjdk.org/browse/CODETOOLS-7903894?
> 
> One part of jtreg's implementation involves pooling agents. Each agent is a construct which communicates with an agent server, over a socket, to execute a test's actions (compile, main, junit, etc.). Each agent server is a Java process and is what ultimately runs the test's actions. Once a test action completes, the agent is pooled so that subsequent actions can be run against that agent, when relevant.
> 
> While an agent is pooled, some activity in its agent server (typically a daemon thread still running in that Java process) can crash the JVM and cause the agent server process to exit. Later, when jtreg picks an eligible agent from the pool, it can end up picking one whose agent server process has already died. When that agent is instructed to run the test's action (through an instruction over the socket), the communication fails, typically with one of several communication-related error messages, including the one noted in the linked issue: "Error. Agent communication error: java.net.SocketException: Connection reset by peer". This communication failure causes the test action to fail.
> 
> The commit in this PR addresses this issue by checking whether the agent server process is alive both when placing the agent in the pool and when fetching it out of the pool to allocate it for a test action. With the changes in this PR, if the agent server process isn't alive, the agent is neither placed in the pool nor used for the test action. This prevents the communication exceptions that have been reported when agent server processes crash and jtreg subsequently uses them for test execution.
> 
> Do note that the sequence of picking an agent for test action execution and the agent server process crashing is inherently racy - i.e. the agent server process could crash immediately after the agent has been chosen for test action execution. The commit in this PR does not address that inherent race (and it's not the goal of this change). The current change, however, should be able to prevent several of the failures that happen due to using a dead agent.
> 
> I initially experimented with a self test to reproduce this issue and verify this fix. It wasn't straightforward given the nature of this issue, but I did manage to write one. However, that test can result in intermittent failures (due to...
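
The liveness check described in the quoted message can be illustrated with a minimal sketch. This is not the code from the changeset: `PooledAgent`, `AgentPool`, `release`, and `acquire` are hypothetical names chosen for illustration, and the agent server is modelled as a simple liveness probe rather than a real socket-connected process. The sketch only shows the idea of refusing dead agents on the way into the pool and discarding them on the way out.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.BooleanSupplier;

/** Hypothetical stand-in for an agent; not the actual jtreg class. */
final class PooledAgent {
    final String name;
    // Liveness probe for the agent server process (assumption: for a real
    // java.lang.Process this could be process::isAlive).
    private final BooleanSupplier serverAlive;

    PooledAgent(String name, BooleanSupplier serverAlive) {
        this.name = name;
        this.serverAlive = serverAlive;
    }

    boolean isServerAlive() {
        return serverAlive.getAsBoolean();
    }
}

/** Hypothetical pool that never stores or hands out an agent whose server has exited. */
final class AgentPool {
    private final Deque<PooledAgent> idle = new ArrayDeque<>();

    /** Pools the agent if its server is alive; returns false if it was dropped. */
    synchronized boolean release(PooledAgent agent) {
        if (!agent.isServerAlive()) {
            return false; // server already died: do not pool it
        }
        idle.addLast(agent);
        return true;
    }

    /** Returns a pooled agent whose server is still alive, or null if none remains. */
    synchronized PooledAgent acquire() {
        PooledAgent agent;
        while ((agent = idle.pollFirst()) != null) {
            if (agent.isServerAlive()) {
                return agent;
            }
            // Server died while the agent was idle: discard it and keep looking.
        }
        return null; // caller would start a fresh agent instead
    }

    synchronized int size() {
        return idle.size();
    }
}
```

As the quoted message notes, a check like this narrows the window but cannot close it: the server can still exit between `acquire()` returning and the first instruction being sent over the socket.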

This pull request has now been integrated.

Changeset: 38095440
Author:    Jaikiran Pai <jpai at openjdk.org>
URL:       https://git.openjdk.org/jtreg/commit/38095440bf690deced8d2fa19f1b4a0a9f5a21f4
Stats:     42 lines in 1 file changed: 17 ins; 15 del; 10 mod

7903894: Agents can vanish without a trace

Reviewed-by: cstein

-------------

PR: https://git.openjdk.org/jtreg/pull/276