RFR: 7903580: Allow for re-attempting agent creation when an attempt fails
    Jaikiran Pai 
    jpai at openjdk.org
       
    Mon Nov 13 14:05:35 UTC 2023
    
    
  
On Mon, 13 Nov 2023 13:54:56 GMT, Jaikiran Pai <jpai at openjdk.org> wrote:
> Can I please get a review of this change which proposes to implement the enhancement request noted in https://bugs.openjdk.org/browse/CODETOOLS-7903580?
> 
> Several times in our CI instances we have noticed that a test (action) fails because it is unable to create an `Agent` instance due to the socket connection not being established between the `AgentServer` and the `ServerSocket` running within the jtreg process. This causes the test itself to fail with `Error. Cannot get VM for test: java.net.SocketTimeoutException: Accept timed out`. It has also been noticed that most of the times this is intermittent and subsequent attempt for a different test (action) passes. 
> 
> The proposed change here introduces a new configuration parameter `--agent-attempts` under "Agent Pool Options", which allows for configuring the maximum number of attempts that are allowed for getting an agent for an action. This is an optional parameter and by default it has been set to a value of `2`, which by default then allows the agent creation to be retried once if the previous attempt fails. This then means that existing installations/usages of jtreg need not set a value for `--agent-attempts` to enroll for this feature. A value of `1` for this parameter implies that the agent creation attempt will be done only once, which is what the behaviour is currently. 
> 
> Inability to obtain an agent, for whatever reason, after attempting `--agent-attempts` times continues to result in test (action) failure, like it does today.
> 
> Some additional minor logging related changes have been done too, to help debugging some of the current observed failures.
> 
> This change has been tested in the following manners:
> 
> 1. This modified build of jtreg has been used to run tier1, tier2 and tier3 of JDK mainline and it has been verified that nothing regresses with this change. That however doesn't mean that agent creation re-attempt logic was tested in these runs (since the first attempt never failed).
> 2. Locally this build was jtreg was additionally modified to selectively not connect to the `ServerSocket`'s port thus simulating a "Accept timed out" exception. That change then verified that the re-attempt logic does indeed correctly kick in and the test action passes after picking up the newly created agent.
> 3. `--agent-attempts 1` was passed to the above modified build (from step 2) to verify that the re-attempt isn't attempted when `--agent-attempts 1` and the test fails as currently with "test result: Error. Cann...
src/share/classes/com/sun/javatest/regtest/exec/RegressionScript.java line 1241:
> 1239:                     p.log("Re-attempting agent creation, attempt number " + i);
> 1240:                 }
> 1241:                 Agent agent = p.getAgent(absTestScratchDir().toFile(), jdk, vmOpts.toList(),
I initially wanted to do this re-attempt logic within the `Pool.getAgent(...)` method, but the way the pool is implemented, its methods are all `synchronized`. The `Agent` instance creation which happens in the `pool.getAgent(...)` involves socket connection attempt (which can wait for a long time). This effectively is a stop the world operation for jtreg process because none of the other methods on the pool, like returning back an agent from an already complete test (action) stalls till this method returns. The `synchronization` constructs in the pool isn't something that's introduced in this PR and instead are already present (and may need some improvement, but that's for a separate discussion/fix). Adding this re-attempt logic within the `pool.getAgent(...)` is going to extend that "stall the world" operation. So I decided to have this implemented outside of that method. Since anyway this retry logic is an internal detail, we can move this logic into the `Pool`, once/if we improve
 s the synchronization constructs in it.
-------------
PR Review Comment: https://git.openjdk.org/jtreg/pull/173#discussion_r1391156987
    
    
More information about the jtreg-dev
mailing list