RFR: 8297200: java/net/httpclient/SpecialHeadersTest.java failed once in AssertionError due to selector thread remaining alive [v4]

Mon Nov 28 15:52:21 UTC 2022

On Fri, 25 Nov 2022 12:15:47 GMT, Daniel Fuchs <dfuchs at openjdk.org> wrote:

>> The java/net/httpclient/SpecialHeadersTest.java test has been observed failing once, due to a selector thread still running after all operations had terminated. This change slightly alter the test logic to check the shared client only at the end of the test when all methods have been executed, and focussed the checks performed by a test method to the only clients that this method is using. This may or may not prevent the observed issue to ever reproducing, but should at least provide more information on where the selector manager thread is at the time it is observed running, should the test fail again:
>> the logic of the HttpClient should have woken up the selector when the last operation finished, which should have ensured a prompt termination of the thread.
>> In addition this change fixes a small race condition in the logic that attempts to decide if the selector is still alive: the selector should be considered alive until its run() method has been called and has exited (or Thread::start has thrown).
>
> Daniel Fuchs has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision:
> 
>  - Merge branch 'master' into SpecialHeadersTest-8297200
>  - Update test/jdk/java/net/httpclient/ReferenceTracker.java
>    
>    Co-authored-by: Andrey Turbanov <turbanoff at gmail.com>
>  - Fixed race condition in detecting if selector is still alive
>  - 8297200
>  - 8297200

Hello Daniel, I've been looking at this proposed change, the current test and the failure logs that I have access to. The changes being proposed here at a high level are as follows:

1. Keeping track of whether or not the selector thread was started successfully (through `Thread.start()` call). Using that information to (only) decide whether a selector is active or not in reference tracking implementation.
2. Cleaning up the (internal) implementation of the (test code) `ReferenceTracker`
3. Updating the test to check for references per client tracker instead of all trackers.

Doing `#2` and `#3` changes I think are fine (I added some review comments for those but those can be addressed in a straightforward way).

`#1` is where I am having more and more questions. But before I get to those, I was looking at the test log run that failed and I couldn't find any crucial debug logs that our HttpClient generates. That would have given some hint on whether or not the selector manager thread shutdown was initiated. Looks like the `SpecialHeadersTest` doesn't set the `jdk.internal.httpclient.debug`. That also means that our logic in the `beforeMethod` to stop on first failure ended up being a no-op. Perhaps we should add that system property and wait for some runs to fail to capture the necessary details?

Coming to the failing `testHomeMadeIllegalHeader` test - this test issue a request whose `HttpRequest` instance is of a custom class and includes a disallowed request header. The test then calls `client.send(req, BodyHandlers.ofString());` and expects that to fail with a `IllegalArgumentException`. And looking at the logs, it does fail with this exception. We then set the `client = null`, run a GC and waits for 500 millis for all references and operations to finish. But apparently this test fails because the selector manager is still alive. The selector manager can only be alive if the facade is non-null (it is null from the logs) or pending operation count is more than 0. In this specific test the pending operation count will always be 0 since the request failed very early in the `send` operation and we never increment the pending operation count at that time. So that means that the selector manager thread would have been woken up and it would have started the shutdown process. The 
 `shutdown` process in the selector manager can be expensive since it closes all connections, but in this specific test there should be no connections at all since the request generation itself fails very early. So perhaps that means the selector manager thread has just started (and marked itself as alive) but hasn't yet reached the `selector.select` statement when the tracker checked if it was alive? Looking at the code that seems possible and perhaps explains why shortly later the tracker in its logging noted that the selector manager is no longer alive. This would then mean that the tracker is (and will always?) be inherently racy, isn't it?

Coming to the part `#1` change - so it appears that we now maintain a state if the selector manager thread doesn't start (i.e. fails in the rare case where `Thread.start()` throws Exception). We consider the selector manager to be alive in such cases. But this "aliveness" is only used for reference tracking. I think, if the selector manager thread has failed to start, that would mean the `HttpClient` instance cannot function, isn't it? But as far as I can see it wouldn't start throwing up errors but would silently accumulate requests (I'm using the term `requests` loosely here, but I mean the internal `AsyncEvent` instances). Did I read the code correctly? If so, should we be somehow propagating this thread start failure as actual exceptions whenever requests are issued?

-------------

PR: https://git.openjdk.org/jdk/pull/11294