The common ForkJoinPool does not have any ForkJoinWorkerThread while tasks are submitted to the queue

Sat Feb 3 18:54:25 UTC 2024

Hi Jaikiran,

Thanks a lot for replying.

Our application is a client that communicates with a server using
request/response. The client creates a secure (TLS) connection to the
server: on top of the SocketChannel, we implement a wrapper class called
SSLDataChannel for reading and writing. The SSLDataChannel uses
javax.net.ssl.SSLEngine. Before any read or write can happen, we need to
complete the SSL handshake by calling methods on SSLEngine. One of these
methods is SSLEngine#getDelegatedTask(). The returned task needs to be
executed before the handshake can proceed. After the task is done, we need
to continue processing read and write events on the connection. The
connection read and write events are all handled by a class called
NioEndpointHandler. One requirement for our client is that it supports an
asynchronous API, and therefore the whole stack must implement
non-blocking methods. The tasks from the SSLEngine could take a long time
and we do not want them to block our other connection events, and this is
where the ForkJoinPool is used. We run the SSL tasks in the ForkJoinPool
and, after the tasks are done, we arrange to run the NioEndpointHandler
callbacks to proceed with the read and write events. The much simplified
code looks somewhat like the following.

```
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ForkJoinPool;

class NioEndpointHandler {

    /** The SSL channel. */
    private final SSLDataChannel sslDataChannel;
    /** The runnable to execute to handle a read after the SSL tasks are done. */
    private final Runnable handleReadAfterSSLTask = () -> onRead();
    /** The handler state. */
    State state;

    /** Executes the SSL tasks until there is no task to run, then runs the callback. */
    private void executeSSLTask(ExecutorService executor, Runnable callback) {
        /* The submitted lambda captures this handler (via sslDataChannel). */
        executor.submit(() -> {
            Runnable task;
            while ((task = sslDataChannel.getSSLEngine().getDelegatedTask()) != null) {
                task.run();
            }
            try {
                callback.run();
            } catch (Throwable t) {
                /* Log the exception. */
            }
        });
    }

    /** Handles a read event. */
    private void onRead() {
        if (sslDataChannel.needsHandshake()) {
            /* Do the handshake. */

            /* One of the handshake steps is to check whether there is any SSL task to run. */
            if (sslDataChannel.needExecuteTask()) {
                executeSSLTask(ForkJoinPool.commonPool(), handleReadAfterSSLTask);
            }
        }
    }

    private void terminate() {
        state = State.TERMINATED;

        /* Other clean-up tasks; however, tasks submitted to the ForkJoinPool are not cancelled. */
    }
}
```

> What are these handlers? Are they classes which implement Runnable or
> are they something else? What does termination of handler mean in this
> context? Do you use any java.util.concurrent.* APIs to "cancel" such
> terminated handlers?

Please see the much simplified handler code above.

The tasks submitted to the ForkJoinPool queue are Runnables that are fields
of the NioEndpointHandler. What we have observed is that there are a lot of
tasks in the fork join pool that have a reference to the lambda inside
NioEndpointHandler#executeSSLTask, which in turn has a reference to the
NioEndpointHandler. Those NioEndpointHandler instances are in the
TERMINATED state. These tasks are the only references to those
NioEndpointHandler instances; without them, the handlers could be garbage
collected, since termination cleans up all the other references.

Termination of the handler means the connection is at the end of its life
cycle. We clean up things such as signaling the end of the life cycle for
all the associated request/response pairs, closing the SSLDataChannel, etc.

No, we have not used the cancel method to cancel the submitted tasks. I
agree that this is an oversight and it would be cleaner to cancel them.
However, my current theory is that this is not the root cause. From my
understanding of the code, the cancel method only changes the state of the
task. It does not remove the task from the queue of the ForkJoinPool.
Therefore, those tasks, even if cancelled, would still stay in the queue,
preventing the terminated NioEndpointHandler from being garbage collected.
Currently, I am strongly biased toward my own theory that somehow there is
no ForkJoinPool thread polling tasks out of the queue, and I am trying to
use the ctl field in the ForkJoinPool as evidence to back up my theory. I
am wondering if I am making some mistake with my theory.
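
The claim about cancel can be checked with a small experiment. The following is only a sketch under stated assumptions: it uses a private ForkJoinPool with parallelism 1 instead of the common pool, the class name CancelStaysQueued is made up for illustration, and the public getQueuedSubmissionCount() accessor is used as an approximation of "still in the queue". It occupies the only worker, submits a second task, cancels it, and reports how many submissions the pool still counts as queued.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;

/** Hypothetical experiment, not part of the application code. */
final class CancelStaysQueued {

    /** Returns the pool's queued-submission count right after cancelling a queued task. */
    static int queuedAfterCancel() {
        ForkJoinPool pool = new ForkJoinPool(1);   // one worker, so a second task has to wait
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        try {
            pool.submit(() -> {                    // occupies the only worker
                started.countDown();
                awaitQuietly(release);
            });
            awaitQuietly(started);                 // the first task is now running
            ForkJoinTask<?> queued = pool.submit(() -> { });  // cannot start yet
            queued.cancel(false);                  // marks the task as cancelled
            return pool.getQueuedSubmissionCount();
        } finally {
            release.countDown();                   // let the first task finish
            pool.shutdown();
        }
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("queued after cancel: " + queuedAfterCancel());
    }
}
```

A non-zero count here would be consistent with the reading that cancel alone does not release what a queued task references, although the exact behavior may differ between JDK versions.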

> Finally, what does the OutOfMemoryError exception stacktrace look like
> and what is the JVM parameters (heap size for example) used to launch
> this application?

Our client creates about 155 threads and quite a lot of them have OOME on
their stack. I am not quite sure how to reply to this question. Going
through the stack traces, I do not find anything very suspicious. They are
just exercising their most frequent code paths: some I/O threads waiting
for I/O events, some execution threads waiting for more work to do, etc.

It is worth mentioning that there are no ForkJoinWorkerThread stacks in the
thread dump from the heap dump. From my understanding, the only time there
are no such threads is when there are no tasks to run. But there are quite
a lot of tasks in the queue.
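
For checking the same condition on a live process rather than from a heap dump, a minimal sketch along the following lines could be logged periodically. It relies only on public accessors of ForkJoinPool and on Thread.getAllStackTraces(); the class name PoolDiagnostics and the output format are assumptions made for illustration.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinWorkerThread;

/** Hypothetical diagnostic helper, not part of the application code. */
final class PoolDiagnostics {

    /** Formats the public counters of the given pool on one line. */
    static String describe(ForkJoinPool pool) {
        return "parallelism=" + pool.getParallelism()
            + " poolSize=" + pool.getPoolSize()
            + " active=" + pool.getActiveThreadCount()
            + " running=" + pool.getRunningThreadCount()
            + " queuedSubmissions=" + pool.getQueuedSubmissionCount()
            + " queuedTasks=" + pool.getQueuedTaskCount()
            + " steals=" + pool.getStealCount();
    }

    /** Counts live threads that are ForkJoinWorkerThread instances, across all pools. */
    static long liveForkJoinWorkers() {
        return Thread.getAllStackTraces().keySet().stream()
            .filter(t -> t instanceof ForkJoinWorkerThread)
            .count();
    }

    public static void main(String[] args) {
        System.out.println(describe(ForkJoinPool.commonPool()));
        System.out.println("live ForkJoinWorkerThreads: " + liveForkJoinWorkers());
    }
}
```

A pool size of zero together with a growing queuedSubmissions value would match the situation described above, where tasks are queued but no worker thread exists.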

Here are our JVM arguments:

```
-Xms1G
-Xmx1G
-Djava.util.logging.config.file=/var/lib/andc/config/params/sender.logging.properties
-Djavax.net.ssl.trustStore=/var/lib/andc/wallet/client.trust
-Doracle.kv.security=/var/lib/andc/config/security/login.properties
-Doci.javasdk.extra.stream.logs.enabled=false
-XX:G1HeapRegionSize=32m
-XX:+DisableExplicitGC
-Xlog:all=warning,gc*=info,safepoint=info:file=/var/lib/andc/log/sender/sender.gc:utctime:filecount=10,filesize=10000000
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/lib/andc/log/sender/
```

We have creation and termination timestamps in the NioEndpointHandler
object. From what I can see in the heap dump, the SSL tasks in the
ForkJoinPool are associated with NioEndpointHandler instances that were
created at intervals on the order of seconds (retry attempts with
second-magnitude backoff). Each NioEndpointHandler is terminated after a
fixed 5-second timeout because it is unable to connect. The time span
covered by those NioEndpointHandler instances is about 2 hours. This
creates
```
2 hours * 3600 seconds / hour * 1 NioEndpointHandler / second
  * 1 SSLDataChannel / NioEndpointHandler * 65K bytes / SSLDataChannel
  ~= 468M bytes.
```
With 1G heap size, this eventually caused OOME.
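
The estimate can be reproduced with a few lines of arithmetic. This is just a sketch: it takes 65K as 65,000 bytes and one handler per second as in the figures above, and the class and method names are made up for illustration.

```java
/** Hypothetical helper reproducing the retained-heap estimate. */
final class RetainedHeapEstimate {

    /** Bytes retained if every handler created during the time span stays reachable. */
    static long retainedBytes(long hours, long handlersPerSecond, long bytesPerChannel) {
        return hours * 3600L * handlersPerSecond * bytesPerChannel;
    }

    public static void main(String[] args) {
        // 2 hours, 1 handler per second, 65,000 bytes per SSLDataChannel
        System.out.println(retainedBytes(2, 1, 65_000)); // prints 468000000
    }
}
```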

We are adding fixes so that the SSL tasks do not prevent the
NioEndpointHandler from being garbage collected. However, the root cause is
still a mystery and I am wondering if I am on the right track to figure it
out.
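
One possible shape for such a fix, shown only as a sketch and not as the actual change being made, is to submit a small wrapper that can drop its reference to the real work when the handler terminates. The wrapper may still sit in the queue, but it no longer keeps the handler reachable. The class name DetachableTask and its methods are hypothetical.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical wrapper: a queued task that can release its reference to the handler. */
final class DetachableTask implements Runnable {

    private final AtomicReference<Runnable> delegate;

    DetachableTask(Runnable delegate) {
        this.delegate = new AtomicReference<>(delegate);
    }

    /** Called on termination; the wrapper stays queued but no longer pins the delegate. */
    void detach() {
        delegate.set(null);
    }

    boolean isDetached() {
        return delegate.get() == null;
    }

    @Override
    public void run() {
        Runnable r = delegate.getAndSet(null);  // run at most once, then release
        if (r != null) {
            r.run();
        }
    }
}
```

Under this approach the handler would keep the DetachableTask it submitted and call detach() from its terminate method, so each stale queue entry retains only the small wrapper object.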

Thanks a lot for your time and patience.

Xiao Yu

On Fri, Feb 2, 2024 at 5:35 AM Jaikiran Pai <jai.forums2013 at gmail.com>
wrote:

> Hello Xiao,
>
> I don't have enough knowledge of this area to provide any insight into
> the issue. However, just to try and get the discussion started, do you
> have any sample code of your application which shows how the application
> uses the ForkJoinPool? More specifically what APIs do you use in the
> application?
>
> Few other questions inline below.
>
> On 12/01/24 11:30 am, Xiao Yu wrote:
> > ....
> > Here is the full background. One of our process experienced an OOME
> > and a heap dump was obtained. We know there was a concurrent issue of
> > our system happening on some other machines such that network failure
> > and retries occurred in this process at the same time. Upon analyzing
> > the heap dump, we observed a lot of our network connection handlers
> > being frequently created and terminated which is expected due to the
> > network failure and retry attempts mentioned above. However, those
> > terminated handlers are not being GC'ed because of there were
> > references to tasks submitted to the ForkJoinPool during the connection
> > attempts. The tasks stayed in the queue until OOME happened as there is
> > no threads to execute them.
>
> What are these handlers? Are they classes which implement Runnable or
> are they something else? What does termination of handler mean in this
> context? Do you use any java.util.concurrent.* APIs to "cancel" such
> terminated handlers?
>
> Finally, what does the OutOfMemoryError exception stacktrace look like
> and what is the JVM parameters (heap size for example) used to launch
> this application?
>
> -Jaikiran
>