The common ForkJoinPool does not have any ForkJoinWorkerThread while tasks are submitted to the queue

Viktor Klang viktor.klang at oracle.com
Mon Feb 12 15:13:08 UTC 2024


Hi,

Could the problem have occurred because the ForkJoinPool got an OOME when it tried to allocate a ForkJoinWorkerThread?

To check for that, if you're using the commonPool(), you might be able to add a custom ForkJoinWorkerThreadFactory by passing in -Djava.util.concurrent.ForkJoinPool.common.threadFactory=<insert fqcn of custom factory here> and implement newThread() such that you try-catch the OOME and log it from there.
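
A minimal sketch of such a factory (the class name and package are hypothetical; it delegates to the default factory and only adds logging) might look like this:

```
package com.example;

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinWorkerThread;

/**
 * Delegates to the default worker thread factory and logs any OOME raised
 * while allocating a worker thread. The class needs a public no-arg
 * constructor and must be loadable by the system class loader, e.g.
 * -Djava.util.concurrent.ForkJoinPool.common.threadFactory=com.example.OomeLoggingWorkerThreadFactory
 */
public class OomeLoggingWorkerThreadFactory
        implements ForkJoinPool.ForkJoinWorkerThreadFactory {

    @Override
    public ForkJoinWorkerThread newThread(ForkJoinPool pool) {
        try {
            return ForkJoinPool.defaultForkJoinWorkerThreadFactory.newThread(pool);
        } catch (OutOfMemoryError oome) {
            // Make the failed thread allocation visible before rethrowing.
            System.err.println("Failed to create ForkJoinWorkerThread: " + oome);
            throw oome;
        }
    }
}
```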

Cheers,
√


Viktor Klang
Software Architect, Java Platform Group
Oracle
________________________________
From: core-libs-dev <core-libs-dev-retn at openjdk.org> on behalf of Xiao Yu <cutefish.yx at gmail.com>
Sent: Saturday, 3 February 2024 19:54
To: Jaikiran Pai <jai.forums2013 at gmail.com>
Cc: core-libs-dev at openjdk.org <core-libs-dev at openjdk.org>
Subject: Re: The common ForkJoinPool does not have any ForkJoinWorkerThread while tasks are submitted to the queue

Hi Jaikiran,

Thanks a lot for replying.

Our application is a client that communicates with the server in a
request/response pattern. The client creates a secure (TLS) connection to the
server; that is, on top of the SocketChannel we implement a wrapper class
called SSLDataChannel for reading and writing. The SSLDataChannel uses
javax.net.ssl.SSLEngine. Before any read or write can happen, we need to do
SSL handshakes by calling methods on the SSLEngine. One of these methods is
SSLEngine#getDelegatedTask(). The returned task needs to be executed before
the handshake can proceed. After the task is done, we need to continue
processing read and write events on the connection. The connection read and
write events are all handled by a class called NioEndpointHandler. One
requirement for our client is that it supports an asynchronous API, and
therefore the whole stack must be implemented with non-blocking methods. The
tasks from the SSLEngine could take a long time and we do not want them to
block our other connection events, which is where the ForkJoinPool is used.
We run the SSL tasks in the ForkJoinPool and, after a task is done, we
arrange to run the NioEndpointHandler callbacks to proceed with the read and
write events. The much simplified code looks somewhat like the following.

```
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ForkJoinPool;

class NioEndpointHandler {

    enum State { ACTIVE, TERMINATED }

    /** The SSL channel. */
    private final SSLDataChannel sslDataChannel;
    /** The runnable to execute to handle a read after the SSL tasks are done. */
    private final Runnable handleReadAfterSSLTask = () -> onRead();
    /** The handler state. */
    private State state = State.ACTIVE;

    NioEndpointHandler(SSLDataChannel sslDataChannel) {
        this.sslDataChannel = sslDataChannel;
    }

    /** Executes the SSL tasks until there is no task left to run, then runs the callback. */
    private void executeSSLTask(ExecutorService executor, Runnable callback) {
        executor.submit(() -> {
            Runnable task;
            while ((task = sslDataChannel.getSSLEngine().getDelegatedTask()) != null) {
                task.run();
            }
            try {
                callback.run();
            } catch (Throwable t) {
                /* log the exception */
            }
        });
    }

    /** Handles a read event. */
    private void onRead() {
        if (sslDataChannel.needsHandshake()) {
            /* do handshake */

            /* One of the handshake steps is to check whether there is any SSL task to run. */
            if (sslDataChannel.needExecuteTask()) {
                executeSSLTask(ForkJoinPool.commonPool(), handleReadAfterSSLTask);
            }
        }
    }

    private void terminate() {
        state = State.TERMINATED;

        /* Other clean-up tasks; note, however, that tasks already submitted to the ForkJoinPool are not cancelled. */
    }
}
```

> What are these handlers? Are they classes which implement Runnable or
> are they something else? What does termination of handler mean in this
> context? Do you use any java.util.concurrent.* APIs to "cancel" such
> terminated handlers?

Please see the much simplified handler code above.

The tasks submitted to the ForkJoinPool queue are Runnables that are fields
of the NioEndpointHandler. What we have observed is that there are a lot of
tasks in the fork join pool that have a reference to the lambda inside
NioEndpointHandler#executeSSLTask, which eventually has a reference to the
NioEndpointHandler. Those NioEndpointHandlers are in the TERMINATED state.
The only references to those NioEndpointHandlers are these tasks; otherwise
they could be garbage collected, since termination cleans up all the other
references.
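
For what it is worth, the reason the queued task transitively retains the
handler is ordinary lambda capture: a lambda that calls an instance method
captures "this". A tiny illustration, using the names from the simplified
code above:

```
class NioEndpointHandler {
    /*
     * () -> onRead() is shorthand for () -> this.onRead(), so the Runnable
     * captures 'this'. Any queue that holds this Runnable therefore also
     * holds the NioEndpointHandler and everything it references, such as
     * the SSLDataChannel and its buffers.
     */
    private final Runnable handleReadAfterSSLTask = () -> onRead();

    private void onRead() {
        /* handle the read event */
    }
}
```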

Termination of the handler means those connections are at the end of their
life cycle. We clean things up, such as signaling the end of the life cycle
for all the associated request/response pairs, closing the SSLDataChannel,
etc.

No, we have not used the cancel method to cancel the submitted tasks. I agree
that this is an oversight and it would be cleaner to cancel them. However, my
current theory is that this is not the root cause. From my understanding of
the code, the cancel method only changes the state of the task; it does not
remove the task from the queue of the ForkJoinPool. Therefore, those tasks,
even if cancelled, would still stay in the queue, preventing the terminated
NioEndpointHandlers from being garbage collected. Currently, I am strongly
biased towards my own theory that somehow there is no ForkJoinPool thread
polling tasks out of the queue, and I am trying to use the ctl field in the
ForkJoinPool as evidence to back up my theory. I am wondering if I am making
some mistake with my theory.
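
To illustrate that point with a minimal sketch (the comments reflect my
reading of the implementation, not a documented guarantee):

```
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;

public class CancelSketch {
    public static void main(String[] args) {
        // Cancelling marks the task as cancelled, but the task object stays
        // referenced from the pool's submission queue until a worker thread
        // dequeues it, so everything the Runnable captures (for example a
        // terminated NioEndpointHandler) stays strongly reachable until then.
        ForkJoinTask<?> task = ForkJoinPool.commonPool().submit(() -> {
            /* stand-in for the SSL delegated work plus callback */
        });
        boolean cancelled = task.cancel(false);
        System.out.println("cancelled=" + cancelled
                + ", isCancelled=" + task.isCancelled());
    }
}
```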

> Finally, what does the OutOfMemoryError exception stacktrace look like
> and what are the JVM parameters (heap size, for example) used to launch
> this application?

Our client creates about 155 threads and quite a lot of them have an OOME on
their stack. I am not quite sure how to reply to this question. Going through
the stack traces, I do not find anything very suspicious. They are just
exercising their most frequent code paths: some I/O threads waiting for I/O
events, some execution threads waiting for more work to do, etc.

It is worth mentioning that there are no ForkJoinWorkerThread stacks in the
thread dump obtained from the heap dump. From my understanding, the only time
when there is no such thread is when there are no tasks to run. But there are
quite a lot of tasks in the queue.
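
As a side note, the same condition can be probed at runtime with the public
monitoring methods on ForkJoinPool (a minimal sketch; the class name and log
format are just illustrative):

```
import java.util.concurrent.ForkJoinPool;

public class CommonPoolProbe {
    public static void main(String[] args) {
        ForkJoinPool pool = ForkJoinPool.commonPool();
        // Queued submissions together with zero started worker threads would
        // match the state we believe we are seeing in the heap dump.
        System.out.println("queued submissions=" + pool.getQueuedSubmissionCount()
                + ", worker threads=" + pool.getPoolSize()
                + ", parallelism=" + pool.getParallelism());
    }
}
```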

Here are our JVM arguments:

```
-Xms1G
-Xmx1G
-Djava.util.logging.config.file=/var/lib/andc/config/params/sender.logging.properties
-Djavax.net.ssl.trustStore=/var/lib/andc/wallet/client.trust
-Doracle.kv.security=/var/lib/andc/config/security/login.properties
-Doci.javasdk.extra.stream.logs.enabled=false
-XX:G1HeapRegionSize=32m
-XX:+DisableExplicitGC
-Xlog:all=warning,gc*=info,safepoint=info:file=/var/lib/andc/log/sender/sender.gc:utctime:filecount=10,filesize=10000000
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/lib/andc/log/sender/
```

We have creation and termination timestamps in the NioEndpointHandler
objects. From what I can see in the heap dump, the SSL tasks in the
ForkJoinPool are associated with NioEndpointHandlers that were created at an
interval on the order of seconds (retry attempts with second-magnitude
backoff). Each NioEndpointHandler was terminated after a fixed 5-second
timeout because it was unable to connect. The time span for those
NioEndpointHandlers is about 2 hours. This amounts to
```
2 hours * 3600 seconds / hour * 1 NioEndpointHandler / second  * 1 SSLDataChannel / NioEndpointHandler * 65K bytes / SSLDataChannel ~= 468M bytes.
```
With a 1G heap size, this eventually caused the OOME.

We are adding fixes so that the SSL tasks will not prevent the
NioEndpointHandler from being garbage collected. However, the root cause is
still a mystery and I am wondering if I am on the right track to figure it
out.

Thanks a lot for your time and patience.



Xiao Yu


On Fri, Feb 2, 2024 at 5:35 AM Jaikiran Pai <jai.forums2013 at gmail.com> wrote:
Hello Xiao,

I don't have enough knowledge of this area to provide any insight into
the issue. However, just to try and get the discussion started, do you
have any sample code which shows how the application uses the
ForkJoinPool? More specifically, what APIs do you use in the application?

A few other questions inline below.

On 12/01/24 11:30 am, Xiao Yu wrote:
> ....
> Here is the full background. One of our processes experienced an OOME
> and a heap dump was obtained. We know there was a concurrent issue in
> our system happening on some other machines, such that network failures
> and retries occurred in this process at the same time. Upon analyzing
> the heap dump, we observed a lot of our network connection handlers
> being frequently created and terminated, which is expected due to the
> network failures and retry attempts mentioned above. However, those
> terminated handlers were not being GC'ed because there were references
> to tasks submitted to the ForkJoinPool during the connection attempts.
> The tasks stayed in the queue until the OOME happened, as there were no
> threads to execute them.

What are these handlers? Are they classes which implement Runnable or
are they something else? What does termination of handler mean in this
context? Do you use any java.util.concurrent.* APIs to "cancel" such
terminated handlers?

Finally, what does the OutOfMemoryError exception stacktrace look like
and what are the JVM parameters (heap size, for example) used to launch
this application?

-Jaikiran