<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
<body dir="ltr">
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<div class="elementToProof"><span style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">Could the problem have occurred because the ForkJoinPool got an OOME when it tried to
allocate a ForkJoinWorkerThread?<br>
To check for that, if you're using the commonPool(), you might be able to add a custom ForkJoinWorkerThreadFactory by passing -Djava.util.concurrent.ForkJoinPool.common.threadFactory=&lt;insert fqcn of custom factory here&gt; and implementing newThread() such that
you try-catch OOME and log it from there.</span></div>
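<div>A minimal sketch of such a factory follows. The class name OomeLoggingFactory and the System.err logging are illustrative assumptions; it simply delegates to the JDK's default factory and reports any OutOfMemoryError thrown while the worker is being constructed. As far as I understand, the pool itself calls start() on the returned thread, so a failure at that point would not pass through newThread(), and logging while the heap is exhausted may itself fail.</div>

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinWorkerThread;

// Hypothetical name, for illustration only. To be picked up through the
// common-pool system property, the class would need to be declared public
// (with a public no-arg constructor) and be loadable by the system class loader.
class OomeLoggingFactory implements ForkJoinPool.ForkJoinWorkerThreadFactory {
    @Override
    public ForkJoinWorkerThread newThread(ForkJoinPool pool) {
        try {
            // Delegate the actual construction to the JDK's default factory.
            return ForkJoinPool.defaultForkJoinWorkerThreadFactory.newThread(pool);
        } catch (OutOfMemoryError e) {
            // Report the failure, then rethrow so the pool still sees it.
            System.err.println("OOME while creating a ForkJoinWorkerThread: " + e);
            throw e;
        }
    }
}
```

<div>For a quick check outside the common pool, the same factory can be handed to a pool created explicitly, for example new ForkJoinPool(2, new OomeLoggingFactory(), null, false).</div>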
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<div id="Signature">
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<b>Viktor Klang</b></div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Software Architect, Java Platform Group<br>
<div id="appendonsend"></div>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> core-libs-dev &lt;core-libs-dev-retn@openjdk.org&gt; on behalf of Xiao Yu &lt;cutefish.yx@gmail.com&gt;<br>
<b>Sent:</b> Saturday, 3 February 2024 19:54<br>
<b>To:</b> Jaikiran Pai &lt;jai.forums2013@gmail.com&gt;<br>
<b>Cc:</b> core-libs-dev@openjdk.org &lt;core-libs-dev@openjdk.org&gt;<br>
<b>Subject:</b> Re: The common ForkJoinPool does not have any ForkJoinWorkerThread while tasks are submitted to the queue</font>
<div> </div>
<div dir="ltr">
<div>Hi Jaikiran,</div>
<div>Thanks a lot for replying.<br>
Our application is a client that communicates with the server via<br>
request/response. The client creates a secure (TLS) connection to the server;<br>
that is, on top of the SocketChannel, we implement a wrapper class called<br>
SSLDataChannel for reading and writing. The SSLDataChannel uses the<br>
javax.net.ssl.SSLEngine. Before any read and write can happen, we need to do<br>
SSL handshakes by calling methods in SSLEngine. One of the methods is<br>
SSLEngine#getDelegatedTask(). The returned task needs to be executed before the<br>
handshake can proceed. After the task is done, we need to continue processing<br>
read and write events on the connection. The connection read and write events<br>
are all handled by a class called NioEndpointHandler. One requirement for our<br>
client is that it supports an asynchronous API, and therefore the whole stack<br>
must implement non-blocking methods. The tasks from the SSLEngine could<br>
take a long time, and we do not want them to block our other connection events;<br>
this is where the ForkJoinPool comes in. We run the SSL tasks in the<br>
ForkJoinPool and after the task is done we arrange to run the<br>
NioEndpointHandler callbacks to proceed with the read and write events. The<br>
much simplified code looks somewhat like the following.<br>
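class NioEndpointHandler {<br>
&nbsp;&nbsp;/** The SSL channel. */<br>
&nbsp;&nbsp;private final SSLDataChannel sslDataChannel;<br>
&nbsp;&nbsp;/** The runnable to execute to handle a read after the SSL tasks are done. */<br>
&nbsp;&nbsp;private final Runnable handleReadAfterSSLTask = () -> onRead();<br>
&nbsp;&nbsp;/** The handler state. */<br>
&nbsp;&nbsp;State state;<br>
&nbsp;&nbsp;/** Executes SSL tasks until there are none left to run, then runs the callback. */<br>
&nbsp;&nbsp;private void executeSSLTask(ExecutorService executor, Runnable callback) {<br>
&nbsp;&nbsp;&nbsp;&nbsp;executor.submit(() -> {<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Runnable task;<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;while ((task = sslDataChannel.getSSLEngine().getDelegatedTask()) != null) {<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;try {<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;task.run();<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;} catch (Throwable t) {<br>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;/* Log the exception. */<br>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;callback.run();<br>
&nbsp;&nbsp;&nbsp;&nbsp;});<br>
&nbsp;&nbsp;}<br>
&nbsp;&nbsp;/** Handles a read event. */<br>
&nbsp;&nbsp;private void onRead() {<br>
&nbsp;&nbsp;&nbsp;&nbsp;if (sslDataChannel.needsHandshake()) {<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;/* Do the handshake. */<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;/* One of the handshake steps is to check whether there is any SSL task to run. */<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if (sslDataChannel.needExecuteTask()) {<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;executeSSLTask(ForkJoinPool.commonPool(), handleReadAfterSSLTask);<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br>
&nbsp;&nbsp;&nbsp;&nbsp;}<br>
&nbsp;&nbsp;}<br>
&nbsp;&nbsp;private void terminate() {<br>
&nbsp;&nbsp;&nbsp;&nbsp;state = TERMINATED;<br>
&nbsp;&nbsp;&nbsp;&nbsp;/* Other cleanup tasks; however, tasks submitted to the ForkJoinPool are not cancelled. */<br>
&nbsp;&nbsp;}<br>
}<br>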
> What are these handlers? Are they classes which implement Runnable or<br>
> are they something else? What does termination of handler mean in this<br>
> context? Do you use any java.util.concurrent.* APIs to "cancel" such<br>
> terminated handlers?<br>
Please see the much simplified handler code above.<br>
The tasks submitted to the ForkJoinPool queue are Runnables that are fields of<br>
the NioEndpointHandler. What we have observed is that there are a lot of tasks<br>
in the fork join pool that hold a reference to the lambda inside<br>
NioEndpointHandler#executeSSLTask, which in turn holds a reference to the<br>
NioEndpointHandler. Those NioEndpointHandler objects are in the TERMINATED state. The<br>
only references to those NioEndpointHandler objects are these tasks; otherwise they<br>
could be garbage collected after the termination cleans up all the other references.
<div>Termination of the handler means those connections are at the end of their life<br>
cycle. We clean up things such as signaling the end of the life cycle for all the<br>
associated request/response pairs, closing the SSLDataChannel, etc.<br>
No, we have not used the cancel method to cancel the submitted tasks. I agree<br>
that this is an oversight and it would be cleaner to cancel them. However, my<br>
current theory is that this is not the root cause. From my understanding of the<br>
code, the cancel method only changes the state of the task. It does not remove<br>
the task from the queue of the ForkJoinPool. Therefore, those tasks, even if<br>
cancelled, would still stay in the queue, preventing the terminated<br>
NioEndpointHandler from being garbage collected. Currently, I am strongly<br>
biased toward my own theory that somehow no ForkJoinPool thread is<br>
polling tasks out of the queue, and I am trying to use the ctl field in the<br>
ForkJoinPool as evidence to back up my theory. I am wondering if I am making<br>
some mistake with my theory.<br>
> Finally, what does the OutOfMemoryError exception stacktrace look like<br>
> and what is the JVM parameters (heap size for example) used to launch<br>
> this application?<br>
Our client creates about 155 threads and quite a lot of them have OOME on<br>
their stack. I am not quite sure how to reply to this question. Going through<br>
the stack traces, I do not find anything very suspicious. They are just<br>
exercising their most frequent code path: some I/O threads waiting for I/O<br>
events and some execution threads waiting for more work to do, etc.<br>
It is worth mentioning that there are no ForkJoinWorkerThread stacks in the<br>
thread dump from the heap dump. From my understanding, the only time when there<br>
are no such threads is when there are no tasks to run. But there are quite a lot<br>
of tasks in the queue.<br>
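<div>As a rough illustration of how the same condition (queued work but no worker threads) might be watched for in a live process rather than read out of the ctl field in a heap dump, the sketch below collects the pool's public monitoring accessors into one line. The class name PoolStats and the output format are assumptions made for this example; the values are snapshots and only approximate.</div>

```java
import java.util.concurrent.ForkJoinPool;

// Hypothetical helper, for illustration only.
final class PoolStats {
    private PoolStats() {}

    /** Returns a one-line snapshot of the pool's public monitoring counters. */
    static String describe(ForkJoinPool pool) {
        return "parallelism=" + pool.getParallelism()
            + " poolSize=" + pool.getPoolSize()                   // worker threads started and not yet terminated
            + " active=" + pool.getActiveThreadCount()            // estimate of threads running or stealing tasks
            + " queuedSubmissions=" + pool.getQueuedSubmissionCount() // submitted tasks not yet begun
            + " queuedTasks=" + pool.getQueuedTaskCount()          // tasks held in worker queues
            + " hasQueuedSubmissions=" + pool.hasQueuedSubmissions();
    }
}
```

<div>Logging PoolStats.describe(ForkJoinPool.commonPool()) periodically would show whether poolSize stays at 0 while queuedSubmissions keeps growing.</div>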
Here are our JVM arguments:<br>
We have creation and termination timestamps in the NioEndpointHandler object.<br>
From what I can see in the heap dump, the SSL tasks in the ForkJoinPool are<br>
associated with NioEndpointHandler objects that were created at intervals on the<br>
order of seconds (retry attempts with second-magnitude backoff). Each<br>
NioEndpointHandler is terminated after a fixed 5-second timeout because it is unable<br>
to connect. The time span for those NioEndpointHandler objects is about 2 hours. This amounts to<br>
2 hours * 3600 seconds / hour * 1 NioEndpointHandler / second * 1 SSLDataChannel / NioEndpointHandler * 65K bytes / SSLDataChannel ~= 468M bytes.<br>
With a 1G heap size, this eventually caused the OOME.<br>
We are adding fixes so that the SSL tasks do not prevent the<br>
NioEndpointHandler from being garbage collected. However, the root cause is<br>
still a mystery and I am wondering if I am on the right track to figure it out.<br>
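<div>One possible shape for keeping a queued task from pinning a terminated handler, sketched under assumptions: the queued task holds the handler only through a WeakReference and does nothing if the handler is gone or terminated. EndpointHandler and WeakHandlerTask are hypothetical stand-ins, not the real NioEndpointHandler, and the SSLEngine delegated-task loop is omitted.</div>

```java
import java.lang.ref.WeakReference;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the handler described in this thread.
class EndpointHandler {
    volatile boolean terminated;
    final AtomicInteger reads = new AtomicInteger();

    void onRead() {
        reads.incrementAndGet();
    }
}

// A task that refers to its handler only weakly, so sitting in an executor
// queue does not by itself keep the handler (and its buffers) reachable.
final class WeakHandlerTask implements Runnable {
    private final WeakReference<EndpointHandler> ref;

    WeakHandlerTask(EndpointHandler handler) {
        this.ref = new WeakReference<>(handler);
    }

    @Override
    public void run() {
        EndpointHandler handler = ref.get();
        if (handler == null || handler.terminated) {
            return; // handler already collected or terminated: nothing to do
        }
        handler.onRead();
    }
}
```

<div>The trade-off is that something else must keep a live handler strongly reachable while its task is pending, otherwise the callback is silently skipped.</div>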
Thanks a lot for your time and patience.<br>
<div dir="ltr" class="x_gmail_signature" data-smartmail="gmail_signature">Xiao Yu</div>
<div class="x_gmail_quote">
<div dir="ltr" class="x_gmail_attr">On Fri, Feb 2, 2024 at 5:35 AM Jaikiran Pai &lt;<a href="mailto:jai.forums2013@gmail.com">jai.forums2013@gmail.com</a>&gt; wrote:<br>
<blockquote class="x_gmail_quote" style="margin:0px 0px 0px 0.8ex; border-left:1px solid rgb(204,204,204); padding-left:1ex">
Hello Xiao,<br>
I don't have enough knowledge of this area to provide any insight into <br>
the issue. However, just to try and get the discussion started, do you <br>
have any sample code of your application which shows how the application <br>
uses the ForkJoinPool? More specifically what APIs do you use in the <br>
Few other questions inline below.<br>
On 12/01/24 11:30 am, Xiao Yu wrote:<br>
> ....<br>
> Here is the full background. One of our process experienced an OOME and a heap<br>
> dump was obtained. We know there was a concurrent issue of our system happening<br>
> on some other machines such that network failure and retries occurred in this<br>
> process at the same time. Upon analyzing the heap dump, we observed a lot of<br>
> our network connection handlers being frequently created and terminated which<br>
> is expected due to the network failure and retry attempts mentioned above.<br>
> However, those terminated handlers are not being GC'ed because of there were<br>
> references to tasks submitted to the ForkJoinPool during the connection<br>
> attempts. The tasks stayed in the queue until OOME happened as there is no<br>
> threads to execute them.<br>
What are these handlers? Are they classes which implement Runnable or <br>
are they something else? What does termination of handler mean in this <br>
context? Do you use any java.util.concurrent.* APIs to "cancel" such <br>
terminated handlers?<br>
Finally, what does the OutOfMemoryError exception stacktrace look like <br>
and what is the JVM parameters (heap size for example) used to launch <br>
this application?<br>