JEP 132: More-prompt finalization
Peter Levart
peter.levart at gmail.com
Thu May 28 17:12:14 UTC 2015
Hi,
Did you know that the following simple loop:
public class FinalizableBottleneck {
    static boolean no;

    @Override
    protected void finalize() throws Throwable {
        // an empty finalize() method does not make the object finalizable
        // (it is not even registered on the finalizer's list)
        if (no) {
            throw new AssertionError();
        }
    }

    public static void main(String[] args) {
        while (true) {
            new FinalizableBottleneck();
        }
    }
}
...quickly fills the entire heap with FinalizableBottleneck and internal
Finalizer objects and brings the JVM to a halt? After a few seconds of
running the above program, jmap -histo:live reports:
 num     #instances         #bytes  class name
----------------------------------------------
   1:      50048325     2001933000  java.lang.ref.Finalizer
   2:      50048278      800772448  FinalizableBottleneck
There are a couple of bottlenecks that make this happen:
- The ReferenceHandler thread synchronizes with the VM to unhook Reference(s)
from the pending chain one by one and dispatches them to their respective
ReferenceQueue(s), which also use synchronization for enqueueing each
Reference.
- Enqueueing synchronizes with the finalization thread, which removes the
Finalizer(s) (FinalReferences) from the finalization queue and executes
them.
- Executing the Finalizer(s) removes them from the doubly-linked list of
all Finalizer(s), which is used to retain them until they are needed, and
this synchronizes with the threads that link new Finalizer(s) into the
doubly-linked list as new finalizable objects get registered.
We see that the creation of a finalizable object takes only one
synchronization (registering into the doubly-linked list) and is
performed synchronously, while finalization takes 4 synchronizations
among 4 different threads (in pairs) and happens as the Finalizer
instance "travels" from the VM thread to the ReferenceHandler thread and
then to the finalization thread. No wonder finalization cannot keep up
with allocation in a single thread. The situation is even worse when
finalize() methods do some actual work.
I have experimented with various approaches to widen these bottlenecks
and found that I cannot beat the ForkJoinPool when it is combined with
some improvements to the internal data structures used in reference
processing. Here's a prototype I came up with:
http://cr.openjdk.java.net/~plevart/misc/JEP132/ReferenceHandling/webrev.01/
And this is the benchmark I use for measuring the throughput:
http://cr.openjdk.java.net/~plevart/misc/JEP132/ReferenceHandling/FinalizerThroughput.java
The benchmark shows (results inline in the source) that with an unpatched
JDK, on my PC (i7-2700K, Linux, JDK 8), I cannot construct more than 1500
finalizable objects per ms in a single thread, and that while doing so,
finalization only manages to process approx. 100-120 objects in the same
time. Objects "in-flight" quickly accumulate and bring the VM to a halt,
where it is not doing anything but full GC cycles.
When constructing in 4 threads, there's not much difference.
Construction of finalizable objects simply doesn't scale.
The patched JDK shows something completely different. Single-thread
construction achieves a rate of 3600 objects/ms. The number of "in-flight"
objects stays constant at about 5-6M instances, which amounts to approx.
1.5 s of allocation. I think this corresponds to the rate of GC cycles,
during which the VM also processes the references. The benchmark also
prints ForkJoinPool statistics, which show that the number of queued
tasks is kept low as well.
Increasing the number of allocation threads to 4 increases the allocation
rate to about 4300 objects/ms and finalization keeps up. Increasing the
number of allocation threads to 8 further increases the allocation rate
to about 4600 objects/ms and finalization still keeps up. The increase in
rate is not linear, but keep in mind that the i7-2700K is a 4-core CPU.
About the implementation...
The first improvement was to the doubly-linked list of Finalizer
instances that is used to keep them alive until they are needed. I
ripped off the wonderful ConcurrentLinkedDeque by Doug Lea and Martin
Buchholz and just kept the internal link/unlink methods while
specializing them to Finalizer entries (very straightforward). I
experimented with throughput and got some improvement, but throughput
increased much more when I used several instances of independent
lists and distributed registrations among them randomly (unlinking is
consequently also distributed randomly).
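For illustration, here is a minimal sketch of the striping idea, not the
patch's actual code: the class and field names below are made up, and a
plain lock per stripe stands in for the lock-free ConcurrentLinkedDeque
style link/unlink used in the webrev.

import java.util.concurrent.ThreadLocalRandom;

class StripedFinalizerList<T> {

    static final class Node<T> {
        final T item;
        final Object stripeLock;   // lock of the stripe this node is linked into
        Node<T> prev, next;
        Node(T item, Object stripeLock) { this.item = item; this.stripeLock = stripeLock; }
    }

    private static final int STRIPES = 8;   // assumed stripe count, not from the patch
    private final Node<T>[] heads;          // sentinel head node per stripe

    @SuppressWarnings("unchecked")
    StripedFinalizerList() {
        heads = (Node<T>[]) new Node[STRIPES];
        for (int i = 0; i < STRIPES; i++) {
            heads[i] = new Node<>(null, new Object());
        }
    }

    // register into a randomly chosen stripe; contention is spread over STRIPES locks
    Node<T> add(T item) {
        Node<T> head = heads[ThreadLocalRandom.current().nextInt(STRIPES)];
        Node<T> node = new Node<>(item, head.stripeLock);
        synchronized (head.stripeLock) {
            node.next = head.next;
            node.prev = head;
            if (head.next != null) head.next.prev = node;
            head.next = node;
        }
        return node;
    }

    // unlink from whichever stripe the node was registered in (e.g. after finalization)
    void remove(Node<T> node) {
        synchronized (node.stripeLock) {
            node.prev.next = node.next;
            if (node.next != null) node.next.prev = node.prev;
            node.prev = node.next = null;
        }
    }
}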
I found out that no matter how hard I try to optimize ReferenceQueue
while keeping the API unchanged, I can only do so much, and that was not
enough. I have been surprised by how well ForkJoinPool distributes tasks
among threads, so I concluded that leveraging it is the best choice. I
redesigned the pending-list unhooking loop to unhook pending references
in chunks, which greatly improves the throughput. Since unhooking can
only be performed by a single thread while holding a lock, which is
mandated by the interface between the VM and Java, I didn't employ
multiple threads, but a single eternal ForkJoinTask that unhooks in
chunks and forks off other processing tasks that process those chunks.
When there are just a couple of References pending at one time and a
less-than-full chunk is unhooked, the processing is performed by the same
thread that unhooked the references, but when there are more, worker
tasks are forked off and the unhooking thread continues undisturbed.
This processing includes execution of Cleaners, forking of finalizer
tasks and enqueueing of other references. Finalizer(s) are always
executed as separate ForkJoinTask(s).
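To give a rough idea of the shape of that loop, here is a hedged sketch.
The chunk size, the Ref interface and the pending-chain accessor are all
hypothetical stand-ins; the actual code is in the webrev above.

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

class ReferenceHandlingSketch {

    static final int CHUNK_SIZE = 64;          // assumed chunk size
    static final Object lock = new Object();   // stands in for the VM-mandated lock

    // stands in for "process one unhooked reference": run a Cleaner,
    // fork a finalizer task, or enqueue the reference
    interface Ref { void process(); }

    // stand-in for pulling up to max references off the VM's pending chain
    // while holding the lock; this sketch just blocks as if the chain were empty
    static Ref[] unhookPendingChunk(int max) throws InterruptedException {
        synchronized (lock) {
            lock.wait();
            return new Ref[0];
        }
    }

    static final RecursiveAction unhooker = new RecursiveAction() {
        @Override
        protected void compute() {
            try {
                while (true) {
                    Ref[] chunk = unhookPendingChunk(CHUNK_SIZE);
                    if (chunk.length < CHUNK_SIZE) {
                        // few references pending: process the small chunk right here
                        for (Ref r : chunk) r.process();
                    } else {
                        // full chunk: fork a worker and keep unhooking undisturbed
                        ForkJoinPool.commonPool().execute(() -> {
                            for (Ref r : chunk) r.process();
                        });
                    }
                }
            } catch (InterruptedException e) {
                // terminate the eternal task (sketch only)
            }
        }
    };

    public static void main(String[] args) {
        // the single eternal unhooking task
        ForkJoinPool.commonPool().execute(unhooker);
    }
}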
It's interesting how Runtime.runFinalizers() is implemented in this
patch - it basically amounts to ForkJoinPool.awaitQuiescence() ...
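A minimal sketch of what that could look like, assuming all
reference-processing and finalizer tasks are submitted to one dedicated
pool (the REF_POOL field below is hypothetical, not the patch's actual
name):

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.TimeUnit;

class RunFinalizationSketch {
    // hypothetical: the pool that runs all reference-processing and finalizer tasks
    static final ForkJoinPool REF_POOL = new ForkJoinPool();

    static void runFinalization() {
        // "run pending finalizers" then simply means: wait until the pool is quiet
        REF_POOL.awaitQuiescence(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
    }
}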
I also tweaked the ReferenceQueue implementation a bit (it is still used
for other kinds of references) so that it avoids synchronization with a
monitor lock when there are no blocking waiters and uses CAS to
enqueue/dequeue. This improves throughput when the queue is not empty.
Since in the prototype multiple threads can enqueue into the same queue,
I thought this would improve throughput in such situations.
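Roughly, the idea is something like the following sketch (hypothetical
names again, not the patch's ReferenceQueue code): a lock-free push/pop
for enqueue/dequeue, with the monitor only entered to notify when a
waiter count indicates that some thread is blocked in remove().

import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

class LockAvoidingQueue<T> {

    static final class Node<T> {
        final T item;
        Node<T> next;
        Node(T item) { this.item = item; }
    }

    private final AtomicReference<Node<T>> head = new AtomicReference<>();
    private final AtomicInteger waiters = new AtomicInteger();
    private final Object lock = new Object();

    void enqueue(T item) {
        Node<T> n = new Node<>(item);
        Node<T> h;
        do {
            h = head.get();
            n.next = h;
        } while (!head.compareAndSet(h, n));       // lock-free push
        if (waiters.get() > 0) {                   // only touch the monitor if someone waits
            synchronized (lock) { lock.notifyAll(); }
        }
    }

    T poll() {                                     // non-blocking dequeue
        Node<T> h;
        do {
            h = head.get();
            if (h == null) return null;
        } while (!head.compareAndSet(h, h.next));  // lock-free pop (nodes are never reused)
        return h.item;
    }

    T remove() throws InterruptedException {       // blocking dequeue
        T item = poll();
        if (item != null) return item;
        waiters.incrementAndGet();
        try {
            synchronized (lock) {
                T it;
                while ((it = poll()) == null) {
                    lock.wait();
                }
                return it;
            }
        } finally {
            waiters.decrementAndGet();
        }
    }
}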
Comments, suggestions, criticism are welcome.
Regards, Peter