JEP 132: More-prompt finalization

Peter Levart peter.levart at gmail.com
Thu May 28 17:12:14 UTC 2015


Hi,

Did you know that the following simple loop:

public class FinalizableBottleneck {
     static boolean no;

     @Override
     protected void finalize() throws Throwable {
         // empty finalize() method does not make the object finalizable
         // (it is not even registered on the finalizer's list)
         if (no) {
             throw new AssertionError();
         }
     }

     public static void main(String[] args) {
         while (true) {
             new FinalizableBottleneck();
         }
     }
}


...quickly fills the entire heap with FinalizableBottleneck and internal 
Finalizer objects and brings the JVM to a halt? After a few seconds of 
running the above program, jmap -histo:live reports:

  num     #instances         #bytes  class name
----------------------------------------------
    1:      50048325     2001933000  java.lang.ref.Finalizer
    2:      50048278      800772448  FinalizableBottleneck


There are a few bottlenecks that make this happen:

- The ReferenceHandler thread synchronizes with the VM to unhook Reference(s) 
from the pending chain one by one and dispatches them to their respective 
ReferenceQueue(s), which also use synchronization for enqueueing each 
Reference.
- Enqueueing synchronizes with the finalization thread, which removes the 
Finalizer(s) (FinalReferences) from the finalization queue and executes 
them.
- Executing the Finalizer(s) removes them from the doubly-linked list of 
all Finalizer(s), which is used to keep them alive until they are needed, 
and this synchronizes with the threads that link new Finalizer(s) into the 
doubly-linked list as new finalizable objects get registered.

We see that the creation of a finalizable object takes only one 
synchronization (registering into the doubly-linked list) and is 
performed synchronously, while finalization takes 4 synchronizations 
among 4 different threads (in pairs) and happens as the Finalizer 
instance "travels" from the VM thread to the ReferenceHandler thread and 
then to the finalization thread. No wonder finalization cannot keep up 
with allocation even from a single thread. The situation is even worse 
when finalize() methods do some actual work.
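
To make the per-element cost concrete, here is a deliberately simplified 
illustration (my own stand-in, not the actual JDK code) of one such 
hand-off: every single element pays a monitor acquisition on the producer 
side and another on the consumer side, plus a wakeup of the consuming 
thread:

import java.util.ArrayDeque;

// Simplified stand-in for a per-element, lock-based hand-off between the
// thread that discovers work and the thread that executes it.
class HandOffQueue<T> {
    private final ArrayDeque<T> deque = new ArrayDeque<>();

    synchronized void enqueue(T element) {   // one monitor acquisition per element
        deque.addLast(element);
        notifyAll();                         // wake the consuming thread
    }

    synchronized T dequeue() throws InterruptedException {
        while (deque.isEmpty()) {
            wait();                          // consumer blocks until work arrives
        }
        return deque.removeFirst();
    }
}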

I have experimented with various approaches to widen these bottlenecks 
and found that I cannot beat the ForkJoinPool when it is combined with 
some improvements to the internal data structures used in reference 
processing. Here's the prototype I came up with:

http://cr.openjdk.java.net/~plevart/misc/JEP132/ReferenceHandling/webrev.01/

And this is the benchmark I use for measuring the throughput:

http://cr.openjdk.java.net/~plevart/misc/JEP132/ReferenceHandling/FinalizerThroughput.java

The benchmark shows (results inline in the source) that with an unpatched 
JDK on my PC (i7-2700K, Linux, JDK 8) I cannot construct more than 1500 
finalizable objects per ms in a single thread, and that while doing so, 
finalization only manages to process approx. 100-120 objects in the same 
time. Objects "in flight" quickly accumulate and bring the VM to a halt, 
where it is not doing anything but full GC cycles.

When constructing in 4 threads, there's not much difference. 
Construction of finalizable objects simply doesn't scale.

The patched JDK shows something completely different. Single-threaded 
construction achieves a rate of 3600 objects / ms. The number of 
"in-flight" objects stays constant at about 5-6M instances, which amounts 
to approx. 1.5 s of allocation. I think this is about the rate of GC 
cycles during which the VM also processes the references. The benchmark 
also prints ForkJoinPool statistics, which show that the number of queued 
tasks is kept low as well.

Increasing the number of allocating threads to 4 increases the allocation 
rate to about 4300 objects / ms, and finalization keeps up. Increasing it 
to 8 threads further increases the allocation rate to about 4600 objects / 
ms, and finalization still keeps up. The increase in rate is not linear, 
but keep in mind that the i7 is a 4-core CPU.

About the implementation...

The 1st improvement I made was to the doubly-linked list of Finalizer 
instances that is used to keep them alive until they are needed. I 
ripped off the wonderful ConcurrentLinkedDeque by Doug Lea and Martin 
Buchholz, keeping just the internal link/unlink methods and specializing 
them to Finalizer entries (very straightforward). I experimented with 
throughput and got some improvement, but throughput increased much more 
when I used several independent list instances and distributed 
registrations among them randomly (unlinking is consequently also 
distributed randomly).
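
Here is a minimal sketch of the striping idea (a made-up class, not the 
prototype's code: the prototype specializes ConcurrentLinkedDeque's 
internal link/unlink methods to Finalizer entries and lets each Finalizer 
unlink its own node directly, rather than using the general-purpose deque 
as below):

import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.ThreadLocalRandom;

// Registrations are spread over several independent lists so that
// concurrent link/unlink operations rarely contend on the same list.
class StripedRegistry<T> {
    private final ConcurrentLinkedDeque<T>[] stripes;

    @SuppressWarnings("unchecked")
    StripedRegistry(int stripeCount) {
        stripes = new ConcurrentLinkedDeque[stripeCount];
        for (int i = 0; i < stripeCount; i++) {
            stripes[i] = new ConcurrentLinkedDeque<>();
        }
    }

    // Pick a stripe at random so registering threads spread their traffic.
    void register(T entry) {
        stripes[ThreadLocalRandom.current().nextInt(stripes.length)].add(entry);
    }

    // Removals are consequently distributed across stripes as well; the
    // prototype unlinks the remembered node in O(1) instead of scanning.
    void unregister(T entry) {
        for (ConcurrentLinkedDeque<T> stripe : stripes) {
            if (stripe.remove(entry)) {
                return;
            }
        }
    }
}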

I found out that no matter how hard I tried to optimize ReferenceQueue 
while keeping the API unchanged, I could only do so much, and that was 
not enough. I have been surprised by how well ForkJoinPool distributes 
tasks among threads, so I concluded that leveraging it is the best 
choice. I redesigned the pending-list unhooking loop to unhook pending 
references in chunks, which greatly improves throughput. Since unhooking 
can only be performed by a single thread while holding a lock, which is 
mandated by the interface between the VM and Java, I didn't employ 
multiple threads, but a single eternal ForkJoinTask that unhooks in 
chunks and forks off other tasks that process those chunks. When there 
are just a couple of References pending at one time and a not-full chunk 
is unhooked, the processing is performed by the same thread that unhooked 
the references, but when there are more, worker tasks are forked off and 
the unhooking thread continues undisturbed. This processing includes 
execution of Cleaners, forking of finalizer tasks and enqueueing of other 
references. Finalizer(s) are always executed as separate ForkJoinTask(s).
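
Roughly, the dispatch loop looks like the following sketch (the 
placeholders and names are made up; in the prototype the loop is an 
eternal ForkJoinTask rather than a plain Runnable, unhookChunk() stands 
in for the lock-protected unhooking from the VM's pending chain, and 
processChunk() stands in for running Cleaners, forking finalizer tasks 
and enqueueing the remaining references):

import java.lang.ref.Reference;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;

class PendingDispatcher implements Runnable {
    static final int CHUNK_SIZE = 64;             // assumed chunk size
    private final ForkJoinPool pool;

    PendingDispatcher(ForkJoinPool pool) {
        this.pool = pool;
    }

    @Override
    public void run() {
        while (true) {
            List<Reference<?>> chunk = unhookChunk(CHUNK_SIZE);
            if (chunk.size() < CHUNK_SIZE) {
                // only a few references were pending: process them inline
                processChunk(chunk);
            } else {
                // a full chunk: fork off a worker and keep unhooking
                pool.submit(() -> processChunk(chunk));
            }
        }
    }

    // Placeholder for the single-threaded, lock-protected unhooking of up
    // to 'max' references from the pending chain (would block when empty).
    private List<Reference<?>> unhookChunk(int max) {
        return new ArrayList<>();
    }

    // Placeholder for executing Cleaners, forking Finalizer tasks and
    // enqueueing other references.
    private void processChunk(List<Reference<?>> chunk) {
    }
}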

It's interesting how Runtime.runFinalizers() is implemented in this 
patch - it basically amounts to ForkJoinPool.awaitQuiescence() ...
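
A minimal sketch of that idea, assuming all finalizer work runs as tasks 
in one dedicated pool (the class, method and parameter names below are 
made up):

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.TimeUnit;

class FinalizationSupport {
    // Waits until the pool has no running or queued tasks (or the timeout
    // elapses), i.e. until all currently pending finalizers have been run.
    static void runAllFinalizers(ForkJoinPool finalizerPool) {
        finalizerPool.awaitQuiescence(1L, TimeUnit.MINUTES);
    }
}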

I also tweaked the ReferenceQueue implementation a bit (it is still used 
for other kinds of references) so that it avoids synchronizing on a 
monitor lock when there are no blocking waiters and uses CAS to 
enqueue/dequeue. This improves throughput when the queue is not empty. 
Since in the prototype multiple threads can enqueue into the same queue, 
I thought this would improve throughput in such situations.
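
A minimal sketch of that fast path, under simplifying assumptions (LIFO 
order, a made-up class, not the patched ReferenceQueue): enqueue and poll 
use CAS only, and the monitor is touched solely when a consumer is 
actually blocked waiting.

import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

class CasQueue<T> {
    private static final class Node<T> {
        final T value;
        Node<T> next;
        Node(T value) { this.value = value; }
    }

    private final AtomicReference<Node<T>> head = new AtomicReference<>();
    private final AtomicInteger waiters = new AtomicInteger();
    private final Object lock = new Object();

    void enqueue(T value) {
        Node<T> n = new Node<>(value);
        Node<T> h;
        do {                                     // lock-free CAS push
            h = head.get();
            n.next = h;
        } while (!head.compareAndSet(h, n));
        if (waiters.get() > 0) {                 // lock only if someone is blocked
            synchronized (lock) {
                lock.notifyAll();
            }
        }
    }

    T poll() {                                   // lock-free CAS pop, null if empty
        Node<T> h;
        do {
            h = head.get();
            if (h == null) {
                return null;
            }
        } while (!head.compareAndSet(h, h.next));
        return h.value;
    }

    T take() throws InterruptedException {       // blocking removal
        T v = poll();
        if (v != null) {
            return v;
        }
        waiters.incrementAndGet();
        try {
            synchronized (lock) {
                while ((v = poll()) == null) {
                    lock.wait();
                }
            }
            return v;
        } finally {
            waiters.decrementAndGet();
        }
    }
}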

Comments, suggestions, criticism are welcome.

Regards, Peter



