JEP 132: More-prompt finalization

Rezaei, Mohammad A. Mohammad.Rezaei at gs.com
Fri May 29 17:20:49 UTC 2015


For what it's worth, I fully agree with David and Kirk that finalization doesn't necessarily need this treatment.

However, I was hoping this would have the effect of improving (non-finalizable) reference handling. We've seen serious issues in WeakReference handling and have had to write some twisted code to deal with this.

So I guess the question I have for Kirk and David is: do you feel a GC load of 10K WeakReferences per cycle is also a case of "doing something else wrong"?

Sorry if this is going off-topic.

Thanks
Moh

>-----Original Message-----
>From: core-libs-dev [mailto:core-libs-dev-bounces at openjdk.java.net] On Behalf
>Of Kirk Pepperdine
>Sent: Thursday, May 28, 2015 11:58 PM
>To: David Holmes <david.holmes at oracle.com>
>Cc: hotspot-gc-dev at openjdk.java.net; core-libs-dev at openjdk.java.net
>Subject: Re: JEP 132: More-prompt finalization
>
>Hi Peter,
>
>It is a very interesting proposal, but to further David's comments, the
>life-cycle cost of reference objects is horrendous, and the actual process of
>finalizing an object is only a fraction of that total cost. Unfortunately your
>micro-benchmark only focuses on one aspect of that cost; in other words, it
>isn't very representative of a real concern. In the real world the finalizer
>*must* compete with mutator threads, and since F-J is an "all threads on deck"
>implementation, it doesn't play well with others. It creates a "tragedy of the
>commons": a situation where everyone behaves rationally with a common
>resource, but to the detriment of the whole group. In short, parallelizing
>(F-J-ing) *everything* in an application is simply not a good idea. We do not
>live in an infinite compute environment, which means we have to consider the
>impact of our actions on the entire group.
>
>This was one of the points of my recent article in Java Magazine, which I
>wrote to try to counter some of the rhetoric I was hearing at conferences
>about the universal benefits of being able to easily parallelize streams in
>Java 8. Yes, I agree it's a great feature, but it must be used with
>discretion. Case in point: after I finished writing the article, I started
>running into a couple of early adopters who had swallowed the parallel
>message whole and indiscriminately parallelized all of their streams. As you
>can imagine, they were quite surprised by the results and quickly worked to
>de-parallelize *all* of the streams in the application.
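>
>For instance (a contrived illustration of the antipattern, not code from the
>article):
>
>import java.util.List;
>
>public class ParallelEverywhere {
>    // A tiny list per call, yet parallelStream() pushes the work onto the
>    // shared common ForkJoinPool, where it competes with every other
>    // parallel stream (and anything else using that pool) in the process.
>    static int totalLength(List<String> words) {
>        return words.parallelStream()          // fork/join overhead dwarfs
>                    .mapToInt(String::length)  // the trivial per-element work
>                    .sum();
>    }
>}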
>
>To add some ability to parallelize the handling of reference objects seems
>like a good idea if you are collecting large numbers of reference objects
>(>10,000 per GC cycle). However, if you are collecting large numbers of
>reference objects, you're most likely doing something else wrong. IME,
>finalization is extremely useful, but really only for a limited number of use
>cases, and none of them (to date) has resulted in the app burning through
>thousands of finalizable objects per second.
>
>It would be interesting to know why you picked this particular issue.
>
>Kind regards,
>Kirk
>
>
>
>On May 29, 2015, at 5:18 AM, David Holmes <david.holmes at oracle.com> wrote:
>
>> Hi Peter,
>>
>> I guess I'm very concerned about the premise that finalization should
>> scale to millions of objects and be performed highly concurrently. To me
>> that's sending the wrong message about finalization. It also isn't the
>> most effective use of CPU resources - most people would want to do useful
>> work on most CPUs most of the time.
>>
>> Cheers,
>> David
>>
>> On 29/05/2015 3:12 AM, Peter Levart wrote:
>>> Hi,
>>>
>>> Did you know that the following simple loop:
>>>
>>> public class FinalizableBottleneck {
>>>     static boolean no;
>>>
>>>     @Override
>>>     protected void finalize() throws Throwable {
>>>         // empty finalize() method does not make the object finalizable
>>>         // (it is not even registered on the finalizer's list)
>>>         if (no) {
>>>             throw new AssertionError();
>>>         }
>>>     }
>>>
>>>     public static void main(String[] args) {
>>>         while (true) {
>>>             new FinalizableBottleneck();
>>>         }
>>>     }
>>> }
>>>
>>>
>>> ...quickly fills the entire heap with FinalizableBottleneck and internal
>>> Finalizer objects and brings the JVM to a halt? After a few seconds of
>>> running the above program, jmap -histo:live reports:
>>>
>>>  num     #instances         #bytes  class name
>>> ----------------------------------------------
>>>    1:      50048325     2001933000  java.lang.ref.Finalizer
>>>    2:      50048278      800772448  FinalizableBottleneck
>>>
>>>
>>> There are a couple of bottlenecks that make this happen:
>>>
>>> - The ReferenceHandler thread synchronizes with the VM to unhook
>>> Reference(s) from the pending chain one by one and dispatches them to
>>> their respective ReferenceQueue(s), which also use synchronization for
>>> enqueueing each Reference.
>>> - Enqueueing synchronizes with the finalization thread, which removes the
>>> Finalizer(s) (FinalReferences) from the finalization queue and executes
>>> them.
>>> - Executing the Finalizer(s) removes them from the doubly-linked list of
>>> all Finalizer(s) (the list that keeps them alive until they are needed);
>>> this synchronizes with the threads that link new Finalizer(s) into the
>>> doubly-linked list as new finalizable objects get registered.
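>>>
>>> Schematically, that hand-off looks something like this (a simplified
>>> model with stand-in types, not the actual JDK sources):
>>>
>>> class PendingChainModel {
>>>     static final Object lock = new Object(); // shared with the VM
>>>     static Ref pending;                      // head of the pending chain
>>>
>>>     static class Ref {
>>>         Ref discovered;                      // next in the pending chain
>>>         Queue queue;
>>>     }
>>>
>>>     static class Queue {
>>>         synchronized void enqueue(Ref r) { /* link r in */ } // sync #2
>>>         synchronized Ref dequeue() { return null; }          // sync #3
>>>     }
>>>
>>>     // ReferenceHandler loop: unhook pending references one by one
>>>     static void referenceHandlerLoop() throws InterruptedException {
>>>         for (;;) {
>>>             Ref r;
>>>             synchronized (lock) {            // sync #1: with the VM
>>>                 while (pending == null) lock.wait();
>>>                 r = pending;
>>>                 pending = r.discovered;
>>>             }
>>>             r.queue.enqueue(r);              // sync #2: queue's monitor
>>>         }
>>>     }
>>>     // The finalization thread then dequeues from the same queue (sync
>>>     // #3) and unlinks each Finalizer from the global doubly-linked
>>>     // list, contending with registering threads (sync #4).
>>> }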
>>>
>>> We see that the creation of a finalizable object only takes one
>>> synchronization (registering into the doubly-linked list) and is
>>> performed synchronously, while finalization takes 4 synchronizations
>>> among 4 different threads (in pairs) and happens as the Finalizer
>>> instance "travels" from the VM thread to the ReferenceHandler thread and
>>> then to the finalization thread. No wonder finalization cannot keep up
>>> with allocation in a single thread. The situation is even worse when
>>> finalize() methods do some actual work.
>>>
>>> I have experimented with various approaches to widen these bottlenecks
>>> and found out that I cannot beat the ForkJoinPool when it is combined
>>> with some improvements to the internal data structures used in reference
>>> processing. Here's the prototype I came up with:
>>>
>>> http://cr.openjdk.java.net/~plevart/misc/JEP132/ReferenceHandling/webrev.01/
>>>
>>> And this is the benchmark I use for measuring the throughput:
>>>
>>>
>>> http://cr.openjdk.java.net/~plevart/misc/JEP132/ReferenceHandling/FinalizerThroughput.java
>>>
>>> The benchmark shows (results inline in the source) that with an
>>> unpatched JDK, on my PC (i7-2700K, Linux, JDK8) I cannot construct more
>>> than 1500 finalizable objects per ms in a single thread, and that while
>>> doing so, finalization only manages to process approx. 100-120 objects
>>> in the same time. Objects "in flight" quickly accumulate and bring the
>>> VM to a halt, where it is not doing anything but full GC cycles.
>>>
>>> When constructing in 4 threads, there's not much difference.
>>> Construction of finalizable objects simply doesn't scale.
>>>
>>> The patched JDK shows something completely different. Single-threaded
>>> construction achieves a rate of 3600 objects / ms. The number of
>>> "in-flight" objects is kept constant at about 5-6M instances, which
>>> amounts to approx. 1.5 s of allocation. I think this corresponds to the
>>> rate of GC cycles, during which the VM also processes the references.
>>> The benchmark also prints ForkJoinPool statistics, which show that the
>>> number of queued tasks is kept low.
>>>
>>> Increasing the number of allocating threads to 4 increases the
>>> allocation rate to about 4300 objects / ms, and finalization keeps up.
>>> Increasing it to 8 further increases the allocation rate to about 4600
>>> objects / ms, and finalization still keeps up. The increase in rate is
>>> not linear, but keep in mind that the i7 is a 4-core CPU.
>>>
>>> About the implementation...
>>>
>>> The 1st improvement I made was to the doubly-linked list of Finalizer
>>> instances that is used to keep them alive until they are needed. I
>>> ripped off the wonderful ConcurrentLinkedDeque by Doug Lea and Martin
>>> Buchholz, keeping just the internal link/unlink methods and specializing
>>> them to Finalizer entries (very straightforward). I experimented with
>>> throughput and got some improvement, but throughput increased much more
>>> when I used several instances of independent lists and distributed
>>> registrations among them randomly (unlinking is consequently also
>>> distributed randomly), as sketched below.
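>>>
>>> (Illustrative sketch only; the names and the list count are made up,
>>> and the real lists use the lock-free CLQ/CLD-derived link/unlink.)
>>>
>>> import java.util.concurrent.ThreadLocalRandom;
>>>
>>> class StripedFinalizerLists {
>>>     static final int NUM_LISTS = 8;      // e.g. on the order of #cores
>>>     static final FinalizerList[] LISTS = new FinalizerList[NUM_LISTS];
>>>     static {
>>>         for (int i = 0; i < NUM_LISTS; i++) LISTS[i] = new FinalizerList();
>>>     }
>>>
>>>     // Each new Finalizer registers with a randomly chosen list; since
>>>     // it later unlinks from that same list, unlinking spreads out too.
>>>     static FinalizerList randomList() {
>>>         return LISTS[ThreadLocalRandom.current().nextInt(NUM_LISTS)];
>>>     }
>>>
>>>     static class FinalizerList {
>>>         void link(Object finalizer)   { /* lock-free link */ }
>>>         void unlink(Object finalizer) { /* lock-free unlink */ }
>>>     }
>>> }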
>>>
>>> I found out that no matter how hard I try to optimize ReferenceQueue
>>> while keeping the API unchanged, I can only do so much, and that was not
>>> enough. I have been surprised by how well ForkJoinPool distributes tasks
>>> among threads, so I concluded that leveraging it is the best choice. I
>>> re-designed the pending-list unhooking loop to unhook pending references
>>> in chunks, which greatly improves the throughput. Since unhooking can
>>> only be performed by a single thread holding a lock (mandated by the
>>> interface between the VM and Java), I didn't employ multiple threads,
>>> but a single eternal ForkJoinTask that unhooks in chunks and forks off
>>> other tasks that process those chunks. When there are just a couple of
>>> References pending at one time and a not-full chunk is unhooked, the
>>> processing is performed by the same thread that unhooked the references,
>>> but when there are more, worker tasks are forked off and the unhooking
>>> thread continues in peace. This processing includes the execution of
>>> Cleaners, forking the finalizer tasks and enqueueing other references.
>>> Finalizer(s) are always executed as separate ForkJoinTask(s).
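>>>
>>> The loop is structured roughly like this (a heavily condensed sketch;
>>> the names and chunk size are illustrative, not the prototype code):
>>>
>>> import java.util.concurrent.ForkJoinPool;
>>> import java.util.concurrent.RecursiveAction;
>>>
>>> class ChunkedUnhookingSketch {
>>>     static final int CHUNK_SIZE = 256;   // illustrative
>>>
>>>     interface PendingSource {            // stand-in for the VM interface
>>>         Object[] unhookChunk(int max);   // caller holds the mandated
>>>     }                                    // lock; blocks if none pending
>>>
>>>     static void unhookLoop(PendingSource vm, ForkJoinPool pool) {
>>>         for (;;) {
>>>             Object[] chunk = vm.unhookChunk(CHUNK_SIZE);
>>>             if (chunk.length == CHUNK_SIZE) {
>>>                 pool.execute(new ProcessChunk(chunk)); // fork a worker
>>>             } else {
>>>                 new ProcessChunk(chunk).compute();  // few pending: inline
>>>             }
>>>         }
>>>     }
>>>
>>>     static class ProcessChunk extends RecursiveAction {
>>>         final Object[] refs;
>>>         ProcessChunk(Object[] refs) { this.refs = refs; }
>>>         protected void compute() {
>>>             // execute Cleaners, fork a task per Finalizer,
>>>             // enqueue the remaining references
>>>         }
>>>     }
>>> }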
>>>
>>> It's interesting how Runtime.runFinalizers() is implemented in this
>>> patch - it basically amounts to ForkJoinPool.awaitQuiescence() ...
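>>>
>>> i.e. roughly (a sketch, assuming all reference processing runs in one
>>> dedicated pool):
>>>
>>> import java.util.concurrent.ForkJoinPool;
>>> import java.util.concurrent.TimeUnit;
>>>
>>> class RunFinalizersSketch {
>>>     // when every pending Cleaner/Finalizer/enqueue is a ForkJoinTask,
>>>     // "run the finalizers" is just waiting for the pool to drain
>>>     static void runFinalization(ForkJoinPool referencePool) {
>>>         referencePool.awaitQuiescence(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
>>>     }
>>> }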
>>>
>>> I also tweaked the ReferenceQueue implementation a bit (it is still used
>>> for other kinds of references) so that it avoids synchronization with a
>>> monitor lock when there are no blocking waiters and uses CAS to
>>> enqueue/dequeue. This improves throughput when the queue is not empty.
>>> Since in the prototype multiple threads can enqueue into the same queue,
>>> I thought this would improve throughput in such situations.
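>>>
>>> The fast path is along these lines (a simplified sketch, not the
>>> patched ReferenceQueue itself):
>>>
>>> import java.util.concurrent.atomic.AtomicReference;
>>>
>>> class CasQueueSketch<T> {
>>>     static final class Node<T> {
>>>         final T item;
>>>         Node<T> next;
>>>         Node(T item) { this.item = item; }
>>>     }
>>>
>>>     private final AtomicReference<Node<T>> head = new AtomicReference<>();
>>>     private volatile int waiters;        // consumers blocked waiting
>>>
>>>     void enqueue(T item) {
>>>         Node<T> n = new Node<>(item);
>>>         do {                             // lock-free CAS retry loop
>>>             n.next = head.get();
>>>         } while (!head.compareAndSet(n.next, n));
>>>         if (waiters > 0) {               // take the monitor only when a
>>>             synchronized (this) {        // consumer is actually blocked
>>>                 notifyAll();
>>>             }
>>>         }
>>>     }
>>> }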
>>>
>>> Comments, suggestions, criticism are welcome.
>>>
>>> Regards, Peter
>>>



