RFR: 8055232 (ref) Exceptions while processing Reference pending list
Hi, This story has a long tail. It started with: https://bugs.openjdk.java.net/browse/JDK-7038914 Some stress tests triggered OOME in ReferenceHandler thread which would die. The first attempt at fixing this was the following discussion: https://www.mail-archive.com/core-libs-dev%40openjdk.java.net/msg16250.html Which resulted in patch: http://hg.openjdk.java.net/jdk8/jdk8/jdk/rev/0b8dab7fec54 This assumed that ReferenceHandler thread doing Object.wait() could be interrupted (by stress test - normal application don't do that) and failed to allocate InterruptedException object. A jtreg test was designed which triggered that situation and a fix would catch OOME and ignore it. But the stress tests (the same or some other, I don't know) apparently were not entirely happy with this fix. The following discussion describes this: https://www.mail-archive.com/core-libs-dev%40openjdk.java.net/msg23596.html The other "unprotected" point at which OOME could be thrown and was later confirmed by debugging is (r instanceof Cleaner) test. The assumption was that it could trigger Cleaner class loading at 1st execution which would cause OOME to be thrown. The fix that finally silenced stress tests was the following: http://hg.openjdk.java.net/jdk9/dev/jdk/rev/d04102f69d46 This part of code (the j.l.r.Reference class and its members) has undergone further changes afterwards for fixing another bug: https://bugs.openjdk.java.net/browse/JDK-6857566 With following patch: http://hg.openjdk.java.net/jdk9/jdk9/jdk/rev/9934d34ed3c0 But this did not change the code paths - just factored-out the content of the loop into a separate method that could be used from outside. All well until kim.barrett@oracle.com noticed that the 2nd fix introduced a potentially illegal situation. There is a j.l.r.Reference.Lock inner class and a singleton object assigned to static field in j.l.Reference class with the following notice: /* Object used to synchronize with the garbage collector. The collector * must acquire this lock at the beginning of each collection cycle. It is * therefore critical that any code holding this lock complete as quickly * as possible, allocate no new objects, and avoid calling user code. */ static private class Lock { } private static Lock lock = new Lock(); The conflicting part is "allocate no new objects". Catching OOME inside a synchronized block holding this lock implies that new objects could be allocated. I have a feeling that the 2nd fix prevented that by pre-loading the Cleaner class at Reference class initialization time. But because it was hard to reproduce the situation where OOME was thrown from (r instanceof Cleaner) check, we nevertheless handled this hypothetical situation. Perhaps it would be better that we didn't and just see if OOME returned after just adding the pre-loading of Cleaner class... So here we are, at an attempt to clean this up: https://bugs.openjdk.java.net/browse/JDK-8055232 We can move (r instanceof Cleaner) check outside of synchronized block to where it was before the 2nd fix and wait what stress tests will show. Another possibility is to move the instanceof check outside of synchronized block, but handle the hypothetical OOME by re-linking the unlinked reference back into the pending chain: http://cr.openjdk.java.net/~plevart/jdk9-dev/ReferenceHandlerExceptions/webr... What would you suggest? Regards, Peter
Hi Peter, Two comments: 1. Doing the Object.wait() violates the thou-shalt-not-allocate rule, due to the potential for creating the InterruptedException. But there is no way to avoid that. 2. I don't see how the instanceof can still result in OOME if we have pre-loaded and initialized the relevant classes. So I'm not sure what it is that needs fixing now ?? Note that if the OOME is thrown while holding the lock it means we haven't deadlocked with the GC. David On 19/09/2014 4:43 PM, Peter Levart wrote:
Hi,
This story has a long tail. It started with:
https://bugs.openjdk.java.net/browse/JDK-7038914
Some stress tests triggered OOME in ReferenceHandler thread which would die. The first attempt at fixing this was the following discussion:
https://www.mail-archive.com/core-libs-dev%40openjdk.java.net/msg16250.html
Which resulted in patch:
http://hg.openjdk.java.net/jdk8/jdk8/jdk/rev/0b8dab7fec54
This assumed that ReferenceHandler thread doing Object.wait() could be interrupted (by stress test - normal application don't do that) and failed to allocate InterruptedException object. A jtreg test was designed which triggered that situation and a fix would catch OOME and ignore it.
But the stress tests (the same or some other, I don't know) apparently were not entirely happy with this fix. The following discussion describes this:
https://www.mail-archive.com/core-libs-dev%40openjdk.java.net/msg23596.html
The other "unprotected" point at which OOME could be thrown and was later confirmed by debugging is (r instanceof Cleaner) test. The assumption was that it could trigger Cleaner class loading at 1st execution which would cause OOME to be thrown. The fix that finally silenced stress tests was the following:
http://hg.openjdk.java.net/jdk9/dev/jdk/rev/d04102f69d46
This part of code (the j.l.r.Reference class and its members) has undergone further changes afterwards for fixing another bug:
https://bugs.openjdk.java.net/browse/JDK-6857566
With following patch:
http://hg.openjdk.java.net/jdk9/jdk9/jdk/rev/9934d34ed3c0
But this did not change the code paths - just factored-out the content of the loop into a separate method that could be used from outside.
All well until kim.barrett@oracle.com noticed that the 2nd fix introduced a potentially illegal situation. There is a j.l.r.Reference.Lock inner class and a singleton object assigned to static field in j.l.Reference class with the following notice:
/* Object used to synchronize with the garbage collector. The collector * must acquire this lock at the beginning of each collection cycle. It is * therefore critical that any code holding this lock complete as quickly * as possible, allocate no new objects, and avoid calling user code. */ static private class Lock { } private static Lock lock = new Lock();
The conflicting part is "allocate no new objects". Catching OOME inside a synchronized block holding this lock implies that new objects could be allocated. I have a feeling that the 2nd fix prevented that by pre-loading the Cleaner class at Reference class initialization time. But because it was hard to reproduce the situation where OOME was thrown from (r instanceof Cleaner) check, we nevertheless handled this hypothetical situation. Perhaps it would be better that we didn't and just see if OOME returned after just adding the pre-loading of Cleaner class...
So here we are, at an attempt to clean this up:
https://bugs.openjdk.java.net/browse/JDK-8055232
We can move (r instanceof Cleaner) check outside of synchronized block to where it was before the 2nd fix and wait what stress tests will show. Another possibility is to move the instanceof check outside of synchronized block, but handle the hypothetical OOME by re-linking the unlinked reference back into the pending chain:
http://cr.openjdk.java.net/~plevart/jdk9-dev/ReferenceHandlerExceptions/webr...
What would you suggest?
Regards, Peter
On 09/19/2014 09:06 AM, David Holmes wrote:
Hi Peter,
Two comments:
1. Doing the Object.wait() violates the thou-shalt-not-allocate rule, due to the potential for creating the InterruptedException. But there is no way to avoid that.
It depends on when the InterruptedException is allocated. While Object.wait()-ing, the lock is not held. If, while waiting, the interruption is detected, it could be that InterruptedException is allocated before the lock is attempted to be re-gained. This can easily be verified - stay tuned...
2. I don't see how the instanceof can still result in OOME if we have pre-loaded and initialized the relevant classes.
I don't see either. So you're more for a variant that we move instanceof check outside the synchronized block to where it was before and wait for stress tests to show if we're right? This might be a cleaner way to final solution...
So I'm not sure what it is that needs fixing now ?? Note that if the OOME is thrown while holding the lock it means we haven't deadlocked with the GC.
This might be an indication that there's not any allocation taking place in instanceof check any more (wait is different, since it's releasing lock temporarily). If it was taking place, the stress tests that detected OOME before, would detect deadlock now, right? I understand that this is more of a clean-up attempt than fixing any real issue. Regards, Peter
David
On 19/09/2014 4:43 PM, Peter Levart wrote:
Hi,
This story has a long tail. It started with:
https://bugs.openjdk.java.net/browse/JDK-7038914
Some stress tests triggered OOME in ReferenceHandler thread which would die. The first attempt at fixing this was the following discussion:
https://www.mail-archive.com/core-libs-dev%40openjdk.java.net/msg16250.html
Which resulted in patch:
http://hg.openjdk.java.net/jdk8/jdk8/jdk/rev/0b8dab7fec54
This assumed that ReferenceHandler thread doing Object.wait() could be interrupted (by stress test - normal application don't do that) and failed to allocate InterruptedException object. A jtreg test was designed which triggered that situation and a fix would catch OOME and ignore it.
But the stress tests (the same or some other, I don't know) apparently were not entirely happy with this fix. The following discussion describes this:
https://www.mail-archive.com/core-libs-dev%40openjdk.java.net/msg23596.html
The other "unprotected" point at which OOME could be thrown and was later confirmed by debugging is (r instanceof Cleaner) test. The assumption was that it could trigger Cleaner class loading at 1st execution which would cause OOME to be thrown. The fix that finally silenced stress tests was the following:
http://hg.openjdk.java.net/jdk9/dev/jdk/rev/d04102f69d46
This part of code (the j.l.r.Reference class and its members) has undergone further changes afterwards for fixing another bug:
https://bugs.openjdk.java.net/browse/JDK-6857566
With following patch:
http://hg.openjdk.java.net/jdk9/jdk9/jdk/rev/9934d34ed3c0
But this did not change the code paths - just factored-out the content of the loop into a separate method that could be used from outside.
All well until kim.barrett@oracle.com noticed that the 2nd fix introduced a potentially illegal situation. There is a j.l.r.Reference.Lock inner class and a singleton object assigned to static field in j.l.Reference class with the following notice:
/* Object used to synchronize with the garbage collector. The collector * must acquire this lock at the beginning of each collection cycle. It is * therefore critical that any code holding this lock complete as quickly * as possible, allocate no new objects, and avoid calling user code. */ static private class Lock { } private static Lock lock = new Lock();
The conflicting part is "allocate no new objects". Catching OOME inside a synchronized block holding this lock implies that new objects could be allocated. I have a feeling that the 2nd fix prevented that by pre-loading the Cleaner class at Reference class initialization time. But because it was hard to reproduce the situation where OOME was thrown from (r instanceof Cleaner) check, we nevertheless handled this hypothetical situation. Perhaps it would be better that we didn't and just see if OOME returned after just adding the pre-loading of Cleaner class...
So here we are, at an attempt to clean this up:
https://bugs.openjdk.java.net/browse/JDK-8055232
We can move (r instanceof Cleaner) check outside of synchronized block to where it was before the 2nd fix and wait what stress tests will show. Another possibility is to move the instanceof check outside of synchronized block, but handle the hypothetical OOME by re-linking the unlinked reference back into the pending chain:
http://cr.openjdk.java.net/~plevart/jdk9-dev/ReferenceHandlerExceptions/webr...
What would you suggest?
Regards, Peter
On 19/09/2014 5:35 PM, Peter Levart wrote:
On 09/19/2014 09:06 AM, David Holmes wrote:
Hi Peter,
Two comments:
1. Doing the Object.wait() violates the thou-shalt-not-allocate rule, due to the potential for creating the InterruptedException. But there is no way to avoid that.
It depends on when the InterruptedException is allocated. While Object.wait()-ing, the lock is not held. If, while waiting, the interruption is detected, it could be that InterruptedException is allocated before the lock is attempted to be re-gained. This can easily be verified - stay tuned...
Allocation happens after the monitor is re-acquired.
2. I don't see how the instanceof can still result in OOME if we have pre-loaded and initialized the relevant classes.
I don't see either. So you're more for a variant that we move instanceof check outside the synchronized block to where it was before and wait for stress tests to show if we're right? This might be a cleaner way to final solution...
I think I see what you are saying now ...
So I'm not sure what it is that needs fixing now ?? Note that if the OOME is thrown while holding the lock it means we haven't deadlocked with the GC.
This might be an indication that there's not any allocation taking place in instanceof check any more (wait is different, since it's releasing lock temporarily). If it was taking place, the stress tests that detected OOME before, would detect deadlock now, right?
There's only a potential for deadlock, it isn't inevitable. You'd need to see when the GC needs to acquire this lock.
I understand that this is more of a clean-up attempt than fixing any real issue.
I'm inclined to take the "if it ain't broke ..." position on this. But I do see your point. Cheers, David
Regards, Peter
David
On 19/09/2014 4:43 PM, Peter Levart wrote:
Hi,
This story has a long tail. It started with:
https://bugs.openjdk.java.net/browse/JDK-7038914
Some stress tests triggered OOME in ReferenceHandler thread which would die. The first attempt at fixing this was the following discussion:
https://www.mail-archive.com/core-libs-dev%40openjdk.java.net/msg16250.html
Which resulted in patch:
http://hg.openjdk.java.net/jdk8/jdk8/jdk/rev/0b8dab7fec54
This assumed that ReferenceHandler thread doing Object.wait() could be interrupted (by stress test - normal application don't do that) and failed to allocate InterruptedException object. A jtreg test was designed which triggered that situation and a fix would catch OOME and ignore it.
But the stress tests (the same or some other, I don't know) apparently were not entirely happy with this fix. The following discussion describes this:
https://www.mail-archive.com/core-libs-dev%40openjdk.java.net/msg23596.html
The other "unprotected" point at which OOME could be thrown and was later confirmed by debugging is (r instanceof Cleaner) test. The assumption was that it could trigger Cleaner class loading at 1st execution which would cause OOME to be thrown. The fix that finally silenced stress tests was the following:
http://hg.openjdk.java.net/jdk9/dev/jdk/rev/d04102f69d46
This part of code (the j.l.r.Reference class and its members) has undergone further changes afterwards for fixing another bug:
https://bugs.openjdk.java.net/browse/JDK-6857566
With following patch:
http://hg.openjdk.java.net/jdk9/jdk9/jdk/rev/9934d34ed3c0
But this did not change the code paths - just factored-out the content of the loop into a separate method that could be used from outside.
All well until kim.barrett@oracle.com noticed that the 2nd fix introduced a potentially illegal situation. There is a j.l.r.Reference.Lock inner class and a singleton object assigned to static field in j.l.Reference class with the following notice:
/* Object used to synchronize with the garbage collector. The collector * must acquire this lock at the beginning of each collection cycle. It is * therefore critical that any code holding this lock complete as quickly * as possible, allocate no new objects, and avoid calling user code. */ static private class Lock { } private static Lock lock = new Lock();
The conflicting part is "allocate no new objects". Catching OOME inside a synchronized block holding this lock implies that new objects could be allocated. I have a feeling that the 2nd fix prevented that by pre-loading the Cleaner class at Reference class initialization time. But because it was hard to reproduce the situation where OOME was thrown from (r instanceof Cleaner) check, we nevertheless handled this hypothetical situation. Perhaps it would be better that we didn't and just see if OOME returned after just adding the pre-loading of Cleaner class...
So here we are, at an attempt to clean this up:
https://bugs.openjdk.java.net/browse/JDK-8055232
We can move (r instanceof Cleaner) check outside of synchronized block to where it was before the 2nd fix and wait what stress tests will show. Another possibility is to move the instanceof check outside of synchronized block, but handle the hypothetical OOME by re-linking the unlinked reference back into the pending chain:
http://cr.openjdk.java.net/~plevart/jdk9-dev/ReferenceHandlerExceptions/webr...
What would you suggest?
Regards, Peter
On 09/19/2014 09:45 AM, David Holmes wrote:
On 19/09/2014 5:35 PM, Peter Levart wrote:
On 09/19/2014 09:06 AM, David Holmes wrote:
Hi Peter,
Two comments:
1. Doing the Object.wait() violates the thou-shalt-not-allocate rule, due to the potential for creating the InterruptedException. But there is no way to avoid that.
It depends on when the InterruptedException is allocated. While Object.wait()-ing, the lock is not held. If, while waiting, the interruption is detected, it could be that InterruptedException is allocated before the lock is attempted to be re-gained. This can easily be verified - stay tuned...
Allocation happens after the monitor is re-acquired.
You're right. I have verified this with a simple test (a modified InterruptedException class). But this means that interrupting ReferenceHandler thread is already a violation of thou-shalt-not-allocate rule. Do you think that a re-design of VM <-> Java interface is warranted? Now that we have Unsafe, we could handle a queue of 'pending' references using CAS. There is a single blocking consumer of that queue - a ReferenceHandler thread. Other consumers (calling tryHandlePending() from nio.Bits) are non-blocking, just polling the queue. So there could be a single static: private static ReferenceHandler waiter; field in Reference class, holding the reference to ReferenceHandler thread while it is Unsafe.park()-ed - no allocation necessary. The VM could just unpark the waiter after CAS-ing new references on the pending list... But that's another issue. We're now asking ourselves if "instanceof" call is in need of a clean-up, not wait(). Regards, Peter
2. I don't see how the instanceof can still result in OOME if we have pre-loaded and initialized the relevant classes.
I don't see either. So you're more for a variant that we move instanceof check outside the synchronized block to where it was before and wait for stress tests to show if we're right? This might be a cleaner way to final solution...
I think I see what you are saying now ...
So I'm not sure what it is that needs fixing now ?? Note that if the OOME is thrown while holding the lock it means we haven't deadlocked with the GC.
This might be an indication that there's not any allocation taking place in instanceof check any more (wait is different, since it's releasing lock temporarily). If it was taking place, the stress tests that detected OOME before, would detect deadlock now, right?
There's only a potential for deadlock, it isn't inevitable. You'd need to see when the GC needs to acquire this lock.
I understand that this is more of a clean-up attempt than fixing any real issue.
I'm inclined to take the "if it ain't broke ..." position on this. But I do see your point.
Cheers, David
Regards, Peter
David
On 19/09/2014 4:43 PM, Peter Levart wrote:
Hi,
This story has a long tail. It started with:
https://bugs.openjdk.java.net/browse/JDK-7038914
Some stress tests triggered OOME in ReferenceHandler thread which would die. The first attempt at fixing this was the following discussion:
https://www.mail-archive.com/core-libs-dev%40openjdk.java.net/msg16250.html
Which resulted in patch:
http://hg.openjdk.java.net/jdk8/jdk8/jdk/rev/0b8dab7fec54
This assumed that ReferenceHandler thread doing Object.wait() could be interrupted (by stress test - normal application don't do that) and failed to allocate InterruptedException object. A jtreg test was designed which triggered that situation and a fix would catch OOME and ignore it.
But the stress tests (the same or some other, I don't know) apparently were not entirely happy with this fix. The following discussion describes this:
https://www.mail-archive.com/core-libs-dev%40openjdk.java.net/msg23596.html
The other "unprotected" point at which OOME could be thrown and was later confirmed by debugging is (r instanceof Cleaner) test. The assumption was that it could trigger Cleaner class loading at 1st execution which would cause OOME to be thrown. The fix that finally silenced stress tests was the following:
http://hg.openjdk.java.net/jdk9/dev/jdk/rev/d04102f69d46
This part of code (the j.l.r.Reference class and its members) has undergone further changes afterwards for fixing another bug:
https://bugs.openjdk.java.net/browse/JDK-6857566
With following patch:
http://hg.openjdk.java.net/jdk9/jdk9/jdk/rev/9934d34ed3c0
But this did not change the code paths - just factored-out the content of the loop into a separate method that could be used from outside.
All well until kim.barrett@oracle.com noticed that the 2nd fix introduced a potentially illegal situation. There is a j.l.r.Reference.Lock inner class and a singleton object assigned to static field in j.l.Reference class with the following notice:
/* Object used to synchronize with the garbage collector. The collector * must acquire this lock at the beginning of each collection cycle. It is * therefore critical that any code holding this lock complete as quickly * as possible, allocate no new objects, and avoid calling user code. */ static private class Lock { } private static Lock lock = new Lock();
The conflicting part is "allocate no new objects". Catching OOME inside a synchronized block holding this lock implies that new objects could be allocated. I have a feeling that the 2nd fix prevented that by pre-loading the Cleaner class at Reference class initialization time. But because it was hard to reproduce the situation where OOME was thrown from (r instanceof Cleaner) check, we nevertheless handled this hypothetical situation. Perhaps it would be better that we didn't and just see if OOME returned after just adding the pre-loading of Cleaner class...
So here we are, at an attempt to clean this up:
https://bugs.openjdk.java.net/browse/JDK-8055232
We can move (r instanceof Cleaner) check outside of synchronized block to where it was before the 2nd fix and wait what stress tests will show. Another possibility is to move the instanceof check outside of synchronized block, but handle the hypothetical OOME by re-linking the unlinked reference back into the pending chain:
http://cr.openjdk.java.net/~plevart/jdk9-dev/ReferenceHandlerExceptions/webr...
What would you suggest?
Regards, Peter
participants (2)
-
David Holmes
-
Peter Levart