RFR: 8055232 (ref) Exceptions while processing Reference pending list

Fri Sep 19 08:28:51 UTC 2014

On 09/19/2014 09:45 AM, David Holmes wrote:
> On 19/09/2014 5:35 PM, Peter Levart wrote:
>> On 09/19/2014 09:06 AM, David Holmes wrote:
>>> Hi Peter,
>>>
>>> Two comments:
>>>
>>> 1. Doing the Object.wait() violates the thou-shalt-not-allocate rule,
>>> due to the potential for creating the InterruptedException. But there
>>> is no way to avoid that.
>>
>> It depends on when the InterruptedException is allocated. While
>> Object.wait()-ing, the lock is not held. If, while waiting, the
>> interruption is detected, it could be that InterruptedException is
>> allocated before the lock is attempted to be re-gained. This can easily
>> be verified - stay tuned...
>
> Allocation happens after the monitor is re-acquired.

You're right. I have verified this with a simple test (a modified 
InterruptedException class). But this means that interrupting 
ReferenceHandler thread is already a violation of 
thou-shalt-not-allocate rule.

Do you think that a re-design of VM <-> Java interface is warranted?

Now that we have Unsafe, we could handle a queue of 'pending' references 
using CAS. There is a single blocking consumer of that queue - a 
ReferenceHandler thread. Other consumers (calling tryHandlePending() 
from nio.Bits) are non-blocking, just polling the queue. So there could 
be a single static:

private static ReferenceHandler waiter;

field in Reference class, holding the reference to ReferenceHandler 
thread while it is Unsafe.park()-ed - no allocation necessary. The VM 
could just unpark the waiter after CAS-ing new references on the pending 
list...

But that's another issue. We're now asking ourselves if "instanceof" 
call is in need of a clean-up, not wait().

Regards, Peter

>
>>>
>>> 2. I don't see how the instanceof can still result in OOME if we have
>>> pre-loaded and initialized the relevant classes.
>>
>> I don't see either. So you're more for a variant that we move instanceof
>> check outside the synchronized block to where it was before and wait for
>> stress tests to show if we're right? This might be a cleaner way to
>> final solution...
>
> I think I see what you are saying now ...
>
>>>
>>> So I'm not sure what it is that needs fixing now ?? Note that if the
>>> OOME is thrown while holding the lock it means we haven't deadlocked
>>> with the GC.
>>
>> This might be an indication that there's not any allocation taking place
>> in instanceof check any more (wait is different, since it's releasing
>> lock temporarily). If it was taking place, the stress tests that
>> detected OOME before, would detect deadlock now, right?
>
> There's only a potential for deadlock, it isn't inevitable. You'd need 
> to see when the GC needs to acquire this lock.
>
>> I understand that this is more of a clean-up attempt than fixing any
>> real issue.
>
> I'm inclined to take the "if it ain't broke ..." position on this. But 
> I do see your point.
>
> Cheers,
> David
>
>> Regards, Peter
>>
>>>
>>> David
>>>
>>> On 19/09/2014 4:43 PM, Peter Levart wrote:
>>>> Hi,
>>>>
>>>> This story has a long tail. It started with:
>>>>
>>>>      https://bugs.openjdk.java.net/browse/JDK-7038914
>>>>
>>>> Some stress tests triggered OOME in ReferenceHandler thread which 
>>>> would
>>>> die. The first attempt at fixing this was the following discussion:
>>>>
>>>> https://www.mail-archive.com/core-libs-dev%40openjdk.java.net/msg16250.html 
>>>>
>>>>
>>>>
>>>> Which resulted in patch:
>>>>
>>>> http://hg.openjdk.java.net/jdk8/jdk8/jdk/rev/0b8dab7fec54
>>>>
>>>> This assumed that ReferenceHandler thread doing Object.wait() could be
>>>> interrupted (by stress test - normal application don't do that) and
>>>> failed to allocate InterruptedException object. A jtreg test was
>>>> designed which triggered that situation and a fix would catch OOME and
>>>> ignore it.
>>>>
>>>> But the stress tests (the same or some other, I don't know) apparently
>>>> were not entirely happy with this fix. The following discussion
>>>> describes this:
>>>>
>>>> https://www.mail-archive.com/core-libs-dev%40openjdk.java.net/msg23596.html 
>>>>
>>>>
>>>>
>>>> The other "unprotected" point at which OOME could be thrown and was
>>>> later confirmed by debugging is (r instanceof Cleaner) test. The
>>>> assumption was that it could trigger Cleaner class loading at 1st
>>>> execution which would cause OOME to be thrown. The fix that finally
>>>> silenced stress tests was the following:
>>>>
>>>> http://hg.openjdk.java.net/jdk9/dev/jdk/rev/d04102f69d46
>>>>
>>>> This part of code (the j.l.r.Reference class and its members) has
>>>> undergone further changes afterwards for fixing another bug:
>>>>
>>>>      https://bugs.openjdk.java.net/browse/JDK-6857566
>>>>
>>>> With following patch:
>>>>
>>>> http://hg.openjdk.java.net/jdk9/jdk9/jdk/rev/9934d34ed3c0
>>>>
>>>> But this did not change the code paths - just factored-out the content
>>>> of the loop into a separate method that could be used from outside.
>>>>
>>>> All well until kim.barrett at oracle.com noticed that the 2nd fix
>>>> introduced a potentially illegal situation. There is a
>>>> j.l.r.Reference.Lock inner class and a singleton object assigned to
>>>> static field in j.l.Reference class with the following notice:
>>>>
>>>>      /* Object used to synchronize with the garbage collector. The
>>>> collector
>>>>       * must acquire this lock at the beginning of each collection
>>>> cycle.  It is
>>>>       * therefore critical that any code holding this lock complete as
>>>> quickly
>>>>       * as possible, allocate no new objects, and avoid calling user
>>>> code.
>>>>       */
>>>>      static private class Lock { }
>>>>      private static Lock lock = new Lock();
>>>>
>>>> The conflicting part is "allocate no new objects". Catching OOME 
>>>> inside
>>>> a synchronized block holding this lock implies that new objects 
>>>> could be
>>>> allocated. I have a feeling that the 2nd fix prevented that by
>>>> pre-loading the Cleaner class at Reference class initialization time.
>>>> But because it was hard to reproduce the situation where OOME was 
>>>> thrown
>>>> from (r instanceof Cleaner) check, we nevertheless handled this
>>>> hypothetical situation. Perhaps it would be better that we didn't and
>>>> just see if OOME returned after just adding the pre-loading of Cleaner
>>>> class...
>>>>
>>>> So here we are, at an attempt to clean this up:
>>>>
>>>>      https://bugs.openjdk.java.net/browse/JDK-8055232
>>>>
>>>> We can move (r instanceof Cleaner) check outside of synchronized block
>>>> to where it was before the 2nd fix and wait what stress tests will 
>>>> show.
>>>> Another possibility is to move the instanceof check outside of
>>>> synchronized block, but handle the hypothetical OOME by re-linking the
>>>> unlinked reference back into the pending chain:
>>>>
>>>> http://cr.openjdk.java.net/~plevart/jdk9-dev/ReferenceHandlerExceptions/webrev.01/ 
>>>>
>>>>
>>>>
>>>>
>>>> What would you suggest?
>>>>
>>>> Regards, Peter
>>>>
>>