RFR (XS) 8161280 - assert failed: reference count underflow for symbol

Wed Aug 24 02:54:01 UTC 2016

On 24/08/2016 12:24 PM, Ioi Lam wrote:
>
>
> On 8/23/16 6:51 PM, David Holmes wrote:
>> On 24/08/2016 11:32 AM, Ioi Lam wrote:
>>> On 8/23/16 6:08 PM, David Holmes wrote:
>>>> On 24/08/2016 11:03 AM, Ioi Lam wrote:
>>>>> Hi Coleen, thanks for suggestion the simplification:
>>>>>
>>>>> void Symbol::decrement_refcount() {
>>>>> #ifdef ASSERT
>>>>>   if (_refcount == 0) {
>>>>>       print();
>>>>>       assert(false, "reference count underflow for symbol");
>>>>>     }
>>>>>   }
>>>>> #endif
>>>>>   Atomic::dec(&_refcount);
>>>>> }
>>>>>
>>>>> There's a race condition that won't detect the underflow. E.g.,
>>>>> refcount
>>>>> is 1. Two threads comes in and decrement at the same time. We will end
>>>>> up with -1.
>>>>
>>>> So if we're going this path then you can get rid of the race by using
>>>> Atomic::add(&_refcount, -1), which returns the updated value. If you
>>>> get back -1 then assert.
>>>>
>>>
>>> The problem is we allow -1 to mean "permanent". Symbols in the CDS
>>> archive are mapped read-only and have refcount==-1. If we decrement them
>>> we get a SEGV.
>>
>> Unless you set them using the decrement_refcount function I don't see
>> how this is a problem. The assertion would only trigger if we
>> decrement from zero to -1.
>>
>>> Also, if we blindly decrement the refcount, we could end up rolling back
>>> all the way from 0xffff to 0x0, at which point we may free the Symbol
>>> which may still be in use.
>>
>> ?? As soon as a bad decrement happens the assert is  trigerred and the
>> VM is dead.
>
> The assert doesn't happen in product VM. Currently the product VM will
> continue to work, and all symbols with underflown/overflown refcounts
> will be considered as permanent.

Okay - note this bug is about an assertion failure :)

So there are two problems:

1. There is a race related to the assert that the Atomic::add(-1) can fix.

2. There is the product-mode unconditional decrementing of the refcount 
which will mark everything as "permanent". I haven't see any suggestions 
to address that existing issue. One obvious suggestion is to use a 
different value than -1 so it doesn't easily get hit by unexpected 
underflow.

David

> - Ioi
>
>
>>
>> David
>> -----
>>
>>> The fundamental problem is we need a two-step operation and we avoid
>>> proper synchronization (for performance/avoiding deadlock/etc). So we
>>> will always have a race condition somewhere, and we just need to allow
>>> only the benign ones.
>>>
>>> - Ioi
>>>
>>>
>>>> Cheers,
>>>> David
>>>>
>>>>> However, it's not worse than before. The old version also has a race
>>>>> condition:
>>>>>
>>>>>     refcount is 0
>>>>>     thread A decrements
>>>>>     thread B increments
>>>>>     thread A checks for underflow
>>>>>
>>>>> the decrementing thread will read _refcount==0 at the end so it won't
>>>>> detect the (transient) underflow.
>>>>>
>>>>> I think the failure to detect underflow is fine, since this happens
>>>>> only
>>>>> with concurrent access. The kinds of underflow that we are interested
>>>>> usually can be caught in single-threaded situations.
>>>>>
>>>>> Thanks
>>>>> - Ioi
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 8/23/16 4:24 PM, Coleen Phillimore wrote:
>>>>>>
>>>>>> This doesn't make sense for me and I have to go in gdb to print out
>>>>>> what -16384 is.   It appears that this is trying to detect that we
>>>>>> went below zero from zero, which is an error, but this isn't clear at
>>>>>> all.
>>>>>>
>>>>>> It seems that
>>>>>>
>>>>>>    if (_refcount >= 0) {
>>>>>>
>>>>>>
>>>>>> Should be > 0 and we should assert if this is ever zero instead, and
>>>>>> allow anything negative to mean that this count has gone immortal.
>>>>>>
>>>>>> Kim thought it should use CAS rather than atomic increment and
>>>>>> decrement, but maybe that isn't necessary, especially since there
>>>>>> isn't a short version of cmpxchg.
>>>>>>
>>>>>> thanks,
>>>>>> Coleen
>>>>>>
>>>>>> On 8/23/16 6:01 AM, Ioi Lam wrote:
>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8161280
>>>>>>> http://cr.openjdk.java.net/~iklam/jdk9/8161280-symbol-refcount-underflow.v01/
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Summary:
>>>>>>>
>>>>>>> The test was loading a lot of JCK classes into the same VM. Many of
>>>>>>> the JCK classes refer to "javasoft/sqe/javatest/Status", so the
>>>>>>> refcount (a signed short integer) of this Symbol would run up and
>>>>>>> past 0x7fff.
>>>>>>>
>>>>>>> The assert was caused by a race condition: the refcount started with
>>>>>>> a large (16-bit) positive value such as 0x7fff, one thread is
>>>>>>> decrementing and several other threads are incrementing. The
>>>>>>> refcount
>>>>>>> will end up being 0x8000 or slightly higher (limited to the
>>>>>>> number of
>>>>>>> concurrent threads that are running within a small window of several
>>>>>>> instructions in the decrementing thread, so most likely it will be
>>>>>>> 0x800?).
>>>>>>>
>>>>>>> As a result, the decrementing thread found that the refecount is
>>>>>>> negative after the operation, and thought that an underflow had
>>>>>>> happened.
>>>>>>>
>>>>>>> The fix is to ignore any value that may appear in the [0x8000 -
>>>>>>> 0xbfff] range and do not flag these as underflows (since they are
>>>>>>> most likely overflows -- overflows are already handled by making the
>>>>>>> Symbol permanent).
>>>>>>>
>>>>>>> Thanks
>>>>>>> - Ioi
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>
>