RFR (XS) 8161280 - assert failed: reference count underflow for symbol
Ioi Lam
ioi.lam at oracle.com
Wed Aug 24 03:44:29 UTC 2016
On 8/23/16 7:54 PM, David Holmes wrote:
> On 24/08/2016 12:24 PM, Ioi Lam wrote:
>>
>>
>> On 8/23/16 6:51 PM, David Holmes wrote:
>>> On 24/08/2016 11:32 AM, Ioi Lam wrote:
>>>> On 8/23/16 6:08 PM, David Holmes wrote:
>>>>> On 24/08/2016 11:03 AM, Ioi Lam wrote:
>>>>>> Hi Coleen, thanks for suggestion the simplification:
>>>>>>
>>>>>> void Symbol::decrement_refcount() {
>>>>>> #ifdef ASSERT
>>>>>> if (_refcount == 0) {
>>>>>> print();
>>>>>> assert(false, "reference count underflow for symbol");
>>>>>> }
>>>>>> }
>>>>>> #endif
>>>>>> Atomic::dec(&_refcount);
>>>>>> }
>>>>>>
>>>>>> There's a race condition that won't detect the underflow. E.g.,
>>>>>> refcount
>>>>>> is 1. Two threads comes in and decrement at the same time. We
>>>>>> will end
>>>>>> up with -1.
>>>>>
>>>>> So if we're going this path then you can get rid of the race by using
>>>>> Atomic::add(&_refcount, -1), which returns the updated value. If you
>>>>> get back -1 then assert.
>>>>>
>>>>
>>>> The problem is we allow -1 to mean "permanent". Symbols in the CDS
>>>> archive are mapped read-only and have refcount==-1. If we decrement
>>>> them
>>>> we get a SEGV.
>>>
>>> Unless you set them using the decrement_refcount function I don't see
>>> how this is a problem. The assertion would only trigger if we
>>> decrement from zero to -1.
>>>
>>>> Also, if we blindly decrement the refcount, we could end up rolling
>>>> back
>>>> all the way from 0xffff to 0x0, at which point we may free the Symbol
>>>> which may still be in use.
>>>
>>> ?? As soon as a bad decrement happens the assert is trigerred and the
>>> VM is dead.
>>
>> The assert doesn't happen in product VM. Currently the product VM will
>> continue to work, and all symbols with underflown/overflown refcounts
>> will be considered as permanent.
>
> Okay - note this bug is about an assertion failure :)
>
> So there are two problems:
>
> 1. There is a race related to the assert that the Atomic::add(-1) can
> fix.
>
> 2. There is the product-mode unconditional decrementing of the
> refcount which will mark everything as "permanent". I haven't see any
> suggestions to address that existing issue. One obvious suggestion is
> to use a different value than -1 so it doesn't easily get hit by
> unexpected underflow.
>
Here's an updated version that's an improvement over my original patch:
+ A negative refcount means permanent symbols.
+ Only non-permanent symbols can be incremented/decremented
+ Underflow is detected by 0 -> -1 transition. Real underflows are
always caught.
+ Theoretically there's still a race condition for false asserts:
[1] thread A observes that refcount is 0x7fff, proceeds to decrement it
[2] but in the mean time, around 32768 threads come in. They all observe
that the refcount is non-negative, and then all proceed to increment the
refcount to 0x0.
[3] thread A decrements the refcount and observes that refcount is -1
afterwards
but I think this can be safely ignored :-)
http://cr.openjdk.java.net/~iklam/jdk9/8161280-symbol-refcount-underflow.v02/
I like this version as it doesn't change the logic for the product VM
and only limits the possibility of false asserts.
Thanks
- Ioi
> David
>
>> - Ioi
>>
>>
>>>
>>> David
>>> -----
>>>
>>>> The fundamental problem is we need a two-step operation and we avoid
>>>> proper synchronization (for performance/avoiding deadlock/etc). So we
>>>> will always have a race condition somewhere, and we just need to allow
>>>> only the benign ones.
>>>>
>>>> - Ioi
>>>>
>>>>
>>>>> Cheers,
>>>>> David
>>>>>
>>>>>> However, it's not worse than before. The old version also has a race
>>>>>> condition:
>>>>>>
>>>>>> refcount is 0
>>>>>> thread A decrements
>>>>>> thread B increments
>>>>>> thread A checks for underflow
>>>>>>
>>>>>> the decrementing thread will read _refcount==0 at the end so it
>>>>>> won't
>>>>>> detect the (transient) underflow.
>>>>>>
>>>>>> I think the failure to detect underflow is fine, since this happens
>>>>>> only
>>>>>> with concurrent access. The kinds of underflow that we are
>>>>>> interested
>>>>>> usually can be caught in single-threaded situations.
>>>>>>
>>>>>> Thanks
>>>>>> - Ioi
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 8/23/16 4:24 PM, Coleen Phillimore wrote:
>>>>>>>
>>>>>>> This doesn't make sense for me and I have to go in gdb to print out
>>>>>>> what -16384 is. It appears that this is trying to detect that we
>>>>>>> went below zero from zero, which is an error, but this isn't
>>>>>>> clear at
>>>>>>> all.
>>>>>>>
>>>>>>> It seems that
>>>>>>>
>>>>>>> if (_refcount >= 0) {
>>>>>>>
>>>>>>>
>>>>>>> Should be > 0 and we should assert if this is ever zero instead,
>>>>>>> and
>>>>>>> allow anything negative to mean that this count has gone immortal.
>>>>>>>
>>>>>>> Kim thought it should use CAS rather than atomic increment and
>>>>>>> decrement, but maybe that isn't necessary, especially since there
>>>>>>> isn't a short version of cmpxchg.
>>>>>>>
>>>>>>> thanks,
>>>>>>> Coleen
>>>>>>>
>>>>>>> On 8/23/16 6:01 AM, Ioi Lam wrote:
>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8161280
>>>>>>>> http://cr.openjdk.java.net/~iklam/jdk9/8161280-symbol-refcount-underflow.v01/
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Summary:
>>>>>>>>
>>>>>>>> The test was loading a lot of JCK classes into the same VM.
>>>>>>>> Many of
>>>>>>>> the JCK classes refer to "javasoft/sqe/javatest/Status", so the
>>>>>>>> refcount (a signed short integer) of this Symbol would run up and
>>>>>>>> past 0x7fff.
>>>>>>>>
>>>>>>>> The assert was caused by a race condition: the refcount started
>>>>>>>> with
>>>>>>>> a large (16-bit) positive value such as 0x7fff, one thread is
>>>>>>>> decrementing and several other threads are incrementing. The
>>>>>>>> refcount
>>>>>>>> will end up being 0x8000 or slightly higher (limited to the
>>>>>>>> number of
>>>>>>>> concurrent threads that are running within a small window of
>>>>>>>> several
>>>>>>>> instructions in the decrementing thread, so most likely it will be
>>>>>>>> 0x800?).
>>>>>>>>
>>>>>>>> As a result, the decrementing thread found that the refecount is
>>>>>>>> negative after the operation, and thought that an underflow had
>>>>>>>> happened.
>>>>>>>>
>>>>>>>> The fix is to ignore any value that may appear in the [0x8000 -
>>>>>>>> 0xbfff] range and do not flag these as underflows (since they are
>>>>>>>> most likely overflows -- overflows are already handled by
>>>>>>>> making the
>>>>>>>> Symbol permanent).
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> - Ioi
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>
More information about the hotspot-runtime-dev
mailing list