RFR (XS) 8161280 - assert failed: reference count underflow for symbol

Ioi Lam ioi.lam at oracle.com
Wed Aug 24 12:37:10 UTC 2016



On 8/24/16 5:14 AM, Coleen Phillimore wrote:
>
> I still don't want us to observe a _refcount == 0 - this is always an 
> error.
>
> + if (_refcount >= 0) { // not a permanent symbol
>
>
> Can this be changed to something like:
>
> + volatile int ref = _refcount;
> + assert(ref != 0, "underflow");
>
> + if (ref > 0) { // not a permanent symbol
>
>
This will probably mess up product VM, as refcount will stay being zero. 
I'd rather have it fall negative and become a permanent symbol.

The current code will make the product VM more resilient even if 
underflow happens.

Thanks
- Ioi

> I really like the atomic::add checking return value.
>
> I'm working on a change for later that will eagerly delete symbols 
> whose refcounts go to zero.
>
> Thanks,
> Coleen
>
> On 8/24/16 4:28 AM, Ioi Lam wrote:
>>
>>
>> On 8/24/16 12:14 AM, David Holmes wrote:
>>> Hi Ioi,
>>>
>>> On 24/08/2016 5:01 PM, Ioi Lam wrote:
>>>> Hi David,
>>>>
>>>> Here's an updated version that added Atomic::add(jshort*, jshort) 
>>>> as you
>>>> suggested.
>>>
>>> Thanks. Looks good.
>>>
>>>> To appease the "unused" warnings, I just added (void)new_value.
>>>
>>> I think I prefer the DEBUG_ONLY version. :)
>>>
>>
>> I am not a big fan of ending DEBUG_ONLY with a "=", so I'll keep the 
>> code as is :-)
>>
>> - Ioi
>>
>>
>>> Cheers,
>>> David
>>>
>>>> http://cr.openjdk.java.net/~iklam/jdk9/8161280-symbol-refcount-underflow.v03/ 
>>>>
>>>>
>>>>
>>>> I am running RBT with "--test
>>>> hotspot/test/:hotspot_all,vm.parallel_class_loading,vm.runtime.testlist" 
>>>>
>>>> to make sure everything works.
>>>>
>>>> Thanks
>>>> - Ioi
>>>>
>>>> On 8/23/16 10:12 PM, David Holmes wrote:
>>>>> Hi Ioi,
>>>>>
>>>>> Sorry I have a problem here. I just realized there is no
>>>>> Atomic::add(jshort*, jshort). But I can't accept changing the
>>>>> Atomic::inc/dec API only for the jshort case (and why return jint 
>>>>> ??).
>>>>> So either the whole Atomic inc/dec API needs updating across the 
>>>>> board
>>>>> (not a small task) or else we need to introduce Atomic::add(jshort*,
>>>>> jshort). The latter seems a small task - the existing inc/dec
>>>>> implementations can call the new add version.
>>>>>
>>>>> Also in a product build might the "new_value" variable trigger an
>>>>> "unused" warning? If so an ugly option would be:
>>>>>
>>>>> DEBUG_ONLY(jshort new_value =) Atomic::dec(...);
>>>>> assert(new_value != -1, "...");
>>>>>
>>>>> Thanks,
>>>>> David
>>>>>
>>>>> On 24/08/2016 1:44 PM, Ioi Lam wrote:
>>>>>>
>>>>>>
>>>>>> On 8/23/16 7:54 PM, David Holmes wrote:
>>>>>>> On 24/08/2016 12:24 PM, Ioi Lam wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 8/23/16 6:51 PM, David Holmes wrote:
>>>>>>>>> On 24/08/2016 11:32 AM, Ioi Lam wrote:
>>>>>>>>>> On 8/23/16 6:08 PM, David Holmes wrote:
>>>>>>>>>>> On 24/08/2016 11:03 AM, Ioi Lam wrote:
>>>>>>>>>>>> Hi Coleen, thanks for suggestion the simplification:
>>>>>>>>>>>>
>>>>>>>>>>>> void Symbol::decrement_refcount() {
>>>>>>>>>>>> #ifdef ASSERT
>>>>>>>>>>>>   if (_refcount == 0) {
>>>>>>>>>>>>       print();
>>>>>>>>>>>>       assert(false, "reference count underflow for symbol");
>>>>>>>>>>>>     }
>>>>>>>>>>>>   }
>>>>>>>>>>>> #endif
>>>>>>>>>>>>   Atomic::dec(&_refcount);
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> There's a race condition that won't detect the underflow. 
>>>>>>>>>>>> E.g.,
>>>>>>>>>>>> refcount
>>>>>>>>>>>> is 1. Two threads comes in and decrement at the same time. We
>>>>>>>>>>>> will end
>>>>>>>>>>>> up with -1.
>>>>>>>>>>>
>>>>>>>>>>> So if we're going this path then you can get rid of the race by
>>>>>>>>>>> using
>>>>>>>>>>> Atomic::add(&_refcount, -1), which returns the updated 
>>>>>>>>>>> value. If
>>>>>>>>>>> you
>>>>>>>>>>> get back -1 then assert.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The problem is we allow -1 to mean "permanent". Symbols in 
>>>>>>>>>> the CDS
>>>>>>>>>> archive are mapped read-only and have refcount==-1. If we 
>>>>>>>>>> decrement
>>>>>>>>>> them
>>>>>>>>>> we get a SEGV.
>>>>>>>>>
>>>>>>>>> Unless you set them using the decrement_refcount function I 
>>>>>>>>> don't see
>>>>>>>>> how this is a problem. The assertion would only trigger if we
>>>>>>>>> decrement from zero to -1.
>>>>>>>>>
>>>>>>>>>> Also, if we blindly decrement the refcount, we could end up 
>>>>>>>>>> rolling
>>>>>>>>>> back
>>>>>>>>>> all the way from 0xffff to 0x0, at which point we may free the
>>>>>>>>>> Symbol
>>>>>>>>>> which may still be in use.
>>>>>>>>>
>>>>>>>>> ?? As soon as a bad decrement happens the assert is trigerred 
>>>>>>>>> and the
>>>>>>>>> VM is dead.
>>>>>>>>
>>>>>>>> The assert doesn't happen in product VM. Currently the product 
>>>>>>>> VM will
>>>>>>>> continue to work, and all symbols with underflown/overflown 
>>>>>>>> refcounts
>>>>>>>> will be considered as permanent.
>>>>>>>
>>>>>>> Okay - note this bug is about an assertion failure :)
>>>>>>>
>>>>>>> So there are two problems:
>>>>>>>
>>>>>>> 1. There is a race related to the assert that the 
>>>>>>> Atomic::add(-1) can
>>>>>>> fix.
>>>>>>>
>>>>>>> 2. There is the product-mode unconditional decrementing of the
>>>>>>> refcount which will mark everything as "permanent". I haven't 
>>>>>>> see any
>>>>>>> suggestions to address that existing issue. One obvious 
>>>>>>> suggestion is
>>>>>>> to use a different value than -1 so it doesn't easily get hit by
>>>>>>> unexpected underflow.
>>>>>>>
>>>>>>
>>>>>> Here's an updated version that's an improvement over my original 
>>>>>> patch:
>>>>>>
>>>>>> + A negative refcount means permanent symbols.
>>>>>> + Only non-permanent symbols can be incremented/decremented
>>>>>> + Underflow is detected by 0 -> -1 transition. Real underflows are
>>>>>> always caught.
>>>>>> + Theoretically there's still a race condition for false asserts:
>>>>>>
>>>>>> [1] thread A observes that refcount is 0x7fff, proceeds to 
>>>>>> decrement it
>>>>>> [2] but in the mean time, around 32768 threads come in. They all 
>>>>>> observe
>>>>>> that the refcount is non-negative, and then all proceed to 
>>>>>> increment the
>>>>>> refcount to 0x0.
>>>>>> [3] thread A decrements the refcount and observes that refcount 
>>>>>> is -1
>>>>>> afterwards
>>>>>>
>>>>>> but I think this can be safely ignored :-)
>>>>>>
>>>>>> http://cr.openjdk.java.net/~iklam/jdk9/8161280-symbol-refcount-underflow.v02/ 
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> I like this version as it doesn't change the logic for the 
>>>>>> product VM
>>>>>> and only limits the possibility of false asserts.
>>>>>>
>>>>>> Thanks
>>>>>> - Ioi
>>>>>>
>>>>>>> David
>>>>>>>
>>>>>>>> - Ioi
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> David
>>>>>>>>> -----
>>>>>>>>>
>>>>>>>>>> The fundamental problem is we need a two-step operation and 
>>>>>>>>>> we avoid
>>>>>>>>>> proper synchronization (for performance/avoiding deadlock/etc).
>>>>>>>>>> So we
>>>>>>>>>> will always have a race condition somewhere, and we just need to
>>>>>>>>>> allow
>>>>>>>>>> only the benign ones.
>>>>>>>>>>
>>>>>>>>>> - Ioi
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> David
>>>>>>>>>>>
>>>>>>>>>>>> However, it's not worse than before. The old version also 
>>>>>>>>>>>> has a
>>>>>>>>>>>> race
>>>>>>>>>>>> condition:
>>>>>>>>>>>>
>>>>>>>>>>>>     refcount is 0
>>>>>>>>>>>>     thread A decrements
>>>>>>>>>>>>     thread B increments
>>>>>>>>>>>>     thread A checks for underflow
>>>>>>>>>>>>
>>>>>>>>>>>> the decrementing thread will read _refcount==0 at the end 
>>>>>>>>>>>> so it
>>>>>>>>>>>> won't
>>>>>>>>>>>> detect the (transient) underflow.
>>>>>>>>>>>>
>>>>>>>>>>>> I think the failure to detect underflow is fine, since this
>>>>>>>>>>>> happens
>>>>>>>>>>>> only
>>>>>>>>>>>> with concurrent access. The kinds of underflow that we are
>>>>>>>>>>>> interested
>>>>>>>>>>>> usually can be caught in single-threaded situations.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> - Ioi
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 8/23/16 4:24 PM, Coleen Phillimore wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> This doesn't make sense for me and I have to go in gdb to
>>>>>>>>>>>>> print out
>>>>>>>>>>>>> what -16384 is.   It appears that this is trying to detect
>>>>>>>>>>>>> that we
>>>>>>>>>>>>> went below zero from zero, which is an error, but this isn't
>>>>>>>>>>>>> clear at
>>>>>>>>>>>>> all.
>>>>>>>>>>>>>
>>>>>>>>>>>>> It seems that
>>>>>>>>>>>>>
>>>>>>>>>>>>>    if (_refcount >= 0) {
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Should be > 0 and we should assert if this is ever zero 
>>>>>>>>>>>>> instead,
>>>>>>>>>>>>> and
>>>>>>>>>>>>> allow anything negative to mean that this count has gone
>>>>>>>>>>>>> immortal.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Kim thought it should use CAS rather than atomic increment 
>>>>>>>>>>>>> and
>>>>>>>>>>>>> decrement, but maybe that isn't necessary, especially 
>>>>>>>>>>>>> since there
>>>>>>>>>>>>> isn't a short version of cmpxchg.
>>>>>>>>>>>>>
>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>> Coleen
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 8/23/16 6:01 AM, Ioi Lam wrote:
>>>>>>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8161280
>>>>>>>>>>>>>> http://cr.openjdk.java.net/~iklam/jdk9/8161280-symbol-refcount-underflow.v01/ 
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Summary:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The test was loading a lot of JCK classes into the same VM.
>>>>>>>>>>>>>> Many of
>>>>>>>>>>>>>> the JCK classes refer to "javasoft/sqe/javatest/Status", 
>>>>>>>>>>>>>> so the
>>>>>>>>>>>>>> refcount (a signed short integer) of this Symbol would 
>>>>>>>>>>>>>> run up
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>> past 0x7fff.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The assert was caused by a race condition: the refcount 
>>>>>>>>>>>>>> started
>>>>>>>>>>>>>> with
>>>>>>>>>>>>>> a large (16-bit) positive value such as 0x7fff, one 
>>>>>>>>>>>>>> thread is
>>>>>>>>>>>>>> decrementing and several other threads are incrementing. The
>>>>>>>>>>>>>> refcount
>>>>>>>>>>>>>> will end up being 0x8000 or slightly higher (limited to the
>>>>>>>>>>>>>> number of
>>>>>>>>>>>>>> concurrent threads that are running within a small window of
>>>>>>>>>>>>>> several
>>>>>>>>>>>>>> instructions in the decrementing thread, so most likely it
>>>>>>>>>>>>>> will be
>>>>>>>>>>>>>> 0x800?).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As a result, the decrementing thread found that the 
>>>>>>>>>>>>>> refecount is
>>>>>>>>>>>>>> negative after the operation, and thought that an 
>>>>>>>>>>>>>> underflow had
>>>>>>>>>>>>>> happened.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The fix is to ignore any value that may appear in the 
>>>>>>>>>>>>>> [0x8000 -
>>>>>>>>>>>>>> 0xbfff] range and do not flag these as underflows (since 
>>>>>>>>>>>>>> they
>>>>>>>>>>>>>> are
>>>>>>>>>>>>>> most likely overflows -- overflows are already handled by
>>>>>>>>>>>>>> making the
>>>>>>>>>>>>>> Symbol permanent).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> - Ioi
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>
>



More information about the hotspot-runtime-dev mailing list