RFR(s): 8055239: assert(_thread == Thread::current()->osthread()) failed: The PromotionFailedInfo should be thread local.

Mon Nov 24 09:10:23 UTC 2014

On 2014-11-24 08:32, Jesper Wilhelmsson wrote:
> Hi,
>
> I think the thread info can be useful in some cases, for instance to 
> see if the same thread gets multiple promotion failures or if it is 
> separate threads. If it is decided to remove this field please also 
> consider how to treat it in G1 where it was supposed to be added. 
> https://bugs.openjdk.java.net/browse/JDK-8012217
>
> Also, consult Bengt before removing the field. I remember that he had 
> opinions around the _threads field when I implemented this.

Here's what I remember from that discussion.

The promotion failed events are not sent at the time or by the thread 
where the promotion failure happens. Instead the information about the 
promotion failure is collected into a PromotionFailedInfo object for 
each thread (stored in ParScanThreadState). The main GC thread then 
iterates over all thread states and sends the actual events at the end 
of a GC. See ParScanThreadStateSet::trace_promotion_failed().

This means that the thread that sends the events is the main GC thread, 
which is not the same as the thread that experienced the promotion 
failure (which is one or several of the GC worker threads).

So, removing the _thread instance variable can not be done without 
affecting how JFR works. I think it may be useful information to have 
the thread information available, but one alternative would be to just 
not report a thread with the promotion failure.

A separate approach would be to send the promotion failed events from 
the GC worker threads rather than postponing the event to the end of the 
GC. That is also a slight change to the current JFR implementation but 
may be considered an improvement. However, my recollection is that we 
opted for the postponed solution because there were some issues with 
trying to send the events at the time when they occur. It may just have 
been that it was difficult to propagate the GC tracer to the right places.

Hths,
Bengt

> /Jesper
>
>
> Sangheon Kim skrev 24/11/14 06:47:
>> Hi Kim,
>>
>> Thank you for reviewing this!
>> And this detailed answer.
>>
>> On 11/22/2014 12:45 PM, Kim Barrett wrote:
>>> On Nov 21, 2014, at 12:35 AM, Sangheon Kim <sangheon.kim at oracle.com> 
>>> wrote:
>>>> Please review this change to reset PromotionFailedInfo before use:
>>>> https://bugs.openjdk.java.net/browse/JDK-8055239
>>>>
>>>> Webrev:
>>>> http://cr.openjdk.java.net/~sangheki/8055239/webrev.00/
>>>>
>>>> Summary:
>>>> This issue happens only with ParallelRefProcEnabled.
>>>> And this assert happens as the current thread (when second 
>>>> promotion failure happened) is different from saved thread (when 
>>>> first promotion failure happened).
>>>>  From ParNewGeneration::collect(), there are two mt processing, for 
>>>> live object processing (by ParNewGenTask) and reference processing 
>>>> (by ParNewRefEnqueueTaskProxy). When first promotion failure 
>>>> happened from ParNewGenTask, thread pointer will be stored by 
>>>> PromotionFailedInfo::register_copy_failure(). And at second 
>>>> failure, assert happens as thread pointer is not guaranteed to be 
>>>> same.
>>>> As the thread counts increase, its frequency become higher.
>>>>
>>>> Fix:
>>>> Added ParScanThreadStateSet::trace_promotion_failed() to trace and 
>>>> reset promotion failure information instance.
>>>>
>>>> Test:
>>>> - jprt
>>>>
>>>> Thanks,
>>>> Sangheon
>>> This proposed change doesn't address what I think is the root cause of
>>> the problem, which appears to me to be the semantically questionable
>>> use of a single ParScanThreadStateSet for what are conceptually two
>>> distinct sets of threads - the original generation scanning threads
>>> and the reference processing threads.  (The same underlying OS threads
>>> are being used for both, but they are being run with completely
>>> different "tasks", and the ThreadState info is really per-task)
>> I think my proposal is same way as your second opinion.
>>>
>>> Besides the unexpected reassociation of PromotionFailedInfo objects
>>> (from owning ParScanThreadState objects) with new OS threads, it looks
>>> like there are probably other things that are "messed up" by this
>>> reuse, e.g. timing information and TASKQUEUE_STATS.
>> Thanks for catching up about TASKQUEUE_STATS.
>> Reset is really needed.
>>>
>>> Creating a second state set for use by reference processing would
>>> require changing the signature of handle_promotion_failed to deal with
>>> two sets, but that function is private and only called in one place,
>>> so that shouldn't be hard.
>> I didn't think about creating second ThreadState as original 
>> implementation is already considered about reusing it.
>> ParScanThreadStateSet::reset() is called before second use.
>>>
>>> The real cost would be in the runtime construction of a second state
>>> set.  I'm not sure whether that would be important or not.  It is also
>>> only needed when the ParallelRefProcEnabled option is true, since
>>> that's the case where the re-use occurs, which adds a bit of
>>> complexity.
>> Agree if second ThreadState is needed.
>>>
>>> One option might be to capture the information to be reported in a new
>>> object and reinitialize the ThreadState objects, capture additional
>>> information after the reuse (if needed), and pass both sets of
>>> captured information to handle_promotion_failed.
>> Yes, this a little bit same as mine.
>> The difference is that I wanted to report promotion failure before 
>> reuse not together.
>> And I think ThreadState is already reset just above line of my 
>> change. (ParScanThreadStateSet::reset())
>> So I just added to reset/trace PromotionFailedInfo.
>>
>> Can you suggest some 'additional information'? It would be nice if 
>> add more information.
>>
>>>
>>> An entirely different approach to addressing just the proximate
>>> failure would be to eliminate the _thread member from
>>> PromotionFailureInfo and deal with the consequences of that change.  I
>>> can't see any utility in reporting the OS thread associated with an
>>> internal worker task when reporting promotion failure statistics.  The
>>> relevant context is long gone, after all.  And really, the ThreadState
>>> objects exist to avoid race conditions amongst the multiple
>>> threads. [I think what *should* be reported is a combined summary
>>> rather than per-thread promotion failure statistics.]
>>>
>>> Removing the _thread member has the added benefit of being a code
>>> removal and simplficiation change.  The downside is that it affects a
>>> messaging API; I'm not sure what would be involved in dealing with
>>> that.  The needed changes to the (open) jdk look straightforward:
>>>
>>> remove thread
>>> - src/share/vm/gc_implementation/shared/copyFailedInfo.hpp
>>>    Remove _thread member and associated code from PromotionFailedInfo.
>>>
>>> - hotspot/src/share/vm/gc_implementation/share/gcTraceSend.cpp
>>>    Remove thread field from constructed message.
>>>
>>> - hotspot/src/share/vm/trace/trace.xml
>>>    Remove thread field from PromotionFailed event description.
>>>    This is the API change that needs to be appropriately handled.
>> Oh. I didn't think about this approach since I do believe there are 
>> some reason of saving thread id and deliver it by 
>> YoungGCTracer::report_promotion_failed.
>> This issue is occurred because of thread id but it doesn't mean that 
>> this thread id information is not needed.
>>
>> Thanks,
>> Sangheon
>>
>