RFR(s): 8055239: assert(_thread == Thread::current()->osthread()) failed: The PromotionFailedInfo should be thread local.

Mon Nov 24 05:47:03 UTC 2014

Hi Kim,

Thank you for reviewing this!
And this detailed answer.

On 11/22/2014 12:45 PM, Kim Barrett wrote:
> On Nov 21, 2014, at 12:35 AM, Sangheon Kim <sangheon.kim at oracle.com> wrote:
>> Please review this change to reset PromotionFailedInfo before use:
>> https://bugs.openjdk.java.net/browse/JDK-8055239
>>
>> Webrev:
>> http://cr.openjdk.java.net/~sangheki/8055239/webrev.00/
>>
>> Summary:
>> This issue happens only with ParallelRefProcEnabled.
>> And this assert happens as the current thread (when second promotion failure happened) is different from saved thread (when first promotion failure happened).
>>  From ParNewGeneration::collect(), there are two mt processing, for live object processing (by ParNewGenTask) and reference processing (by ParNewRefEnqueueTaskProxy). When first promotion failure happened from ParNewGenTask, thread pointer will be stored by PromotionFailedInfo::register_copy_failure(). And at second failure, assert happens as thread pointer is not guaranteed to be same.
>> As the thread counts increase, its frequency become higher.
>>
>> Fix:
>> Added ParScanThreadStateSet::trace_promotion_failed() to trace and reset promotion failure information instance.
>>
>> Test:
>> - jprt
>>
>> Thanks,
>> Sangheon
> This proposed change doesn't address what I think is the root cause of
> the problem, which appears to me to be the semantically questionable
> use of a single ParScanThreadStateSet for what are conceptually two
> distinct sets of threads - the original generation scanning threads
> and the reference processing threads.  (The same underlying OS threads
> are being used for both, but they are being run with completely
> different "tasks", and the ThreadState info is really per-task)
I think my proposal is same way as your second opinion.
>
> Besides the unexpected reassociation of PromotionFailedInfo objects
> (from owning ParScanThreadState objects) with new OS threads, it looks
> like there are probably other things that are "messed up" by this
> reuse, e.g. timing information and TASKQUEUE_STATS.
Thanks for catching up about TASKQUEUE_STATS.
Reset is really needed.
>
> Creating a second state set for use by reference processing would
> require changing the signature of handle_promotion_failed to deal with
> two sets, but that function is private and only called in one place,
> so that shouldn't be hard.
I didn't think about creating second ThreadState as original 
implementation is already considered about reusing it.
ParScanThreadStateSet::reset() is called before second use.
>
> The real cost would be in the runtime construction of a second state
> set.  I'm not sure whether that would be important or not.  It is also
> only needed when the ParallelRefProcEnabled option is true, since
> that's the case where the re-use occurs, which adds a bit of
> complexity.
Agree if second ThreadState is needed.
>
> One option might be to capture the information to be reported in a new
> object and reinitialize the ThreadState objects, capture additional
> information after the reuse (if needed), and pass both sets of
> captured information to handle_promotion_failed.
Yes, this a little bit same as mine.
The difference is that I wanted to report promotion failure before reuse 
not together.
And I think ThreadState is already reset just above line of my change. 
(ParScanThreadStateSet::reset())
So I just added to reset/trace PromotionFailedInfo.

Can you suggest some 'additional information'? It would be nice if add 
more information.

>
> An entirely different approach to addressing just the proximate
> failure would be to eliminate the _thread member from
> PromotionFailureInfo and deal with the consequences of that change.  I
> can't see any utility in reporting the OS thread associated with an
> internal worker task when reporting promotion failure statistics.  The
> relevant context is long gone, after all.  And really, the ThreadState
> objects exist to avoid race conditions amongst the multiple
> threads. [I think what *should* be reported is a combined summary
> rather than per-thread promotion failure statistics.]
>
> Removing the _thread member has the added benefit of being a code
> removal and simplficiation change.  The downside is that it affects a
> messaging API; I'm not sure what would be involved in dealing with
> that.  The needed changes to the (open) jdk look straightforward:
>
> remove thread
> - src/share/vm/gc_implementation/shared/copyFailedInfo.hpp
>    Remove _thread member and associated code from PromotionFailedInfo.
>
> - hotspot/src/share/vm/gc_implementation/share/gcTraceSend.cpp
>    Remove thread field from constructed message.
>
> - hotspot/src/share/vm/trace/trace.xml
>    Remove thread field from PromotionFailed event description.
>    This is the API change that needs to be appropriately handled.
Oh. I didn't think about this approach since I do believe there are some 
reason of saving thread id and deliver it by 
YoungGCTracer::report_promotion_failed.
This issue is occurred because of thread id but it doesn't mean that 
this thread id information is not needed.

Thanks,
Sangheon