[14] RFR 8235829: graal crashes with Zombie.java test
coleen.phillimore at oracle.com
coleen.phillimore at oracle.com
Wed Dec 18 22:06:08 UTC 2019
Thanks Serguei!
Coleen
On 12/18/19 1:33 PM, serguei.spitsyn at oracle.com wrote:
> Hi Coleen,
>
> Just wanted to confirm the webrev V2 version looks okay to me.
> Sorry for replying on the wrong mailing thread.
>
> Thanks,
> Serguei
>
>
> On 12/18/19 08:42, coleen.phillimore at oracle.com wrote:
>>
>>
>> On 12/18/19 8:45 AM, David Holmes wrote:
>>> Thanks for the additional info Coleen!
>>>
>>> Just to add a bit more to the initialization history. The
>>> ServiceThread is a generalization of the LowMemoryDetectorThread
>>> that was part of the management API, and so it was initialized in
>>> Management::initialize. When it turned into the ServiceThread - to
>>> process JVMTI deferred events in addition to the low-memory-detector
>>> events - the initialization placement remained the same. Then later
>>> the INCLUDE_MANAGEMENT guards were added (JDK-7189254, October
>>> 2012). Later still we started adding other items of work for the
>>> ServiceThread. The earliest was the AllocationContextService
>>> notification in September 2014 but as that no longer exists I can't
>>> tell if that was the first non-management related use. Then the
>>> StringTable use was added 18 months ago - which definitely was
>>> outside the realm of the management API. So that is when the
>>> MinimalVM was first "broken". So it is good that is fixed.
>>>
>>> With regard to the placement in the initialization order, my
>>> remaining concern was with JVMTI event processing that might happen
>>> via events generated very early in the init sequence. But you have
>>> now modified things so that we will only process events in the LIVE
>>> phase, which only activates after all the class library
>>> initialization is complete.
>>>
>>> So overall I'm no longer significantly concerned about the change to
>>> the initialization order as I think you have it all covered. Thanks
>>> for bearing with me and all the off-list discussion.
>>
>> Thank you for having this discussion with me and provoking me to
>> recheck the ServiceThread. I think we can do further work to
>> future-proof initialization order but the design needs to be improved.
>>
>> Thanks for reviewing,
>> Coleen
>>>
>>> Cheers,
>>> David
>>> -----
>>>
>>> On 18/12/2019 1:27 am, coleen.phillimore at oracle.com wrote:
>>>>
>>>>
>>>> On 12/16/19 11:04 PM, David Holmes wrote:
>>>>> Clarification ...
>>>>>
>>>>> On 17/12/2019 12:40 pm, coleen.phillimore at oracle.com wrote:
>>>>>>
>>>>>> Short answer below.
>>>>>>
>>>>>> On 12/16/19 5:51 PM, David Holmes wrote:
>>>>>>> Hi Coleen,
>>>>>>>
>>>>>>> A quick initial response ...
>>>>>>>
>>>>>>> On 16/12/2019 11:26 pm, coleen.phillimore at oracle.com wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 12/16/19 8:04 AM, David Holmes wrote:
>>>>>>>>> Hi Coleen,
>>>>>>>>>
>>>>>>>>> On 16/12/2019 9:41 pm, coleen.phillimore at oracle.com wrote:
>>>>>>>>>> Summary: Start ServiceThread before compiler threads, and run
>>>>>>>>>> nmethod barriers for zgc before adding to the service thread
>>>>>>>>>> queue, or posting the events on the java thread queue.
>>>>>>>>>
>>>>>>>>> I can't comment on most of this but the earlier starting of
>>>>>>>>> the service thread has some concerns:
>>>>>>>>>
>>>>>>>>> - there is a lot of JDK level initialization which now will
>>>>>>>>> not have happened before the service thread is started and it
>>>>>>>>> is far from obvious that all possible initialization
>>>>>>>>> dependencies will be satisfied
>>>>>>>>
>>>>>>>> I agree that the order of initialization is very sensitive.
>>>>>>>> From the actions that the service thread does, the one that I
>>>>>>>> found was a problem was that events were posted before the LIVE
>>>>>>>> phase (see comment in has_events()), which could have happened
>>>>>>>> with the existing code, but the window for the race is a lot
>>>>>>>> smaller. The other actions can be run if there's a GC before
>>>>>>>> initialization but would be a bug in the initialization code,
>>>>>>>> and I didn't find these bugs in all my testing. There are some
>>>>>>>> ordering dependencies that do have odd side effects (between
>>>>>>>> the compiler thread startup and initialization jsr292 classes)
>>>>>>>> which have comments. This patch doesn't touch those.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> - current starting of the service thread in
>>>>>>>>> Management::initialize is guarded by "#if INCLUDE_MANAGEMENT",
>>>>>>>>> but now you are starting the service thread unconditionally
>>>>>>>>> for all builds. Hmm just saw your latest comment to the bug
>>>>>>>>> report - so the service thread is now (for quite some time?)
>>>>>>>>> being used for other than management tasks and so should
>>>>>>>>> always be present even if INCLUDE_MANAGEMENT is not enabled.
>>>>>>>>> Is that sufficient or are there likely to be other changes
>>>>>>>>> needed to actually ensure that all works correctly e.g. any
>>>>>>>>> code the service thread executes that is only defined for
>>>>>>>>> INCLUDE_MANAGEMENT will need to be compiled out explicitly.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I asked Jie offline to check the minimal build. I don't think
>>>>>>>> there are other INCLUDE_MANAGEMENT actions in the service
>>>>>>>> thread and I'm not sure why it was initialized there in the
>>>>>>>> first place. The minimal vm would have been broken ie.
>>>>>>>> hashtables would not have been cleaned up, etc, but I'm not
>>>>>>>> sure how well that is tested or if one would notice.
>>>>>>>>> - the service thread and the notification thread are (were?)
>>>>>>>>> closely related but now started at completely different times
>>>>>>>>
>>>>>>>> The notification thread is limited to "services" so it makes
>>>>>>>> sense where it is. The ServiceThread does lots of other
>>>>>>>> things. Maybe it needs renaming in 15.
>>>>>>>>>
>>>>>>>>> The bug report states the problem as:
>>>>>>>>>
>>>>>>>>> "The graal crash is because compiled_method_load events are
>>>>>>>>> added to the ServiceThread's deferred event queue before the
>>>>>>>>> ServiceThread is created so are not walked to keep them from
>>>>>>>>> being zombied."
>>>>>>>>>
>>>>>>>>> so why isn't the solution to ensure the deferred event queue
>>>>>>>>> is walked? I'm not clear how starting the service thread
>>>>>>>>> relates to walking the queue.
>>>>>>>>>
>>>>>>>>
>>>>>>>> The service thread is responsible for walking the deferred
>>>>>>>> event queue. See ServiceThread::oops_do/nmethods_do. The
>>>>>>>> design could be changed to have some global walk somewhere of
>>>>>>>> this queue, but essentially this queue is processed by the
>>>>>>>> service thread.
>>>>>>>
>>>>>>> Sorry I don't follow. I thought "oops_do" and friends are for
>>>>>>> the GC threads and/or VMThread to call to process oops when GC
>>>>>>> updates them.
>>>>>>
>>>>>> The oops_do and nmethods_do() can be called by a thread walk in
>>>>>> handshakes (by the sweeper thread) and by parallel GC thread
>>>>>> walks. There isn't a single entry to do the thread-specific
>>>>>> closures that we need to do for these deferred event queues. I
>>>>>> tried a version that walked the queues with a static call but
>>>>>> missed some places where it would be needed to make this call
>>>>>> (didn't work). Keeping this associated with the ServiceThread
>>>>>> simplifies a lot.
>>>>>
>>>>> Just to clarify that further, the thread walk requires the thread
>>>>> appears in ALL_JAVA_THREADS but that only happens after the
>>>>> ServiceThread has been started. So in essence we don't really need
>>>>> the ServiceThread to have commenced execution earlier, but we need
>>>>> it to have been created. Those two steps are combined in practice.
>>>>
>>>> Yes. Then the ServiceThread waits on the Service_lock until
>>>> notified by these events:
>>>>
>>>> while (((sensors_changed = (!UseNotificationThread &&
>>>> LowMemoryDetector::has_pending_requests())) |
>>>> (has_jvmti_events =
>>>> _jvmti_service_queue.has_events()) |
>>>> (has_gc_notification_event = (!UseNotificationThread
>>>> && GCNotifier::has_event())) |
>>>> (has_dcmd_notification_event =
>>>> (!UseNotificationThread &&
>>>> DCmdFactory::has_pending_jmx_notification())) |
>>>> (stringtable_work = StringTable::has_work()) |
>>>> (symboltable_work = SymbolTable::has_work()) |
>>>> (resolved_method_table_work =
>>>> ResolvedMethodTable::has_work()) |
>>>> (thread_id_table_work = ThreadIdTable::has_work()) |
>>>> (protection_domain_table_work =
>>>> SystemDictionary::pd_cache_table()->has_work()) |
>>>> (oopstorage_work =
>>>> OopStorage::has_cleanup_work_and_reset())
>>>> ) == 0) {
>>>>
>>>> The first, third and fourth events are from management.cpp events
>>>> that were initialized after the ServiceThread was started.
>>>> The second event I have changed, to wait until LIVE phase to return
>>>> true.
>>>> The stringtable, symboltable, resolved_method_table, thread_id and
>>>> pd table have static _has_work variables initialized to false.
>>>> The oopstorage_work has similar, but may update a time-based
>>>> counter a bit earlier with the service thread starting earlier. I
>>>> think this is harmless.
>>>>
>>>> It is possible that after the service thread starts and before the
>>>> compiler thread starts, there could be a GC that notifies the
>>>> stringtable to clean up. This seems like a good thing that the GC
>>>> would clean up these tables with this order. I ran the tier4 graal
>>>> tests and there were no failures.
>>>>
>>>> Thanks,
>>>> Coleen
>>>>>
>>>>> Cheers,
>>>>> David
>>>>>
>>>>>> thanks,
>>>>>> Coleen
>>>>>>
>>>>>>>
>>>>>>> David
>>>>>>> -----
>>>>>>>
>>>>>>>> I had an additional change to make the queue non-static but
>>>>>>>> want to limit the change at this point.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Coleen
>>>>>>>>> Thanks,
>>>>>>>>> David
>>>>>>>>>
>>>>>>>>>> See bug for description of the problems found with the new
>>>>>>>>>> Zombie.java test.
>>>>>>>>>>
>>>>>>>>>> open webrev at
>>>>>>>>>> http://cr.openjdk.java.net/~coleenp/2019/8235829.01/webrev
>>>>>>>>>> bug link https://bugs.openjdk.java.net/browse/JDK-8235829
>>>>>>>>>>
>>>>>>>>>> Ran tier1 all platforms, and tier2-8 testing, as well as
>>>>>>>>>> rerunning original test failure from bug
>>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8173361.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Coleen
>>>>>>>>
>>>>>>
>>>>
>>
>
More information about the serviceability-dev
mailing list