[14] RFR 8235829: graal crashes with Zombie.java test
coleen.phillimore at oracle.com
coleen.phillimore at oracle.com
Wed Dec 18 16:42:45 UTC 2019
On 12/18/19 8:45 AM, David Holmes wrote:
> Thanks for the additional info Coleen!
>
> Just to add a bit more to the initialization history. The
> ServiceThread is a generalization of the LowMemoryDetectorThread that
> was part of the management API, and so it was initialized in
> Management::initialize. When it turned into the ServiceThread - to
> process JVMTI deferred events in addition to the low-memory-detector
> events - the initialization placement remained the same. Then later
> the INCLUDE_MANAGEMENT guards were added (JDK-7189254, October 2012).
> Later still we started adding other items of work for the
> ServiceThread. The earliest was the AllocationContextService
> notification in September 2014 but as that no longer exists I can't
> tell if that was the first non-management related use. Then the
> StringTable use was added 18 months ago - which definitely was outside
> the realm of the management API. So that is when the MinimalVM was
> first "broken". So it is good that is fixed.
>
> With regard to the placement in the initialization order, my remaining
> concern was with JVMTI event processing that might happen via events
> generated very early in the init sequence. But you have now modified
> things so that we will only process events in the LIVE phase, which
> only activates after all the class library initialization is complete.
>
> So overall I'm no longer significantly concerned about the change to
> the initialization order as I think you have it all covered. Thanks
> for bearing with me and all the off-list discussion.
Thank you for having this discussion with me and provoking me to recheck
the ServiceThread. I think we can do further work to future-proof
initialization order but the design needs to be improved.
Thanks for reviewing,
Coleen
>
> Cheers,
> David
> -----
>
> On 18/12/2019 1:27 am, coleen.phillimore at oracle.com wrote:
>>
>>
>> On 12/16/19 11:04 PM, David Holmes wrote:
>>> Clarification ...
>>>
>>> On 17/12/2019 12:40 pm, coleen.phillimore at oracle.com wrote:
>>>>
>>>> Short answer below.
>>>>
>>>> On 12/16/19 5:51 PM, David Holmes wrote:
>>>>> Hi Coleen,
>>>>>
>>>>> A quick initial response ...
>>>>>
>>>>> On 16/12/2019 11:26 pm, coleen.phillimore at oracle.com wrote:
>>>>>>
>>>>>>
>>>>>> On 12/16/19 8:04 AM, David Holmes wrote:
>>>>>>> Hi Coleen,
>>>>>>>
>>>>>>> On 16/12/2019 9:41 pm, coleen.phillimore at oracle.com wrote:
>>>>>>>> Summary: Start ServiceThread before compiler threads, and run
>>>>>>>> nmethod barriers for zgc before adding to the service thread
>>>>>>>> queue, or posting the events on the java thread queue.
>>>>>>>
>>>>>>> I can't comment on most of this but the earlier starting of the
>>>>>>> service thread has some concerns:
>>>>>>>
>>>>>>> - there is a lot of JDK level initialization which now will not
>>>>>>> have happened before the service thread is started and it is far
>>>>>>> from obvious that all possible initialization dependencies will
>>>>>>> be satisfied
>>>>>>
>>>>>> I agree that the order of initialization is very sensitive. From
>>>>>> the actions that the service thread does, the one that I found
>>>>>> was a problem was that events were posted before the LIVE phase
>>>>>> (see comment in has_events()), which could have happened with the
>>>>>> existing code, but the window for the race is a lot smaller.
>>>>>> The other actions can be run if there's a GC before
>>>>>> initialization but would be a bug in the initialization code, and
>>>>>> I didn't find these bugs in all my testing. There are some
>>>>>> ordering dependencies that do have odd side effects (between the
>>>>>> compiler thread startup and initialization jsr292 classes) which
>>>>>> have comments. This patch doesn't touch those.
>>>>>>
>>>>>>>
>>>>>>> - current starting of the service thread in
>>>>>>> Management::initialize is guarded by "#if INCLUDE_MANAGEMENT",
>>>>>>> but now you are starting the service thread unconditionally for
>>>>>>> all builds. Hmm just saw your latest comment to the bug report -
>>>>>>> so the service thread is now (for quite some time?) being used
>>>>>>> for other than management tasks and so should always be present
>>>>>>> even if INCLUDE_MANAGEMENT is not enabled. Is that sufficient or
>>>>>>> are there likely to be other changes needed to actually ensure
>>>>>>> that all works correctly e.g. any code the service thread
>>>>>>> executes that is only defined for INCLUDE_MANAGEMENT will need
>>>>>>> to be compiled out explicitly.
>>>>>>>
>>>>>>
>>>>>> I asked Jie offline to check the minimal build. I don't think
>>>>>> there are other INCLUDE_MANAGEMENT actions in the service thread
>>>>>> and I'm not sure why it was initialized there in the first place.
>>>>>> The minimal vm would have been broken ie. hashtables would not
>>>>>> have been cleaned up, etc, but I'm not sure how well that is
>>>>>> tested or if one would notice.
>>>>>>> - the service thread and the notification thread are (were?)
>>>>>>> closely related but now started at completely different times
>>>>>>
>>>>>> The notification thread is limited to "services" so it makes
>>>>>> sense where it is. The ServiceThread does lots of other things.
>>>>>> Maybe it needs renaming in 15.
>>>>>>>
>>>>>>> The bug report states the problem as:
>>>>>>>
>>>>>>> "The graal crash is because compiled_method_load events are
>>>>>>> added to the ServiceThread's deferred event queue before the
>>>>>>> ServiceThread is created so are not walked to keep them from
>>>>>>> being zombied."
>>>>>>>
>>>>>>> so why isn't the solution to ensure the deferred event queue is
>>>>>>> walked? I'm not clear how starting the service thread relates to
>>>>>>> walking the queue.
>>>>>>>
>>>>>>
>>>>>> The service thread is responsible for walking the deferred event
>>>>>> queue. See ServiceThread::oops_do/nmethods_do. The design
>>>>>> could be changed to have some global walk somewhere of this
>>>>>> queue, but essentially this queue is processed by the service
>>>>>> thread.
>>>>>
>>>>> Sorry I don't follow. I thought "oops_do" and friends are for the
>>>>> GC threads and/or VMThread to call to process oops when GC updates
>>>>> them.
>>>>
>>>> The oops_do and nmethods_do() can be called by a thread walk in
>>>> handshakes (by the sweeper thread) and by parallel GC thread walks.
>>>> There isn't a single entry to do the thread-specific closures that
>>>> we need to do for these deferred event queues. I tried a version
>>>> that walked the queues with a static call but missed some places
>>>> where it would be needed to make this call (didn't work). Keeping
>>>> this associated with the ServiceThread simplifies a lot.
>>>
>>> Just to clarify that further, the thread walk requires the thread
>>> appears in ALL_JAVA_THREADS but that only happens after the
>>> ServiceThread has been started. So in essence we don't really need
>>> the ServiceThread to have commenced execution earlier, but we need
>>> it to have been created. Those two steps are combined in practice.
>>
>> Yes. Then the ServiceThread waits on the Service_lock until notified
>> by these events:
>>
>> while (((sensors_changed = (!UseNotificationThread &&
>> LowMemoryDetector::has_pending_requests())) |
>> (has_jvmti_events = _jvmti_service_queue.has_events()) |
>> (has_gc_notification_event = (!UseNotificationThread
>> && GCNotifier::has_event())) |
>> (has_dcmd_notification_event = (!UseNotificationThread
>> && DCmdFactory::has_pending_jmx_notification())) |
>> (stringtable_work = StringTable::has_work()) |
>> (symboltable_work = SymbolTable::has_work()) |
>> (resolved_method_table_work =
>> ResolvedMethodTable::has_work()) |
>> (thread_id_table_work = ThreadIdTable::has_work()) |
>> (protection_domain_table_work =
>> SystemDictionary::pd_cache_table()->has_work()) |
>> (oopstorage_work =
>> OopStorage::has_cleanup_work_and_reset())
>> ) == 0) {
>>
>> The first, third and fourth events are from management.cpp events
>> that were initialized after the ServiceThread was started.
>> The second event I have changed, to wait until LIVE phase to return
>> true.
>> The stringtable, symboltable, resolved_method_table, thread_id and pd
>> table have static _has_work variables initialized to false.
>> The oopstorage_work has similar, but may update a time-based counter
>> a bit earlier with the service thread starting earlier. I think this
>> is harmless.
>>
>> It is possible that after the service thread starts and before the
>> compiler thread starts, there could be a GC that notifies the
>> stringtable to clean up. This seems like a good thing that the GC
>> would clean up these tables with this order. I ran the tier4 graal
>> tests and there were no failures.
>>
>> Thanks,
>> Coleen
>>>
>>> Cheers,
>>> David
>>>
>>>> thanks,
>>>> Coleen
>>>>
>>>>>
>>>>> David
>>>>> -----
>>>>>
>>>>>> I had an additional change to make the queue non-static but want
>>>>>> to limit the change at this point.
>>>>>>
>>>>>> Thanks,
>>>>>> Coleen
>>>>>>> Thanks,
>>>>>>> David
>>>>>>>
>>>>>>>> See bug for description of the problems found with the new
>>>>>>>> Zombie.java test.
>>>>>>>>
>>>>>>>> open webrev at
>>>>>>>> http://cr.openjdk.java.net/~coleenp/2019/8235829.01/webrev
>>>>>>>> bug link https://bugs.openjdk.java.net/browse/JDK-8235829
>>>>>>>>
>>>>>>>> Ran tier1 all platforms, and tier2-8 testing, as well as
>>>>>>>> rerunning original test failure from bug
>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8173361.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Coleen
>>>>>>
>>>>
>>
More information about the serviceability-dev
mailing list