[14] RFR 8235829: graal crashes with Zombie.java test
coleen.phillimore at oracle.com
coleen.phillimore at oracle.com
Tue Dec 17 15:27:08 UTC 2019
On 12/16/19 11:04 PM, David Holmes wrote:
> Clarification ...
>
> On 17/12/2019 12:40 pm, coleen.phillimore at oracle.com wrote:
>>
>> Short answer below.
>>
>> On 12/16/19 5:51 PM, David Holmes wrote:
>>> Hi Coleen,
>>>
>>> A quick initial response ...
>>>
>>> On 16/12/2019 11:26 pm, coleen.phillimore at oracle.com wrote:
>>>>
>>>>
>>>> On 12/16/19 8:04 AM, David Holmes wrote:
>>>>> Hi Coleen,
>>>>>
>>>>> On 16/12/2019 9:41 pm, coleen.phillimore at oracle.com wrote:
>>>>>> Summary: Start ServiceThread before compiler threads, and run
>>>>>> nmethod barriers for zgc before adding to the service thread
>>>>>> queue, or posting the events on the java thread queue.
>>>>>
>>>>> I can't comment on most of this but the earlier starting of the
>>>>> service thread has some concerns:
>>>>>
>>>>> - there is a lot of JDK level initialization which now will not
>>>>> have happened before the service thread is started and it is far
>>>>> from obvious that all possible initialization dependencies will be
>>>>> satisfied
>>>>
>>>> I agree that the order of initialization is very sensitive. From
>>>> the actions that the service thread does, the one that I found was
>>>> a problem was that events were posted before the LIVE phase (see
>>>> comment in has_events()), which could have happened with the
>>>> existing code, but the window for the race is a lot smaller. The
>>>> other actions can be run if there's a GC before initialization but
>>>> would be a bug in the initialization code, and I didn't find these
>>>> bugs in all my testing. There are some ordering dependencies that
>>>> do have odd side effects (between the compiler thread startup and
>>>> initialization jsr292 classes) which have comments. This patch
>>>> doesn't touch those.
>>>>
>>>>>
>>>>> - current starting of the service thread in Management::initialize
>>>>> is guarded by "#if INCLUDE_MANAGEMENT", but now you are starting
>>>>> the service thread unconditionally for all builds. Hmm just saw
>>>>> your latest comment to the bug report - so the service thread is
>>>>> now (for quite some time?) being used for other than management
>>>>> tasks and so should always be present even if INCLUDE_MANAGEMENT
>>>>> is not enabled. Is that sufficient or are there likely to be other
>>>>> changes needed to actually ensure that all works correctly e.g.
>>>>> any code the service thread executes that is only defined for
>>>>> INCLUDE_MANAGEMENT will need to be compiled out explicitly.
>>>>>
>>>>
>>>> I asked Jie offline to check the minimal build. I don't think
>>>> there are other INCLUDE_MANAGEMENT actions in the service thread
>>>> and I'm not sure why it was initialized there in the first place.
>>>> The minimal vm would have been broken ie. hashtables would not have
>>>> been cleaned up, etc, but I'm not sure how well that is tested or
>>>> if one would notice.
>>>>> - the service thread and the notification thread are (were?)
>>>>> closely related but now started at completely different times
>>>>
>>>> The notification thread is limited to "services" so it makes sense
>>>> where it is. The ServiceThread does lots of other things. Maybe
>>>> it needs renaming in 15.
>>>>>
>>>>> The bug report states the problem as:
>>>>>
>>>>> "The graal crash is because compiled_method_load events are added
>>>>> to the ServiceThread's deferred event queue before the
>>>>> ServiceThread is created so are not walked to keep them from being
>>>>> zombied."
>>>>>
>>>>> so why isn't the solution to ensure the deferred event queue is
>>>>> walked? I'm not clear how starting the service thread relates to
>>>>> walking the queue.
>>>>>
>>>>
>>>> The service thread is responsible for walking the deferred event
>>>> queue. See ServiceThread::oops_do/nmethods_do. The design could
>>>> be changed to have some global walk somewhere of this queue, but
>>>> essentially this queue is processed by the service thread.
>>>
>>> Sorry I don't follow. I thought "oops_do" and friends are for the GC
>>> threads and/or VMThread to call to process oops when GC updates them.
>>
>> The oops_do and nmethods_do() can be called by a thread walk in
>> handshakes (by the sweeper thread) and by parallel GC thread walks.
>> There isn't a single entry to do the thread-specific closures that we
>> need to do for these deferred event queues. I tried a version that
>> walked the queues with a static call but missed some places where it
>> would be needed to make this call (didn't work). Keeping this
>> associated with the ServiceThread simplifies a lot.
>
> Just to clarify that further, the thread walk requires the thread
> appears in ALL_JAVA_THREADS but that only happens after the
> ServiceThread has been started. So in essence we don't really need the
> ServiceThread to have commenced execution earlier, but we need it to
> have been created. Those two steps are combined in practice.
Yes. Then the ServiceThread waits on the Service_lock until notified by
these events:
while (((sensors_changed = (!UseNotificationThread &&
LowMemoryDetector::has_pending_requests())) |
(has_jvmti_events = _jvmti_service_queue.has_events()) |
(has_gc_notification_event = (!UseNotificationThread &&
GCNotifier::has_event())) |
(has_dcmd_notification_event = (!UseNotificationThread &&
DCmdFactory::has_pending_jmx_notification())) |
(stringtable_work = StringTable::has_work()) |
(symboltable_work = SymbolTable::has_work()) |
(resolved_method_table_work =
ResolvedMethodTable::has_work()) |
(thread_id_table_work = ThreadIdTable::has_work()) |
(protection_domain_table_work =
SystemDictionary::pd_cache_table()->has_work()) |
(oopstorage_work = OopStorage::has_cleanup_work_and_reset())
) == 0) {
The first, third and fourth events are from management.cpp events that
were initialized after the ServiceThread was started.
The second event I have changed, to wait until LIVE phase to return true.
The stringtable, symboltable, resolved_method_table, thread_id and pd
table have static _has_work variables initialized to false.
The oopstorage_work has similar, but may update a time-based counter a
bit earlier with the service thread starting earlier. I think this is
harmless.
It is possible that after the service thread starts and before the
compiler thread starts, there could be a GC that notifies the
stringtable to clean up. This seems like a good thing that the GC would
clean up these tables with this order. I ran the tier4 graal tests and
there were no failures.
Thanks,
Coleen
>
> Cheers,
> David
>
>> thanks,
>> Coleen
>>
>>>
>>> David
>>> -----
>>>
>>>> I had an additional change to make the queue non-static but want to
>>>> limit the change at this point.
>>>>
>>>> Thanks,
>>>> Coleen
>>>>> Thanks,
>>>>> David
>>>>>
>>>>>> See bug for description of the problems found with the new
>>>>>> Zombie.java test.
>>>>>>
>>>>>> open webrev at
>>>>>> http://cr.openjdk.java.net/~coleenp/2019/8235829.01/webrev
>>>>>> bug link https://bugs.openjdk.java.net/browse/JDK-8235829
>>>>>>
>>>>>> Ran tier1 all platforms, and tier2-8 testing, as well as
>>>>>> rerunning original test failure from bug
>>>>>> https://bugs.openjdk.java.net/browse/JDK-8173361.
>>>>>>
>>>>>> Thanks,
>>>>>> Coleen
>>>>
>>
More information about the serviceability-dev
mailing list