[14] RFR 8235829: graal crashes with Zombie.java test

Wed Dec 18 16:42:45 UTC 2019

On 12/18/19 8:45 AM, David Holmes wrote:
> Thanks for the additional info Coleen!
>
> Just to add a bit more to the initialization history. The 
> ServiceThread is a generalization of the LowMemoryDetectorThread that 
> was part of the management API, and so it was initialized in 
> Management::initialize. When it turned into the ServiceThread - to 
> process JVMTI deferred events in addition to the low-memory-detector 
> events - the initialization placement remained the same. Then later 
> the INCLUDE_MANAGEMENT guards were added (JDK-7189254, October 2012). 
> Later still we started adding other items of work for the 
> ServiceThread. The earliest was the AllocationContextService 
> notification in September 2014 but as that no longer exists I can't 
> tell if that was the first non-management related use. Then the 
> StringTable use was added 18 months ago - which definitely was outside 
> the realm of the management API. So that is when the MinimalVM was 
> first "broken". So it is good that is fixed.
>
> With regard to the placement in the initialization order, my remaining 
> concern was with JVMTI event processing that might happen via events 
> generated very early in the init sequence. But you have now modified 
> things so that we will only process events in the LIVE phase, which 
> only activates after all the class library initialization is complete.
>
> So overall I'm no longer significantly concerned about the change to 
> the initialization order as I think you have it all covered. Thanks 
> for bearing with me and all the off-list discussion.

Thank you for having this discussion with me and provoking me to recheck 
the ServiceThread.   I think we can do further work to future-proof 
initialization order but the design needs to be improved.

Thanks for reviewing,
Coleen
>
> Cheers,
> David
> -----
>
> On 18/12/2019 1:27 am, coleen.phillimore at oracle.com wrote:
>>
>>
>> On 12/16/19 11:04 PM, David Holmes wrote:
>>> Clarification ...
>>>
>>> On 17/12/2019 12:40 pm, coleen.phillimore at oracle.com wrote:
>>>>
>>>> Short answer below.
>>>>
>>>> On 12/16/19 5:51 PM, David Holmes wrote:
>>>>> Hi Coleen,
>>>>>
>>>>> A quick initial response ...
>>>>>
>>>>> On 16/12/2019 11:26 pm, coleen.phillimore at oracle.com wrote:
>>>>>>
>>>>>>
>>>>>> On 12/16/19 8:04 AM, David Holmes wrote:
>>>>>>> Hi Coleen,
>>>>>>>
>>>>>>> On 16/12/2019 9:41 pm, coleen.phillimore at oracle.com wrote:
>>>>>>>> Summary: Start ServiceThread before compiler threads, and run 
>>>>>>>> nmethod barriers for zgc before adding to the service thread 
>>>>>>>> queue, or posting the events on the java thread queue.
>>>>>>>
>>>>>>> I can't comment on most of this but the earlier starting of the 
>>>>>>> service thread has some concerns:
>>>>>>>
>>>>>>> - there is a lot of JDK level initialization which now will not 
>>>>>>> have happened before the service thread is started and it is far 
>>>>>>> from obvious that all possible initialization dependencies will 
>>>>>>> be satisfied
>>>>>>
>>>>>> I agree that the order of initialization is very sensitive. From 
>>>>>> the actions that the service thread does, the one that I found 
>>>>>> was a problem was that events were posted before the LIVE phase 
>>>>>> (see comment in has_events()), which could have happened with the 
>>>>>> existing code, but the window for the race is a lot smaller.   
>>>>>> The other actions can be run if there's a GC before 
>>>>>> initialization but would be a bug in the initialization code, and 
>>>>>> I didn't find these bugs in all my testing. There are some 
>>>>>> ordering dependencies that do have odd side effects (between the 
>>>>>> compiler thread startup and initialization jsr292 classes) which 
>>>>>> have comments.  This patch doesn't touch those.
>>>>>>
>>>>>>>
>>>>>>> - current starting of the service thread in 
>>>>>>> Management::initialize is guarded by "#if INCLUDE_MANAGEMENT", 
>>>>>>> but now you are starting the service thread unconditionally for 
>>>>>>> all builds. Hmm just saw your latest comment to the bug report - 
>>>>>>> so the service thread is now (for quite some time?) being used 
>>>>>>> for other than management tasks and so should always be present 
>>>>>>> even if INCLUDE_MANAGEMENT is not enabled. Is that sufficient or 
>>>>>>> are there likely to be other changes needed to actually ensure 
>>>>>>> that all works correctly  e.g. any code the service thread 
>>>>>>> executes that is only defined for INCLUDE_MANAGEMENT will need 
>>>>>>> to be compiled out explicitly.
>>>>>>>
>>>>>>
>>>>>> I asked Jie offline to check the minimal build.  I don't think 
>>>>>> there are other INCLUDE_MANAGEMENT actions in the service thread 
>>>>>> and I'm not sure why it was initialized there in the first place. 
>>>>>> The minimal vm would have been broken ie. hashtables would not 
>>>>>> have been cleaned up, etc, but I'm not sure how well that is 
>>>>>> tested or if one would notice.
>>>>>>> - the service thread and the notification thread are (were?) 
>>>>>>> closely related but now started at completely different times
>>>>>>
>>>>>> The notification thread is limited to "services" so it makes 
>>>>>> sense where it is.  The ServiceThread does lots of other things.  
>>>>>> Maybe it needs renaming in 15.
>>>>>>>
>>>>>>> The bug report states the problem as:
>>>>>>>
>>>>>>> "The graal crash is because compiled_method_load events are 
>>>>>>> added to the ServiceThread's deferred event queue before the 
>>>>>>> ServiceThread is created so are not walked to keep them from 
>>>>>>> being zombied."
>>>>>>>
>>>>>>> so why isn't the solution to ensure the deferred event queue is 
>>>>>>> walked? I'm not clear how starting the service thread relates to 
>>>>>>> walking the queue.
>>>>>>>
>>>>>>
>>>>>> The service thread is responsible for walking the deferred event 
>>>>>> queue.   See ServiceThread::oops_do/nmethods_do.   The design 
>>>>>> could be changed to have some global walk somewhere of this 
>>>>>> queue, but essentially this queue is processed by the service 
>>>>>> thread.
>>>>>
>>>>> Sorry I don't follow. I thought "oops_do" and friends are for the 
>>>>> GC threads and/or VMThread to call to process oops when GC updates 
>>>>> them.
>>>>
>>>> The oops_do and nmethods_do() can be called by a thread walk in 
>>>> handshakes (by the sweeper thread) and by parallel GC thread walks. 
>>>> There isn't a single entry to do the thread-specific closures that 
>>>> we need to do for these deferred event queues.   I tried a version 
>>>> that walked the queues with a static call but missed some places 
>>>> where it would be needed to make this call (didn't work).  Keeping 
>>>> this associated with the ServiceThread simplifies a lot.
>>>
>>> Just to clarify that further, the thread walk requires the thread 
>>> appears in ALL_JAVA_THREADS but that only happens after the 
>>> ServiceThread has been started. So in essence we don't really need 
>>> the ServiceThread to have commenced execution earlier, but we need 
>>> it to have been created. Those two steps are combined in practice.
>>
>> Yes.  Then the ServiceThread waits on the Service_lock until notified 
>> by these events:
>>
>>        while (((sensors_changed = (!UseNotificationThread && 
>> LowMemoryDetector::has_pending_requests())) |
>>                (has_jvmti_events = _jvmti_service_queue.has_events()) |
>>                (has_gc_notification_event = (!UseNotificationThread 
>> && GCNotifier::has_event())) |
>>                (has_dcmd_notification_event = (!UseNotificationThread 
>> && DCmdFactory::has_pending_jmx_notification())) |
>>                (stringtable_work = StringTable::has_work()) |
>>                (symboltable_work = SymbolTable::has_work()) |
>>                (resolved_method_table_work = 
>> ResolvedMethodTable::has_work()) |
>>                (thread_id_table_work = ThreadIdTable::has_work()) |
>>                (protection_domain_table_work = 
>> SystemDictionary::pd_cache_table()->has_work()) |
>>                (oopstorage_work = 
>> OopStorage::has_cleanup_work_and_reset())
>>               ) == 0) {
>>
>> The first, third and fourth events are from management.cpp events 
>> that were initialized after the ServiceThread was started.
>> The second event I have changed, to wait until LIVE phase to return 
>> true.
>> The stringtable, symboltable, resolved_method_table, thread_id and pd 
>> table have static _has_work variables initialized to false.
>> The oopstorage_work has similar, but may update a time-based counter 
>> a bit earlier with the service thread starting earlier. I think this 
>> is harmless.
>>
>> It is possible that after the service thread starts and before the 
>> compiler thread starts, there could be a GC that notifies the 
>> stringtable to clean up.  This seems like a good thing that the GC 
>> would clean up these tables with this order.  I ran the tier4 graal 
>> tests and there were no failures.
>>
>> Thanks,
>> Coleen
>>>
>>> Cheers,
>>> David
>>>
>>>> thanks,
>>>> Coleen
>>>>
>>>>>
>>>>> David
>>>>> -----
>>>>>
>>>>>> I had an additional change to make the queue non-static but want 
>>>>>> to limit the change at this point.
>>>>>>
>>>>>> Thanks,
>>>>>> Coleen
>>>>>>> Thanks,
>>>>>>> David
>>>>>>>
>>>>>>>> See bug for description of the problems found with the new 
>>>>>>>> Zombie.java test.
>>>>>>>>
>>>>>>>> open webrev at 
>>>>>>>> http://cr.openjdk.java.net/~coleenp/2019/8235829.01/webrev
>>>>>>>> bug link https://bugs.openjdk.java.net/browse/JDK-8235829
>>>>>>>>
>>>>>>>> Ran tier1 all platforms, and tier2-8 testing, as well as 
>>>>>>>> rerunning original test failure from bug 
>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8173361.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Coleen
>>>>>>
>>>>
>>