[14] RFR 8235829: graal crashes with Zombie.java test

Wed Dec 18 22:06:08 UTC 2019

Thanks Serguei!
Coleen

On 12/18/19 1:33 PM, serguei.spitsyn at oracle.com wrote:
> Hi Coleen,
>
> Just wanted to confirm the webrev V2 version looks okay to me.
> Sorry for replying on the wrong mailing thread.
>
> Thanks,
> Serguei
>
>
> On 12/18/19 08:42, coleen.phillimore at oracle.com wrote:
>>
>>
>> On 12/18/19 8:45 AM, David Holmes wrote:
>>> Thanks for the additional info Coleen!
>>>
>>> Just to add a bit more to the initialization history. The 
>>> ServiceThread is a generalization of the LowMemoryDetectorThread 
>>> that was part of the management API, and so it was initialized in 
>>> Management::initialize. When it turned into the ServiceThread - to 
>>> process JVMTI deferred events in addition to the low-memory-detector 
>>> events - the initialization placement remained the same. Then later 
>>> the INCLUDE_MANAGEMENT guards were added (JDK-7189254, October 
>>> 2012). Later still we started adding other items of work for the 
>>> ServiceThread. The earliest was the AllocationContextService 
>>> notification in September 2014 but as that no longer exists I can't 
>>> tell if that was the first non-management related use. Then the 
>>> StringTable use was added 18 months ago - which definitely was 
>>> outside the realm of the management API. So that is when the 
>>> MinimalVM was first "broken". So it is good that is fixed.
>>>
>>> With regard to the placement in the initialization order, my 
>>> remaining concern was with JVMTI event processing that might happen 
>>> via events generated very early in the init sequence. But you have 
>>> now modified things so that we will only process events in the LIVE 
>>> phase, which only activates after all the class library 
>>> initialization is complete.
>>>
>>> So overall I'm no longer significantly concerned about the change to 
>>> the initialization order as I think you have it all covered. Thanks 
>>> for bearing with me and all the off-list discussion.
>>
>> Thank you for having this discussion with me and provoking me to 
>> recheck the ServiceThread.   I think we can do further work to 
>> future-proof initialization order but the design needs to be improved.
>>
>> Thanks for reviewing,
>> Coleen
>>>
>>> Cheers,
>>> David
>>> -----
>>>
>>> On 18/12/2019 1:27 am, coleen.phillimore at oracle.com wrote:
>>>>
>>>>
>>>> On 12/16/19 11:04 PM, David Holmes wrote:
>>>>> Clarification ...
>>>>>
>>>>> On 17/12/2019 12:40 pm, coleen.phillimore at oracle.com wrote:
>>>>>>
>>>>>> Short answer below.
>>>>>>
>>>>>> On 12/16/19 5:51 PM, David Holmes wrote:
>>>>>>> Hi Coleen,
>>>>>>>
>>>>>>> A quick initial response ...
>>>>>>>
>>>>>>> On 16/12/2019 11:26 pm, coleen.phillimore at oracle.com wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 12/16/19 8:04 AM, David Holmes wrote:
>>>>>>>>> Hi Coleen,
>>>>>>>>>
>>>>>>>>> On 16/12/2019 9:41 pm, coleen.phillimore at oracle.com wrote:
>>>>>>>>>> Summary: Start ServiceThread before compiler threads, and run 
>>>>>>>>>> nmethod barriers for zgc before adding to the service thread 
>>>>>>>>>> queue, or posting the events on the java thread queue.
>>>>>>>>>
>>>>>>>>> I can't comment on most of this but the earlier starting of 
>>>>>>>>> the service thread has some concerns:
>>>>>>>>>
>>>>>>>>> - there is a lot of JDK level initialization which now will 
>>>>>>>>> not have happened before the service thread is started and it 
>>>>>>>>> is far from obvious that all possible initialization 
>>>>>>>>> dependencies will be satisfied
>>>>>>>>
>>>>>>>> I agree that the order of initialization is very sensitive. 
>>>>>>>> From the actions that the service thread does, the one that I 
>>>>>>>> found was a problem was that events were posted before the LIVE 
>>>>>>>> phase (see comment in has_events()), which could have happened 
>>>>>>>> with the existing code, but the window for the race is a lot 
>>>>>>>> smaller.   The other actions can be run if there's a GC before 
>>>>>>>> initialization but would be a bug in the initialization code, 
>>>>>>>> and I didn't find these bugs in all my testing. There are some 
>>>>>>>> ordering dependencies that do have odd side effects (between 
>>>>>>>> the compiler thread startup and initialization jsr292 classes) 
>>>>>>>> which have comments. This patch doesn't touch those.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> - current starting of the service thread in 
>>>>>>>>> Management::initialize is guarded by "#if INCLUDE_MANAGEMENT", 
>>>>>>>>> but now you are starting the service thread unconditionally 
>>>>>>>>> for all builds. Hmm just saw your latest comment to the bug 
>>>>>>>>> report - so the service thread is now (for quite some time?) 
>>>>>>>>> being used for other than management tasks and so should 
>>>>>>>>> always be present even if INCLUDE_MANAGEMENT is not enabled. 
>>>>>>>>> Is that sufficient or are there likely to be other changes 
>>>>>>>>> needed to actually ensure that all works correctly  e.g. any 
>>>>>>>>> code the service thread executes that is only defined for 
>>>>>>>>> INCLUDE_MANAGEMENT will need to be compiled out explicitly.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I asked Jie offline to check the minimal build.  I don't think 
>>>>>>>> there are other INCLUDE_MANAGEMENT actions in the service 
>>>>>>>> thread and I'm not sure why it was initialized there in the 
>>>>>>>> first place. The minimal vm would have been broken ie. 
>>>>>>>> hashtables would not have been cleaned up, etc, but I'm not 
>>>>>>>> sure how well that is tested or if one would notice.
>>>>>>>>> - the service thread and the notification thread are (were?) 
>>>>>>>>> closely related but now started at completely different times
>>>>>>>>
>>>>>>>> The notification thread is limited to "services" so it makes 
>>>>>>>> sense where it is.  The ServiceThread does lots of other 
>>>>>>>> things.  Maybe it needs renaming in 15.
>>>>>>>>>
>>>>>>>>> The bug report states the problem as:
>>>>>>>>>
>>>>>>>>> "The graal crash is because compiled_method_load events are 
>>>>>>>>> added to the ServiceThread's deferred event queue before the 
>>>>>>>>> ServiceThread is created so are not walked to keep them from 
>>>>>>>>> being zombied."
>>>>>>>>>
>>>>>>>>> so why isn't the solution to ensure the deferred event queue 
>>>>>>>>> is walked? I'm not clear how starting the service thread 
>>>>>>>>> relates to walking the queue.
>>>>>>>>>
>>>>>>>>
>>>>>>>> The service thread is responsible for walking the deferred 
>>>>>>>> event queue.   See ServiceThread::oops_do/nmethods_do.   The 
>>>>>>>> design could be changed to have some global walk somewhere of 
>>>>>>>> this queue, but essentially this queue is processed by the 
>>>>>>>> service thread.
>>>>>>>
>>>>>>> Sorry I don't follow. I thought "oops_do" and friends are for 
>>>>>>> the GC threads and/or VMThread to call to process oops when GC 
>>>>>>> updates them.
>>>>>>
>>>>>> The oops_do and nmethods_do() can be called by a thread walk in 
>>>>>> handshakes (by the sweeper thread) and by parallel GC thread 
>>>>>> walks. There isn't a single entry to do the thread-specific 
>>>>>> closures that we need to do for these deferred event queues.   I 
>>>>>> tried a version that walked the queues with a static call but 
>>>>>> missed some places where it would be needed to make this call 
>>>>>> (didn't work).  Keeping this associated with the ServiceThread 
>>>>>> simplifies a lot.
>>>>>
>>>>> Just to clarify that further, the thread walk requires the thread 
>>>>> appears in ALL_JAVA_THREADS but that only happens after the 
>>>>> ServiceThread has been started. So in essence we don't really need 
>>>>> the ServiceThread to have commenced execution earlier, but we need 
>>>>> it to have been created. Those two steps are combined in practice.
>>>>
>>>> Yes.  Then the ServiceThread waits on the Service_lock until 
>>>> notified by these events:
>>>>
>>>>        while (((sensors_changed = (!UseNotificationThread && 
>>>> LowMemoryDetector::has_pending_requests())) |
>>>>                (has_jvmti_events = 
>>>> _jvmti_service_queue.has_events()) |
>>>>                (has_gc_notification_event = (!UseNotificationThread 
>>>> && GCNotifier::has_event())) |
>>>>                (has_dcmd_notification_event = 
>>>> (!UseNotificationThread && 
>>>> DCmdFactory::has_pending_jmx_notification())) |
>>>>                (stringtable_work = StringTable::has_work()) |
>>>>                (symboltable_work = SymbolTable::has_work()) |
>>>>                (resolved_method_table_work = 
>>>> ResolvedMethodTable::has_work()) |
>>>>                (thread_id_table_work = ThreadIdTable::has_work()) |
>>>>                (protection_domain_table_work = 
>>>> SystemDictionary::pd_cache_table()->has_work()) |
>>>>                (oopstorage_work = 
>>>> OopStorage::has_cleanup_work_and_reset())
>>>>               ) == 0) {
>>>>
>>>> The first, third and fourth events are from management.cpp events 
>>>> that were initialized after the ServiceThread was started.
>>>> The second event I have changed, to wait until LIVE phase to return 
>>>> true.
>>>> The stringtable, symboltable, resolved_method_table, thread_id and 
>>>> pd table have static _has_work variables initialized to false.
>>>> The oopstorage_work has similar, but may update a time-based 
>>>> counter a bit earlier with the service thread starting earlier. I 
>>>> think this is harmless.
>>>>
>>>> It is possible that after the service thread starts and before the 
>>>> compiler thread starts, there could be a GC that notifies the 
>>>> stringtable to clean up.  This seems like a good thing that the GC 
>>>> would clean up these tables with this order.  I ran the tier4 graal 
>>>> tests and there were no failures.
>>>>
>>>> Thanks,
>>>> Coleen
>>>>>
>>>>> Cheers,
>>>>> David
>>>>>
>>>>>> thanks,
>>>>>> Coleen
>>>>>>
>>>>>>>
>>>>>>> David
>>>>>>> -----
>>>>>>>
>>>>>>>> I had an additional change to make the queue non-static but 
>>>>>>>> want to limit the change at this point.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Coleen
>>>>>>>>> Thanks,
>>>>>>>>> David
>>>>>>>>>
>>>>>>>>>> See bug for description of the problems found with the new 
>>>>>>>>>> Zombie.java test.
>>>>>>>>>>
>>>>>>>>>> open webrev at 
>>>>>>>>>> http://cr.openjdk.java.net/~coleenp/2019/8235829.01/webrev
>>>>>>>>>> bug link https://bugs.openjdk.java.net/browse/JDK-8235829
>>>>>>>>>>
>>>>>>>>>> Ran tier1 all platforms, and tier2-8 testing, as well as 
>>>>>>>>>> rerunning original test failure from bug 
>>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8173361.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Coleen
>>>>>>>>
>>>>>>
>>>>
>>
>