[14] RFR 8235829: graal crashes with Zombie.java test

Tue Dec 17 15:27:08 UTC 2019

On 12/16/19 11:04 PM, David Holmes wrote:
> Clarification ...
>
> On 17/12/2019 12:40 pm, coleen.phillimore at oracle.com wrote:
>>
>> Short answer below.
>>
>> On 12/16/19 5:51 PM, David Holmes wrote:
>>> Hi Coleen,
>>>
>>> A quick initial response ...
>>>
>>> On 16/12/2019 11:26 pm, coleen.phillimore at oracle.com wrote:
>>>>
>>>>
>>>> On 12/16/19 8:04 AM, David Holmes wrote:
>>>>> Hi Coleen,
>>>>>
>>>>> On 16/12/2019 9:41 pm, coleen.phillimore at oracle.com wrote:
>>>>>> Summary: Start ServiceThread before compiler threads, and run 
>>>>>> nmethod barriers for zgc before adding to the service thread 
>>>>>> queue, or posting the events on the java thread queue.
>>>>>
>>>>> I can't comment on most of this but the earlier starting of the 
>>>>> service thread has some concerns:
>>>>>
>>>>> - there is a lot of JDK level initialization which now will not 
>>>>> have happened before the service thread is started and it is far 
>>>>> from obvious that all possible initialization dependencies will be 
>>>>> satisfied
>>>>
>>>> I agree that the order of initialization is very sensitive. From 
>>>> the actions that the service thread does, the one that I found was 
>>>> a problem was that events were posted before the LIVE phase (see 
>>>> comment in has_events()), which could have happened with the 
>>>> existing code, but the window for the race is a lot smaller.   The 
>>>> other actions can be run if there's a GC before initialization but 
>>>> would be a bug in the initialization code, and I didn't find these 
>>>> bugs in all my testing. There are some ordering dependencies that 
>>>> do have odd side effects (between the compiler thread startup and 
>>>> initialization jsr292 classes) which have comments.  This patch 
>>>> doesn't touch those.
>>>>
>>>>>
>>>>> - current starting of the service thread in Management::initialize 
>>>>> is guarded by "#if INCLUDE_MANAGEMENT", but now you are starting 
>>>>> the service thread unconditionally for all builds. Hmm just saw 
>>>>> your latest comment to the bug report - so the service thread is 
>>>>> now (for quite some time?) being used for other than management 
>>>>> tasks and so should always be present even if INCLUDE_MANAGEMENT 
>>>>> is not enabled. Is that sufficient or are there likely to be other 
>>>>> changes needed to actually ensure that all works correctly  e.g. 
>>>>> any code the service thread executes that is only defined for 
>>>>> INCLUDE_MANAGEMENT will need to be compiled out explicitly.
>>>>>
>>>>
>>>> I asked Jie offline to check the minimal build.  I don't think 
>>>> there are other INCLUDE_MANAGEMENT actions in the service thread 
>>>> and I'm not sure why it was initialized there in the first place.  
>>>> The minimal vm would have been broken ie. hashtables would not have 
>>>> been cleaned up, etc, but I'm not sure how well that is tested or 
>>>> if one would notice.
>>>>> - the service thread and the notification thread are (were?) 
>>>>> closely related but now started at completely different times
>>>>
>>>> The notification thread is limited to "services" so it makes sense 
>>>> where it is.  The ServiceThread does lots of other things.  Maybe 
>>>> it needs renaming in 15.
>>>>>
>>>>> The bug report states the problem as:
>>>>>
>>>>> "The graal crash is because compiled_method_load events are added 
>>>>> to the ServiceThread's deferred event queue before the 
>>>>> ServiceThread is created so are not walked to keep them from being 
>>>>> zombied."
>>>>>
>>>>> so why isn't the solution to ensure the deferred event queue is 
>>>>> walked? I'm not clear how starting the service thread relates to 
>>>>> walking the queue.
>>>>>
>>>>
>>>> The service thread is responsible for walking the deferred event 
>>>> queue.   See ServiceThread::oops_do/nmethods_do.   The design could 
>>>> be changed to have some global walk somewhere of this queue, but 
>>>> essentially this queue is processed by the service thread.
>>>
>>> Sorry I don't follow. I thought "oops_do" and friends are for the GC 
>>> threads and/or VMThread to call to process oops when GC updates them.
>>
>> The oops_do and nmethods_do() can be called by a thread walk in 
>> handshakes (by the sweeper thread) and by parallel GC thread walks. 
>> There isn't a single entry to do the thread-specific closures that we 
>> need to do for these deferred event queues.   I tried a version that 
>> walked the queues with a static call but missed some places where it 
>> would be needed to make this call (didn't work).  Keeping this 
>> associated with the ServiceThread simplifies a lot.
>
> Just to clarify that further, the thread walk requires the thread 
> appears in ALL_JAVA_THREADS but that only happens after the 
> ServiceThread has been started. So in essence we don't really need the 
> ServiceThread to have commenced execution earlier, but we need it to 
> have been created. Those two steps are combined in practice.

Yes.  Then the ServiceThread waits on the Service_lock until notified by 
these events:

       while (((sensors_changed = (!UseNotificationThread && 
LowMemoryDetector::has_pending_requests())) |
               (has_jvmti_events = _jvmti_service_queue.has_events()) |
               (has_gc_notification_event = (!UseNotificationThread && 
GCNotifier::has_event())) |
               (has_dcmd_notification_event = (!UseNotificationThread && 
DCmdFactory::has_pending_jmx_notification())) |
               (stringtable_work = StringTable::has_work()) |
               (symboltable_work = SymbolTable::has_work()) |
               (resolved_method_table_work = 
ResolvedMethodTable::has_work()) |
               (thread_id_table_work = ThreadIdTable::has_work()) |
               (protection_domain_table_work = 
SystemDictionary::pd_cache_table()->has_work()) |
               (oopstorage_work = OopStorage::has_cleanup_work_and_reset())
              ) == 0) {

The first, third and fourth events are from management.cpp events that 
were initialized after the ServiceThread was started.
The second event I have changed, to wait until LIVE phase to return true.
The stringtable, symboltable, resolved_method_table, thread_id and pd 
table have static _has_work variables initialized to false.
The oopstorage_work has similar, but may update a time-based counter a 
bit earlier with the service thread starting earlier.  I think this is 
harmless.

It is possible that after the service thread starts and before the 
compiler thread starts, there could be a GC that notifies the 
stringtable to clean up.  This seems like a good thing that the GC would 
clean up these tables with this order.  I ran the tier4 graal tests and 
there were no failures.

Thanks,
Coleen
>
> Cheers,
> David
>
>> thanks,
>> Coleen
>>
>>>
>>> David
>>> -----
>>>
>>>> I had an additional change to make the queue non-static but want to 
>>>> limit the change at this point.
>>>>
>>>> Thanks,
>>>> Coleen
>>>>> Thanks,
>>>>> David
>>>>>
>>>>>> See bug for description of the problems found with the new 
>>>>>> Zombie.java test.
>>>>>>
>>>>>> open webrev at 
>>>>>> http://cr.openjdk.java.net/~coleenp/2019/8235829.01/webrev
>>>>>> bug link https://bugs.openjdk.java.net/browse/JDK-8235829
>>>>>>
>>>>>> Ran tier1 all platforms, and tier2-8 testing, as well as 
>>>>>> rerunning original test failure from bug 
>>>>>> https://bugs.openjdk.java.net/browse/JDK-8173361.
>>>>>>
>>>>>> Thanks,
>>>>>> Coleen
>>>>
>>