[14] RFR 8235829: graal crashes with Zombie.java test
David Holmes
david.holmes at oracle.com
Wed Dec 18 13:45:25 UTC 2019
Thanks for the additional info Coleen!
Just to add a bit more to the initialization history. The ServiceThread
is a generalization of the LowMemoryDetectorThread that was part of the
management API, and so it was initialized in Management::initialize.
When it turned into the ServiceThread - to process JVMTI deferred events
in addition to the low-memory-detector events - the initialization
placement remained the same. Then later the INCLUDE_MANAGEMENT guards
were added (JDK-7189254, October 2012). Later still we started adding
other items of work for the ServiceThread. The earliest was the
AllocationContextService notification in September 2014 but as that no
longer exists I can't tell if that was the first non-management related
use. Then the StringTable use was added 18 months ago - which definitely
was outside the realm of the management API. So that is when the
MinimalVM was first "broken". So it is good that is fixed.
With regard to the placement in the initialization order, my remaining
concern was with JVMTI event processing that might happen via events
generated very early in the init sequence. But you have now modified
things so that we will only process events in the LIVE phase, which only
activates after all the class library initialization is complete.
So overall I'm no longer significantly concerned about the change to the
initialization order as I think you have it all covered. Thanks for
bearing with me and all the off-list discussion.
Cheers,
David
-----
On 18/12/2019 1:27 am, coleen.phillimore at oracle.com wrote:
>
>
> On 12/16/19 11:04 PM, David Holmes wrote:
>> Clarification ...
>>
>> On 17/12/2019 12:40 pm, coleen.phillimore at oracle.com wrote:
>>>
>>> Short answer below.
>>>
>>> On 12/16/19 5:51 PM, David Holmes wrote:
>>>> Hi Coleen,
>>>>
>>>> A quick initial response ...
>>>>
>>>> On 16/12/2019 11:26 pm, coleen.phillimore at oracle.com wrote:
>>>>>
>>>>>
>>>>> On 12/16/19 8:04 AM, David Holmes wrote:
>>>>>> Hi Coleen,
>>>>>>
>>>>>> On 16/12/2019 9:41 pm, coleen.phillimore at oracle.com wrote:
>>>>>>> Summary: Start ServiceThread before compiler threads, and run
>>>>>>> nmethod barriers for zgc before adding to the service thread
>>>>>>> queue, or posting the events on the java thread queue.
>>>>>>
>>>>>> I can't comment on most of this but the earlier starting of the
>>>>>> service thread has some concerns:
>>>>>>
>>>>>> - there is a lot of JDK level initialization which now will not
>>>>>> have happened before the service thread is started and it is far
>>>>>> from obvious that all possible initialization dependencies will be
>>>>>> satisfied
>>>>>
>>>>> I agree that the order of initialization is very sensitive. From
>>>>> the actions that the service thread does, the one that I found was
>>>>> a problem was that events were posted before the LIVE phase (see
>>>>> comment in has_events()), which could have happened with the
>>>>> existing code, but the window for the race is a lot smaller. The
>>>>> other actions can be run if there's a GC before initialization but
>>>>> would be a bug in the initialization code, and I didn't find these
>>>>> bugs in all my testing. There are some ordering dependencies that
>>>>> do have odd side effects (between the compiler thread startup and
>>>>> initialization jsr292 classes) which have comments. This patch
>>>>> doesn't touch those.
>>>>>
>>>>>>
>>>>>> - current starting of the service thread in Management::initialize
>>>>>> is guarded by "#if INCLUDE_MANAGEMENT", but now you are starting
>>>>>> the service thread unconditionally for all builds. Hmm just saw
>>>>>> your latest comment to the bug report - so the service thread is
>>>>>> now (for quite some time?) being used for other than management
>>>>>> tasks and so should always be present even if INCLUDE_MANAGEMENT
>>>>>> is not enabled. Is that sufficient or are there likely to be other
>>>>>> changes needed to actually ensure that all works correctly e.g.
>>>>>> any code the service thread executes that is only defined for
>>>>>> INCLUDE_MANAGEMENT will need to be compiled out explicitly.
>>>>>>
>>>>>
>>>>> I asked Jie offline to check the minimal build. I don't think
>>>>> there are other INCLUDE_MANAGEMENT actions in the service thread
>>>>> and I'm not sure why it was initialized there in the first place.
>>>>> The minimal vm would have been broken ie. hashtables would not have
>>>>> been cleaned up, etc, but I'm not sure how well that is tested or
>>>>> if one would notice.
>>>>>> - the service thread and the notification thread are (were?)
>>>>>> closely related but now started at completely different times
>>>>>
>>>>> The notification thread is limited to "services" so it makes sense
>>>>> where it is. The ServiceThread does lots of other things. Maybe
>>>>> it needs renaming in 15.
>>>>>>
>>>>>> The bug report states the problem as:
>>>>>>
>>>>>> "The graal crash is because compiled_method_load events are added
>>>>>> to the ServiceThread's deferred event queue before the
>>>>>> ServiceThread is created so are not walked to keep them from being
>>>>>> zombied."
>>>>>>
>>>>>> so why isn't the solution to ensure the deferred event queue is
>>>>>> walked? I'm not clear how starting the service thread relates to
>>>>>> walking the queue.
>>>>>>
>>>>>
>>>>> The service thread is responsible for walking the deferred event
>>>>> queue. See ServiceThread::oops_do/nmethods_do. The design could
>>>>> be changed to have some global walk somewhere of this queue, but
>>>>> essentially this queue is processed by the service thread.
>>>>
>>>> Sorry I don't follow. I thought "oops_do" and friends are for the GC
>>>> threads and/or VMThread to call to process oops when GC updates them.
>>>
>>> The oops_do and nmethods_do() can be called by a thread walk in
>>> handshakes (by the sweeper thread) and by parallel GC thread walks.
>>> There isn't a single entry to do the thread-specific closures that we
>>> need to do for these deferred event queues. I tried a version that
>>> walked the queues with a static call but missed some places where it
>>> would be needed to make this call (didn't work). Keeping this
>>> associated with the ServiceThread simplifies a lot.
>>
>> Just to clarify that further, the thread walk requires the thread
>> appears in ALL_JAVA_THREADS but that only happens after the
>> ServiceThread has been started. So in essence we don't really need the
>> ServiceThread to have commenced execution earlier, but we need it to
>> have been created. Those two steps are combined in practice.
>
> Yes. Then the ServiceThread waits on the Service_lock until notified by
> these events:
>
> while (((sensors_changed = (!UseNotificationThread &&
> LowMemoryDetector::has_pending_requests())) |
> (has_jvmti_events = _jvmti_service_queue.has_events()) |
> (has_gc_notification_event = (!UseNotificationThread &&
> GCNotifier::has_event())) |
> (has_dcmd_notification_event = (!UseNotificationThread &&
> DCmdFactory::has_pending_jmx_notification())) |
> (stringtable_work = StringTable::has_work()) |
> (symboltable_work = SymbolTable::has_work()) |
> (resolved_method_table_work =
> ResolvedMethodTable::has_work()) |
> (thread_id_table_work = ThreadIdTable::has_work()) |
> (protection_domain_table_work =
> SystemDictionary::pd_cache_table()->has_work()) |
> (oopstorage_work = OopStorage::has_cleanup_work_and_reset())
> ) == 0) {
>
> The first, third and fourth events are from management.cpp events that
> were initialized after the ServiceThread was started.
> The second event I have changed, to wait until LIVE phase to return true.
> The stringtable, symboltable, resolved_method_table, thread_id and pd
> table have static _has_work variables initialized to false.
> The oopstorage_work has similar, but may update a time-based counter a
> bit earlier with the service thread starting earlier. I think this is
> harmless.
>
> It is possible that after the service thread starts and before the
> compiler thread starts, there could be a GC that notifies the
> stringtable to clean up. This seems like a good thing that the GC would
> clean up these tables with this order. I ran the tier4 graal tests and
> there were no failures.
>
> Thanks,
> Coleen
>>
>> Cheers,
>> David
>>
>>> thanks,
>>> Coleen
>>>
>>>>
>>>> David
>>>> -----
>>>>
>>>>> I had an additional change to make the queue non-static but want to
>>>>> limit the change at this point.
>>>>>
>>>>> Thanks,
>>>>> Coleen
>>>>>> Thanks,
>>>>>> David
>>>>>>
>>>>>>> See bug for description of the problems found with the new
>>>>>>> Zombie.java test.
>>>>>>>
>>>>>>> open webrev at
>>>>>>> http://cr.openjdk.java.net/~coleenp/2019/8235829.01/webrev
>>>>>>> bug link https://bugs.openjdk.java.net/browse/JDK-8235829
>>>>>>>
>>>>>>> Ran tier1 all platforms, and tier2-8 testing, as well as
>>>>>>> rerunning original test failure from bug
>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8173361.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Coleen
>>>>>
>>>
>
More information about the serviceability-dev
mailing list