RFR: JDK-8148992: VM can hang on exit if root region scanning is initiated but not executed

Wed Feb 10 09:02:58 UTC 2016

Hi Jesper (and everyone),

On 2016-02-08 15:50, Jesper Wilhelmsson wrote:
> Looks good!

Thanks for looking at this!

I found a bug in the last webrev. The 
G1ConcurrentMark::scanRootRegions()  has the side effect that it calls 
ClassLoaderDataGraph::clear_claimed_marks(). It is not ok to skip this 
call just because there are no root regions to scan. So, I can't guard 
the call to scanRootRegions() with "if 
(_cm->root_regions()->scan_in_progress())" as I did in webrev.02.

I find this side effect a bit odd. So, I moved 
ClassLoaderDataGraph::clear_claimed_marks() out to its own phase. This 
will also make it show up clearer in our logs if it starts taking a long 
time.

Instead of checking scan_in_progress() both before the call to 
scanRootRegions() and inside of it I moved the assert(!has_aborted()) in 
to scanRootRegions().

Here's an updated webrev:
http://cr.openjdk.java.net/~brutisso/8148992/webrev.03/

And here's a diff compared to the last version:
http://cr.openjdk.java.net/~brutisso/8148992/webrev.02-03.diff/

Thanks,
Bengt

>
> /Jesper
>
> Den 8/2/16 kl. 15:14, skrev Bengt Rutisson:
>>
>> Hi again,
>>
>> Based on some internal feedback from Thomas I have an updated webrev:
>> http://cr.openjdk.java.net/~brutisso/8148992/webrev.02/
>>
>> Changes are:
>>
>> - Changed the assert I added to only apply if we would actually do 
>> any root
>> region scanning
>> - Move the notification on the root region lock into a private helper 
>> method
>> that could be used in both places where this code was duplicated.
>> - Removed an obsolete comment in g1CollectedHeap.cpp
>>
>> Here's a diff compared to the first version:
>> http://cr.openjdk.java.net/~brutisso/8148992/webrev.00-02.diff/
>>
>> (The comment change is only in the full webrev I didn't get it in to 
>> the diff.)
>>
>> Thanks,
>> Bengt
>>
>>
>> On 2016-02-08 11:15, Bengt Rutisson wrote:
>>>
>>> Hi all,
>>>
>>> Could I have a couple of reviews for this change?
>>>
>>> http://cr.openjdk.java.net/~brutisso/8148992/webrev.00
>>> https://bugs.openjdk.java.net/browse/JDK-8148992
>>>
>>> There are some more details in the bug report, but here's the most 
>>> relevant text:
>>>
>>> The reason for the hang is that during shutdown we don't check the 
>>> root region
>>> scanning.
>>>
>>> The ConcurrentMark loop starts like this:
>>>
>>>   while (!_should_terminate) {
>>>     // wait until started is set.
>>>     sleepBeforeNextCycle();
>>>     if (_should_terminate) {
>>>       break;
>>>     }
>>>
>>> If _should_terminate is true we just exit without notifying any 
>>> waiters on the
>>> root region lock. If a GC happens during shutdown the GC will hang 
>>> waiting for
>>> the root region scanning to finish but the ConcurrentMark thread has 
>>> just
>>> exited and will not do any root region scanning.
>>>
>>> I can trigger this behavior by adding a sleep in the above code:
>>>
>>>   while (!_should_terminate) {
>>>     // wait until started is set.
>>>     sleepBeforeNextCycle();
>>>     if (_should_terminate) {
>>>       for (int i = 0; i < 10; i++) {
>>>         os::naked_short_sleep(999);
>>>       }
>>>       break;
>>>     }
>>>
>>> and running this small java program:
>>>
>>> import java.util.LinkedList;
>>>
>>> public class Repro2 {
>>>
>>>     public static LinkedList<byte[]> dummyStore = new LinkedList<>();
>>>
>>>     public static void main(String[] args) throws Exception {
>>>         System.out.println("Started");
>>>         for (int i = 0; i < 1024*16; i++) {
>>>             dummyStore.add(new byte[1024]);
>>>         }
>>>         System.out.println("Triggered one YC");
>>>
>>>         Thread thread = new Thread(()->System.exit(0));
>>>         thread.start();
>>>         Thread.sleep(100);
>>>
>>>         for (int i = 0; i < 1024*16; i++) {
>>>             dummyStore.add(new byte[1024]);
>>>         }
>>>         System.out.println("Triggered Initial mark");
>>>
>>>         System.gc(); // Full GG
>>>
>>>         System.out.println("Done.");
>>>     }
>>> }
>>>
>>>
>>> Running with the sleep added and the following command line:
>>>
>>> java -Xmx16m -Xmx64m -XX:InitiatingHeapOccupancyPercent=0 Repro2
>>>
>>> makes the VM hang every time on my workstation.
>>>
>>> If I add a "cancel_scan()" method and call it before the 
>>> ConcurrentMark thread
>>> is giving up, the VM does not hang anymore. That is, running with 
>>> this code
>>> makes the VM sleep a while during shutdown but it does not hang:
>>>
>>>
>>>   while (!_should_terminate) {
>>>     // wait until started is set.
>>>     sleepBeforeNextCycle();
>>>     if (_should_terminate) {
>>>       for (int i = 0; i < 10; i++) {
>>>         os::naked_short_sleep(999);
>>>       }
>>>       _cm->root_regions()->cancel_scan();
>>>       break;
>>>     }
>>