[8u] JDK-8191393: Random crashes during cfree+0x1c

Wed Apr 8 15:50:53 UTC 2020

Looks good now, except one more small issue.

I forgot to say before that, given that rotate_log and write are now synchronized, the rotate_log method comment is inaccurate. It could be changed to eliminate the reference to 8191393 and read something like

    // rotate_log must be called from the VMThread at a safepoint. In case of need to do
    // gc log rotation from a thread other than the VMThread, a subtype of VM_Operation
    // should be created and be submitted to the VMThread's operation queue. DO NOT call this
    // function directly. It is safe to rotate the log at a safepoint via the VMThread because
    // no mutator threads run concurrently with the VMThread, and GC threads that run
    // concurrently with the VMThread are synchronized in write and rotate_log via _file_lock.
    // rotate_log can write log entries, so write does a recursive lock check on _file_lock.

Thanks,
Paul

On 4/7/20, 9:25 PM, "Liu, Xin" <xxinliu at amazon.com> wrote:

    Hi, Paul, Andrew and Sergey,  

    I found there're a multiple bug reports, starting from 8u65.
    https://bugs.openjdk.java.net/browse/JDK-8223603?jql=component%20%3D%20hotspot%20AND%20labels%20%3D%20webbug%20AND%20text%20~%20%22cfree%2B0x1c%22
    Except one ParallelGC,  I found a common pattern of them is +UseG1GC and -XX:+UseGCLogFileRotation

    Here is my analysis of the race condition between ConcurrentMarkThread and VMThread.  Please correct me if I am wrong. 

    In SafepointSynchronize::begin, VMThread does synchronize with SuspendibleThreadSet.
    https://hg.openjdk.java.net/jdk8u/jdk8u-dev/hotspot/file/f37c2dd33031/src/share/vm/runtime/safepoint.cpp#l183

    Here's an abstract concept called SuspendibleThreadSet(STS). Only threads who voluntarily join the set honor a safepoint synchronization.
    ConcurrentMarkThread::run() in the main body of ConcurrentMarkThread of G1. I found it  only partially takes part in this set. It joins SuspendibleThreadSet time to time.
    You can search a construct called "SuspendibleThreadSetJoiner" as indicator.  Many gclog sites fall outside of that construct. IMHO, It means that it runs out of sync with VMThread.  

    We certainly could put ConcurrentMarkThread into STS completely.  I don't think it's good idea because it will cause Marking phrases stops time to time(bad#1). Each time it yields, it might be scheduled to another processor with cold cache and affinity (bad#2).
    On the other side, I don't think logging is a performance-critical task. Developers who pursue extreme performance can opt out gclog.
    It would be nice to have a mutex to make gclog "more serial". Actually, with it,  we can guarantee there's no interleaved messages in a line. It actually improve readability as a side effect.

    The reason I need to add a flag called "_gclog_reentry" because there's a corner case. When VMThread is doing gclog rotation, it calls "write" from "rotate_log". It writes gclog header to new file. The mutex of hotspot doesn't support reentry.  That flag is for reentry avoidance.
    I took all your suggestions except one.  I found there're multiple exits in gcLogFileStream::rotate_log_impl.   It's easier to manage _gclog_reentry in a parent function. Further, it leaves a function without a mutex. It may be useful if someone can truly stop the world. 
    I also take Andrew Dinn and Sergey's advice. Thank you.  It might be a problem if we run memory leak analysis , so I delete it in the dtor. 

    Here is the new revision of webrev. Could you take a look at it? 
    https://cr.openjdk.java.net/~xliu/8191393/01/webrev/

    Because this problem only happens when gc rotation is enable. I don't use mutex protection if gc rotation is off in the new revision. I.e. It shouldn't have any overhead for normal case.  
    I passed hotspot-tier1 and have run several hours for the correctness check.

    Thanks,
    --lx

    On 4/7/20, 8:20 AM, "Hohensee, Paul" <hohensee at amazon.com> wrote:

        Hi, Xin.

        Could you explain the nature of the race condition between concurrent mark threads and the vm thread? And why _gclog_reentry is needed (i.e., how the vm thread can call write() from rotate_gc_log())?

        The comment on rotate_log() in ostream.cpp (you can see most of it in the webrev) is:

        // rotate_log must be called from VMThread at safepoint. In case need change parameters
        // for gc log rotation from thread other than VMThread, a sub type of VM_Operation
        // should be created and be submitted to VMThread's operation queue. DO NOT call this
        // function directly. Currently, it is safe to rotate log at safepoint through VMThread.
        // That is, no mutator threads and concurrent GC threads run parallel with VMThread to
        // write to gc log file at safepoint. If in future, changes made for mutator threads or
        // concurrent GC threads to run parallel with VMThread at safepoint, write and rotate_log
        // must be synchronized.

        I'd take a look at the CMS (and G1) concurrent mark code to discover why it's running during a safepoint. That might be where the problem should be fixed, rather than in the logging code.

        Assuming the fix should be in the logging code:

        In vmThread.*:

        The names of simple getters are conventionally the same as the field name. In this case, I'd use _reentry_gclog as the field name rather than _reentry_gclog_writing.

        In ostream.*:

        I don't understand the comment in gcLogFileStream::write() that says

        	// can't use Thread::current() because thread is NULL before initialization

        It'd be good for the comment to explain how it can happen that a GC log entry can be written before Thread::current() is valid.

        In rotate_log(), you can hoist the assignment to thread out of the assert and use it instead of vmthread in the locked region.

        In rotate_log(), it looks like you added rotate_gc_log_impl() because part of the race is in should_rotate(). But, you don't need to add rotate_gc_log_impl(), you can just put the new code into the existing rotate_log().

        Thanks,
        Paul

        On 4/6/20, 3:47 AM, "jdk8u-dev on behalf of Liu, Xin" <jdk8u-dev-bounces at openjdk.java.net on behalf of xxinliu at amazon.com> wrote:

            Hi,  Reviewers, 

            May I request review this bugfix? 
            Bug: https://bugs.openjdk.java.net/browse/JDK-8191393
            Webrev: https://cr.openjdk.java.net/~xliu/8191393/webrev/

            I believe that root cause  is the race condition between ConcurrentMarkThread and the VMThread.
            We can’t directly backport from upstream because upstream is based on unified logging framework. further, TIP doesn’t have reentry problem while jdk8u has. 
            Instead, this patch uses a mutex to protect FILE* fileStream::_file and a Boolean flag to avoid reentry.  

            Of course, a mutex  is not free. Therefore, I didn't use it if users don't emit log to file. 

            Test: 
            I ran the stress test over 24 hours and no problem is identified. 
            I passed tier1 test of jtreg. 

            Thanks,
            --lx