RFD: C2 bug: 8157306 random infrequent null pointer exceptions in javac

Wed Jun 22 21:13:40 UTC 2016

Thank you, Andrew, for finding the cause of this problem.
Your fix looks good to me. Can you also add changes for the assert you mentioned - I think it is useful.
And, please, prepare webrev.

May be scheduling changes in LCM (JDK-8134802) cause this happen more frequently.
Before the cloning was done at the beginning of block dominating stores in the block and original scheduling did not 
move it.
Also rarity could be because splitting through call projections are rare and you also need to have a store in the cloned 
block.

Thanks,
Vladimir

On 6/22/16 11:22 AM, Andrew Haley wrote:
> This bogus NullPointerException happens because nodes are cloned from
> a block into its successors by PhaseCFG::call_catch_cleanup.  When the
> clone is performed we don't do an anti-dependence check, so scheduling
> is free to move a store even above a load which has an anti-dependence.
>
> The questionable code in call_catch_cleanup is here:
>
>   // Clone along all Catch output paths.  Clone area between the 'beg' and
>   // 'end' indices.
>   for( uint i = 0; i < block->_num_succs; i++ ) {
>     Block *sb = block->_succs[i];
>     // Clone the entire area; ignoring the edge fixup for now.
>     for( uint j = end; j > beg; j-- ) {
>       // It is safe here to clone a node with anti_dependence
>       // since clones dominate on each path.
>       Node *clone = block->get_node(j-1)->clone();
>       sb->insert_node(clone, 1);
>       map_node_to_block(clone, sb);
>     }
>   }
>
> There is a fairly simple fix which passes smoke tests:
>
> aph at arm64:~/jdk8u/hotspot$ hg revert src/share/vm/opto/lcm.cpp
> aph at arm64:~/jdk8u/hotspot$ hg diff src/share/vm/opto/lcm.cpp
> diff --git a/src/share/vm/opto/lcm.cpp b/src/share/vm/opto/lcm.cpp
> --- a/src/share/vm/opto/lcm.cpp
> +++ b/src/share/vm/opto/lcm.cpp
> @@ -1090,11 +1090,12 @@
>      Block *sb = block->_succs[i];
>      // Clone the entire area; ignoring the edge fixup for now.
>      for( uint j = end; j > beg; j-- ) {
> -      // It is safe here to clone a node with anti_dependence
> -      // since clones dominate on each path.
>        Node *clone = block->get_node(j-1)->clone();
>        sb->insert_node(clone, 1);
>        map_node_to_block(clone, sb);
> +      if (clone->needs_anti_dependence_check()) {
> +        insert_anti_dependences(sb, clone);
> +      }
>      }
>    }
>
> And this changes the code we generate accordingly, Bad on the left,
> Good on the right:
>
>   mov   x11, xzr                    ldr   x13, [sp,#32]
>   ldr   x13, [sp,#32]               ldr   w10, [x13,#12]
>   lsr   x12, x13, #9                mov   x11, xzr
>   str   w11, [x13,#12]              lsr   x12, x13, #9
>   adrp  x14, word_map_base          str   w11, [x13,#12]
>   ldr   w10, [x13,#12]              adrp  x14, word_map_base
>   ldr   w11, [sp,#16]               ldr   w11, [sp,#16]
>   strb  wzr, [x14,x12,lsl #0]       strb  wzr, [x14,x12,lsl #0]
>
> The critical loads and stores are from [x13,#12], and the load must
> come before the store.
>
> The really baffling thing for me, of course, is why this hasn't been
> seen before.  It looks obviously wrong but it's very old code in C2.
> Could it be that this bug has been causing occasional faults for years
> but no-one has tracked it down because it's so hard to reproduce?
>
> Just to confirm this for a non-AArch64 target, I added an assertion in
> PhaseCFG::call_catch_cleanup to check that precedence edges were
> correct, and on x86-64 I got:
>
> # To suppress the following error report, specify this argument
> # after -XX: or in .hotspotrc:  SuppressErrorAt=/gcm.cpp:748
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  Internal Error (/home/aph/hs-comp/hotspot/src/share/vm/opto/gcm.cpp:748), pid=14847, tid=14854
> #  assert(store->find_edge(load) != -1) failed: missing precedence edge
> #
> # JRE version: OpenJDK Runtime Environment (9.0) (slowdebug build 9-internal+0-2016-06-22-183041.aph.hs-comp)
> # Java VM: OpenJDK 64-Bit Server VM (slowdebug 9-internal+0-2016-06-22-183041.aph.hs-comp, mixed mode, tiered, compressed oops, serial gc, linux-amd64)
> # Core dump will be written. Default location: /home/aph/hs-comp/make/core.14847
> #
> # An error report file with more information is saved as:
> # /home/aph/hs-comp/make/hs_err_pid14847.log
>
> So: IMO this is a real bug, not just on AArch64, but on x86 as well.
> It may or may not actually result in bad code being generated; that
> depends on the scheduling.
>
> Andrew.
>