RFD: C2 bug: 8157306 random infrequent null pointer exceptions in javac

Wed Jun 22 18:22:32 UTC 2016

This bogus NullPointerException happens because nodes are cloned from
a block into its successors by PhaseCFG::call_catch_cleanup.  When the
clone is performed we don't do an anti-dependence check, so scheduling
is free to move a store even above a load which has an anti-dependence.

The questionable code in call_catch_cleanup is here:

  // Clone along all Catch output paths.  Clone area between the 'beg' and
  // 'end' indices.
  for( uint i = 0; i < block->_num_succs; i++ ) {
    Block *sb = block->_succs[i];
    // Clone the entire area; ignoring the edge fixup for now.
    for( uint j = end; j > beg; j-- ) {
      // It is safe here to clone a node with anti_dependence
      // since clones dominate on each path.
      Node *clone = block->get_node(j-1)->clone();
      sb->insert_node(clone, 1);
      map_node_to_block(clone, sb);
    }
  }

There is a fairly simple fix which passes smoke tests:

aph at arm64:~/jdk8u/hotspot$ hg revert src/share/vm/opto/lcm.cpp
aph at arm64:~/jdk8u/hotspot$ hg diff src/share/vm/opto/lcm.cpp

diff --git a/src/share/vm/opto/lcm.cpp b/src/share/vm/opto/lcm.cpp
--- a/src/share/vm/opto/lcm.cpp
+++ b/src/share/vm/opto/lcm.cpp
@@ -1090,11 +1090,12 @@
     Block *sb = block->_succs[i];
     // Clone the entire area; ignoring the edge fixup for now.
     for( uint j = end; j > beg; j-- ) {
-      // It is safe here to clone a node with anti_dependence
-      // since clones dominate on each path.
       Node *clone = block->get_node(j-1)->clone();
       sb->insert_node(clone, 1);
       map_node_to_block(clone, sb);
+      if (clone->needs_anti_dependence_check()) {
+        insert_anti_dependences(sb, clone);
+      }
     }
   }

And this changes the code we generate accordingly, Bad on the left,
Good on the right:

  mov   x11, xzr                    ldr   x13, [sp,#32]
  ldr   x13, [sp,#32]               ldr   w10, [x13,#12]
  lsr   x12, x13, #9                mov   x11, xzr
  str   w11, [x13,#12]              lsr   x12, x13, #9
  adrp  x14, word_map_base          str   w11, [x13,#12]
  ldr   w10, [x13,#12]              adrp  x14, word_map_base
  ldr   w11, [sp,#16]               ldr   w11, [sp,#16]
  strb  wzr, [x14,x12,lsl #0]       strb  wzr, [x14,x12,lsl #0]

The critical loads and stores are from [x13,#12], and the load must
come before the store.

The really baffling thing for me, of course, is why this hasn't been
seen before.  It looks obviously wrong but it's very old code in C2.
Could it be that this bug has been causing occasional faults for years
but no-one has tracked it down because it's so hard to reproduce?

Just to confirm this for a non-AArch64 target, I added an assertion in
PhaseCFG::call_catch_cleanup to check that precedence edges were
correct, and on x86-64 I got:

# To suppress the following error report, specify this argument
# after -XX: or in .hotspotrc:  SuppressErrorAt=/gcm.cpp:748
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/home/aph/hs-comp/hotspot/src/share/vm/opto/gcm.cpp:748), pid=14847, tid=14854
#  assert(store->find_edge(load) != -1) failed: missing precedence edge
#
# JRE version: OpenJDK Runtime Environment (9.0) (slowdebug build 9-internal+0-2016-06-22-183041.aph.hs-comp)
# Java VM: OpenJDK 64-Bit Server VM (slowdebug 9-internal+0-2016-06-22-183041.aph.hs-comp, mixed mode, tiered, compressed oops, serial gc, linux-amd64)
# Core dump will be written. Default location: /home/aph/hs-comp/make/core.14847
#
# An error report file with more information is saved as:
# /home/aph/hs-comp/make/hs_err_pid14847.log

So: IMO this is a real bug, not just on AArch64, but on x86 as well.
It may or may not actually result in bad code being generated; that
depends on the scheduling.

Andrew.