RFD: C2 bug: 8157306 random infrequent null pointer exceptions in javac
Andrew Haley
aph at redhat.com
Wed Jun 22 18:22:32 UTC 2016
This bogus NullPointerException happens because nodes are cloned from
a block into its successors by PhaseCFG::call_catch_cleanup. When the
clone is performed we don't do an anti-dependence check, so scheduling
is free to move a store even above a load which has an anti-dependence.
The questionable code in call_catch_cleanup is here:
// Clone along all Catch output paths. Clone area between the 'beg' and
// 'end' indices.
for( uint i = 0; i < block->_num_succs; i++ ) {
Block *sb = block->_succs[i];
// Clone the entire area; ignoring the edge fixup for now.
for( uint j = end; j > beg; j-- ) {
// It is safe here to clone a node with anti_dependence
// since clones dominate on each path.
Node *clone = block->get_node(j-1)->clone();
sb->insert_node(clone, 1);
map_node_to_block(clone, sb);
}
}
There is a fairly simple fix which passes smoke tests:
aph at arm64:~/jdk8u/hotspot$ hg revert src/share/vm/opto/lcm.cpp
aph at arm64:~/jdk8u/hotspot$ hg diff src/share/vm/opto/lcm.cpp
diff --git a/src/share/vm/opto/lcm.cpp b/src/share/vm/opto/lcm.cpp
--- a/src/share/vm/opto/lcm.cpp
+++ b/src/share/vm/opto/lcm.cpp
@@ -1090,11 +1090,12 @@
Block *sb = block->_succs[i];
// Clone the entire area; ignoring the edge fixup for now.
for( uint j = end; j > beg; j-- ) {
- // It is safe here to clone a node with anti_dependence
- // since clones dominate on each path.
Node *clone = block->get_node(j-1)->clone();
sb->insert_node(clone, 1);
map_node_to_block(clone, sb);
+ if (clone->needs_anti_dependence_check()) {
+ insert_anti_dependences(sb, clone);
+ }
}
}
And this changes the code we generate accordingly, Bad on the left,
Good on the right:
mov x11, xzr ldr x13, [sp,#32]
ldr x13, [sp,#32] ldr w10, [x13,#12]
lsr x12, x13, #9 mov x11, xzr
str w11, [x13,#12] lsr x12, x13, #9
adrp x14, word_map_base str w11, [x13,#12]
ldr w10, [x13,#12] adrp x14, word_map_base
ldr w11, [sp,#16] ldr w11, [sp,#16]
strb wzr, [x14,x12,lsl #0] strb wzr, [x14,x12,lsl #0]
The critical loads and stores are from [x13,#12], and the load must
come before the store.
The really baffling thing for me, of course, is why this hasn't been
seen before. It looks obviously wrong but it's very old code in C2.
Could it be that this bug has been causing occasional faults for years
but no-one has tracked it down because it's so hard to reproduce?
Just to confirm this for a non-AArch64 target, I added an assertion in
PhaseCFG::call_catch_cleanup to check that precedence edges were
correct, and on x86-64 I got:
# To suppress the following error report, specify this argument
# after -XX: or in .hotspotrc: SuppressErrorAt=/gcm.cpp:748
#
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (/home/aph/hs-comp/hotspot/src/share/vm/opto/gcm.cpp:748), pid=14847, tid=14854
# assert(store->find_edge(load) != -1) failed: missing precedence edge
#
# JRE version: OpenJDK Runtime Environment (9.0) (slowdebug build 9-internal+0-2016-06-22-183041.aph.hs-comp)
# Java VM: OpenJDK 64-Bit Server VM (slowdebug 9-internal+0-2016-06-22-183041.aph.hs-comp, mixed mode, tiered, compressed oops, serial gc, linux-amd64)
# Core dump will be written. Default location: /home/aph/hs-comp/make/core.14847
#
# An error report file with more information is saved as:
# /home/aph/hs-comp/make/hs_err_pid14847.log
So: IMO this is a real bug, not just on AArch64, but on x86 as well.
It may or may not actually result in bad code being generated; that
depends on the scheduling.
Andrew.
More information about the hotspot-dev
mailing list