[aarch64-port-dev] Help debugging problem with large code cache
Edward Nevill
edward.nevill at gmail.com
Wed Dec 2 14:24:26 UTC 2015
Hi
I have been trying to debug a problem with large code caches on JDK 9 over the past week and could do with some help/advice on how to proceed.
I have filed a JIRA issue
https://bugs.openjdk.java.net/browse/JDK-8144498
Here is my analysis of the problem so far. Apologies if this is a bit stream of consciousness.
Running jtreg/langtools with -XX:ReservedCodeCacheSize=512m generates a number of failures due to SEGVs, whereas running without this option passes all tests.
The set of tests which fails is different each time. For example, on two back-to-back runs I get
FAILED: tools/javac/classfiles/attributes/annotations/RuntimeAnnotationsForInnerAnnotationTest.java
FAILED: tools/javac/T6410706.java
FAILED: tools/jdeps/DotFileTest.java
ed@arm64:~/jtreg/jtreg$ fgrep FAILED log_512m_2
FAILED: com/sun/javadoc/testSimpleTag/TestSimpleTag.java
FAILED: com/sun/javadoc/testWindowTitle/TestWindowTitle.java
FAILED: jdk/jshell/CompletionSuggestionTest.java
The command used to invoke jtreg was
/home/ed/images/jdk9-orig/bin/java -jar lib/jtreg.jar -vmoption:-XX:ReservedCodeCacheSize=512m -nr -conc:48 -timeout:99 -othervm -jdk:/home/ed/images/jdk9-orig -v1 -a -ignore:quiet /home/ed/new_jdk9/hs-comp/langtools/test
The problem can also be replicated with EEMBC GrinderBench, although it may require many hundreds of runs to trigger. The command I used to invoke GrinderBench is
/home/ed/images/jdk9-orig/bin/java -XX:ReservedCodeCacheSize=512m -classpath dist/fullset/bench1.jar org.eembc.grinderbench.CmdlineWrapper -r 1 -m 1 -t 4
For the purposes of the following I have chosen to investigate the GrinderBench failure because it is easier to debug than random failures in jtreg.
The SEGV occurs in a method which is called from SharedRuntime::resolve_opt_virtual_call_C. The call backtrace is about 20 frames long. The following are the oldest few frames.
....
#17 0x000003ff99717a44 in SharedRuntime::resolve_helper (thread=thread@entry=0x3ff94010000,
is_virtual=is_virtual@entry=true, is_optimized=is_optimized@entry=true,
__the_thread__=__the_thread__@entry=0x3ff94010000)
at /home/ed/new_jdk9/hs-comp/hotspot/src/share/vm/runtime/sharedRuntime.cpp:1186
#18 0x000003ff99718988 in SharedRuntime::resolve_opt_virtual_call_C (thread=0x3ff94010000)
at /home/ed/new_jdk9/hs-comp/hotspot/src/share/vm/runtime/sharedRuntime.cpp:1441
#19 0x000003ff70ab23a8 in ?? ()
#20 0x000003fdd59228f0 in ?? ()
Looking at frame #19
(gdb) x/10i $pc-20
0x3ff70ab2394: mov x0, x28
0x3ff70ab2398: mov x8, #0x8950 // #35152
0x3ff70ab239c: movk x8, #0x9971, lsl #16
0x3ff70ab23a0: movk x8, #0x3ff, lsl #32
0x3ff70ab23a4: blr x8
=> 0x3ff70ab23a8: isb
0x3ff70ab23ac: str xzr, [x28,#440]
0x3ff70ab23b0: str xzr, [x28,#448]
0x3ff70ab23b4: ldr x8, [x28,#8]
0x3ff70ab23b8: cbnz x8, 0x3ff70ab2454
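The call target here can be read straight off the mov/movk/movk immediates above; a minimal decode in plain C++ (illustration only, nothing to do with the HotSpot sources):

#include <cstdint>
#include <cstdio>

int main() {
  // Reassemble the absolute address built up by
  //   mov  x8, #0x8950            // bits 0..15
  //   movk x8, #0x9971, lsl #16   // bits 16..31
  //   movk x8, #0x3ff,  lsl #32   // bits 32..47
  uint64_t addr = 0x8950ULL | (0x9971ULL << 16) | (0x3ffULL << 32);
  printf("%#llx\n", (unsigned long long)addr);   // prints 0x3ff99718950
  return 0;
}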
This is a stub for resolve_opt_virtual_call. So here it calls 0x3ff99718950, and disassembling that address gives
(gdb) x/i 0x3ff99718950
0x3ff99718950 <SharedRuntime::resolve_opt_virtual_call_C(JavaThread*)>:
stp x29, x30, [sp,#-80]!
So it is calling SharedRuntime::resolve_opt_virtual_call_C, which is correct according to the stack trace above. However, looking at the previous frame
(gdb) x/2g $fp
0x3ff98dede60: 0x0000000000000138 0x000003ff7122469c
(gdb) x/12i 0x000003ff7122469c-40
0x3ff71224674: ret
0x3ff71224678: mov x8, #0x28f0 // #10480
0x3ff7122467c: movk x8, #0xd592, lsl #16
0x3ff71224680: movk x8, #0x3fd, lsl #32
0x3ff71224684: str x8, [sp,#8]
0x3ff71224688: mov x8, #0xffffffffffffffff // #-1
0x3ff7122468c: str x8, [sp]
0x3ff71224690: adrp x8, 0x3ff70ab2000 <<< HERE
0x3ff71224694: add x8, x8, #0x300 <<<
0x3ff71224698: blr x8 <<<
0x3ff7122469c: b 0x3ff712242f8 <<<
0x3ff712246a0: adrp x8, 0x3ff70adf000
The code marked HERE is an out-of-line stub which is calling the resolve_opt_virtual_call stub. So far so good.
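For reference, adrp/add works in 4K pages: the adrp materializes the target's page (gdb shows the resolved page, 0x3ff70ab2000) and the add supplies the low 12 bits, giving 0x3ff70ab2300, i.e. the resolve_opt_virtual_call stub from frame #19. A minimal sketch of the arithmetic (illustration only, not the HotSpot pd_patch_* code; the struct and helper names are made up):

#include <cstdint>
#include <cassert>

struct AdrpAdd {
  int64_t  page_delta;  // signed page count encoded in the adrp immediate (21 bits)
  uint32_t low12;       // low 12 bits of the target, encoded in the add immediate
};

// Encode an adrp/add pair sitting at insn_pc so that it reaches 'target'.
static AdrpAdd encode_adrp_add(uint64_t insn_pc, uint64_t target) {
  int64_t delta = (int64_t)(target >> 12) - (int64_t)(insn_pc >> 12);
  assert(delta >= -(1 << 20) && delta < (1 << 20));   // adrp reaches +/- 4GB
  return AdrpAdd{ delta, (uint32_t)(target & 0xfff) };
}

// Resolve the pair back to an absolute address, as the CPU does at run time.
static uint64_t decode_adrp_add(uint64_t insn_pc, AdrpAdd a) {
  uint64_t page = (uint64_t)((int64_t)(insn_pc >> 12) + a.page_delta) << 12;
  return page + a.low12;
}

// E.g. decode_adrp_add(0x3ff71224690, encode_adrp_add(0x3ff71224690,
// 0x3ff70ab2300)) gives back 0x3ff70ab2300 for the pair above.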
*** But this is not the correct code to call resolve_opt_virtual_call ***
This is in fact the code generated by the following from c1_CodeStubs_aarch64.cpp
void CounterOverflowStub::emit_code(LIR_Assembler* ce) {
  __ bind(_entry);
  Metadata *m = _method->as_constant_ptr()->as_metadata();
  __ mov_metadata(rscratch1, m);
  ce->store_parameter(rscratch1, 1);
  ce->store_parameter(_bci, 0);
  __ far_call(RuntimeAddress(Runtime1::entry_for(Runtime1::counter_overflow_id)));
  ce->add_call_info_here(_info);
  ce->verify_oop_map(_info);
  __ b(_continuation);
}
So this code is supposed to be calling Runtime1::counter_overflow. The -1 for the BCI is InvocationEntryBci, because this is an invocation-entry counter overflow, and it is this -1 which eventually causes the SEGV: it gets used as a genuine index into the bytecode to get a constant pool index for the invoke.
But it shouldn't be calling SharedRuntime::resolve_opt_virtual_call_C, it should be calling Runtime1::counter_overflow.
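To make the role of that -1 concrete, here is a rough illustration of the failure mode (not HotSpot code; bcp_for is a made-up helper):

// The stub stores InvocationEntryBci (-1) as the bci parameter.  If that
// value is ever treated as a real bytecode index, the computed bytecode
// pointer lands one byte before the method's bytecodes, and whatever sits
// there gets interpreted as the invoke's constant pool index.
const int InvocationEntryBci = -1;    // same value HotSpot uses for "not a real bci"
static unsigned char* bcp_for(unsigned char* code_base, int bci) {
  return code_base + bci;             // bci == -1 -> reads before the bytecodes
}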
Tracing back to where this out-of-line stub is called from
(gdb) x/10i 0x3ff712242f8-36
0x3ff712242d4: mov x0, #0xc250 // #49744
0x3ff712242d8: movk x0, #0xd592, lsl #16
0x3ff712242dc: movk x0, #0x3fd, lsl #32
0x3ff712242e0: ldr w6, [x0,#220]
0x3ff712242e4: add w6, w6, #0x8
0x3ff712242e8: str w6, [x0,#220]
0x3ff712242ec: and w6, w6, #0x1ff8
0x3ff712242f0: cmp w6, #0x0
0x3ff712242f4: b.eq 0x3ff71224678 <<<< HERE is the b to the out of line stub
0x3ff712242f8: str w5, [sp,#52]
(gdb)
So the above confirms that it is really doing a counter overflow but calling resolve_opt_virtual_call.
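For clarity, here is a rough C++ rendering of the inlined counter check above (illustration only; the +220 offset, the +8 increment and the 0x1ff8 mask are read off the disassembly, not taken from the C1 sources):

#include <cstdint>

// 'counter_cell' stands for the 32-bit word at 0x3fdd592c250 + 220.
// Returns true when the masked counter hits zero, i.e. when the b.eq to
// the out-of-line stub at 0x3ff71224678 is taken.
static bool bump_and_check_overflow(uint32_t& counter_cell) {
  counter_cell += 8;                     // ldr / add #0x8 / str
  return (counter_cell & 0x1ff8) == 0;   // and #0x1ff8 / cmp #0x0
}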
So I tried changing the 'far_call' method in macroAssembler_aarch64.cpp to use movz/movk/movk instead of adrp/add, i.e.
// We can use ADRP here because we know that the total size of
// the code cache cannot exceed 2Gb.
adrp(tmp, entry, offset);
add(tmp, tmp, offset);
becomes
// We can use ADRP here because we know that the total size of
// the code cache cannot exceed 2Gb.
movptr(tmp, (uintptr_t)entry.target());
//adrp(tmp, entry, offset);
//add(tmp, tmp, offset);
This caused GrinderBench to start working (at least, no failures after about 5000 runs).
So I changed this to read
// We can use ADRP here because we know that the total size of
// the code cache cannot exceed 2Gb.
movptr(tmp, (uintptr_t)entry.target());
adrp(tmp, entry, offset);
add(tmp, tmp, offset);
I.e. it generates both the movz/movk/movk version and the adrp/add version, but uses the adrp/add version, discarding the result of the movz/movk/movk version.
Now when I list the out-of-line stub in gdb I get
(gdb) x/10i 0x000003ff5521d5dc-32
0x3ff5521d5bc: mov x8, #0xffffffffffffffff // #-1
0x3ff5521d5c0: str x8, [sp]
0x3ff5521d5c4: mov x8, #0x9780 <<< movz/movk/movk -> 0x3ff54c89780
0x3ff5521d5c8: movk x8, #0x54c8, lsl #16
0x3ff5521d5cc: movk x8, #0x3ff, lsl #32
0x3ff5521d5d0: adrp x8, 0x3ff54ab2000 <<< adrp/add -> 0x3ff54ab2300
0x3ff5521d5d4: add x8, x8, #0x300
0x3ff5521d5d8: blr x8
0x3ff5521d5dc: b 0x3ff5521d308
0x3ff5521d5e0: mov x8, #0xfc80 // #64640
So the adrp/add and movz/movk/movk address different runtime stubs. Disassembling both shows that the adrp is addressing the resolve_opt_virtual_call stub and the movz/movk/movk is addressing the Runtime1::counter_overflow stub.
So it looks like the adrp is either not being relocated, or is being relocated incorrectly.
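Looking at it from the relocation side: when the generated code is copied to its final address, every adrp/add pair has to be re-encoded against the pc it will actually execute at. A hypothetical sketch of that invariant, reusing the encode_adrp_add/decode_adrp_add helpers from the sketch above (again, not the real pd_patch_* code):

// Re-encoding against the new pc keeps the blr pointing at 'target'.
static void relocate_adrp_add(uint64_t new_adrp_pc, uint64_t target, AdrpAdd& pair) {
  pair = encode_adrp_add(new_adrp_pc, target);
}

// If the pair is not re-encoded (or is re-encoded against the wrong pc),
// the call resolves to decode_adrp_add(new_adrp_pc, stale_pair), which is
// off by exactly (new page - old page) pages: in this case it hits the
// resolve_opt_virtual_call stub instead of Runtime1::counter_overflow.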
Any suggestions as to why it might be doing this??? I have had a long look at the pd_patch_* code and it seems correct to me.
Unfortunately it is difficult to debug because I cannot walk through it in gdb due to the infrequency of the failure (once every few hundred runs), so I can only debug as above by looking at the generated core files.
Thanks for your help,
Ed.