need help on bug 6894652

Coleen Phillimore - Sun Microsystems Coleen.Phillimore@Sun.COM
Mon Dec 7 07:48:15 PST 2009


Hi Yumin,

Sorry, I just saw this now that SWAN is available.  To put a webrev on 
openjdk, do this (substituting your name and webrev directory):

echo "scp -r /home/coleenp/public_html/webrev/$1 
coleenp at cr.openjdk.java.net:"
echo open webrev at http://cr.openjdk.java.net/~coleenp/$1/

The openjdk alias is:  hotspot-runtime-dev@openjdk.java.net

I was looking at the webrev again, and do you ever reset this to zero 
after the thread stack walking is done for all the threads?

+       ((NamedThread *)thread)->set_java_thread_gc(this);
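
A sketch of what I mean, reusing the names from your webrev (and 
assuming set_java_thread_gc() just stores the pointer; the oops_do 
details are abbreviated):

  void JavaThread::oops_do(OopClosure* f) {
    Thread* thread = Thread::current();
    if (thread->is_Named_thread()) {
      // record which JavaThread this GC/VM worker is scanning
      ((NamedThread*)thread)->set_java_thread_gc(this);
    }
    // ... walk this thread's frames and hand each oop to f ...
    if (thread->is_Named_thread()) {
      // reset it so a later crash doesn't report a stale JavaThread
      ((NamedThread*)thread)->set_java_thread_gc(NULL);
    }
  }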


Thanks,
Coleen

On 12/04/09 20:45, Yumin Qi wrote:
> Hi, Coleen
>
>  I have prepared the webrev according to your suggestions, but I am 
> not sure which one is the open alias; could you send me the right 
> alias name?
>
>
> Thanks
> Yumin
>
> On 12/03/09 16:31, Coleen Phillimore wrote:
>>
>> Hi Yumin,
>> Sorry, this got flooded in my mailbox yesterday with Ramki's Intel bug 
>> mail and others, so I didn't see it.
>> This change looks fantastic!  I wouldn't have expected a performance 
>> degradation, but it's great that you checked.  Also, I like the idea 
>> of having a -XX:+CrashGCForDumpingJavaThread flag for verifying that 
>> the code doesn't bit-rot in between customers hitting it, but it'd 
>> still require someone to run a test with the flag, which probably 
>> won't happen too often.
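>>
>> If we did keep it, I'd imagine it as a develop-only flag in 
>> globals.hpp (a sketch, so it never ships in product builds):
>>
>>   develop(bool, CrashGCForDumpingJavaThread, false,
>>           "Manually make GC thread crash then dump java stack trace; "
>>           "Test only")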
>>
>> I think this should be backported to older versions of the jvm too.
>>
>> When you send it out for review, send it to the open alias.  They 
>> will like this code too.
>>
>> Thanks!!
>> Coleen
>>
>> Yumin Qi wrote:
>>> Coleen,
>>>
>>>  Please have a look at the code change via link:
>>>  http://javaweb.sfbay/~yq123930/webrev/6361589
>>>
>>>  I have run a test using volano to check whether there is any 
>>> performance degradation. The result looks OK:
>>>
>>>  result_sparc_fcs:   /java/re/jdk/7/promoted/latest/binaries/solaris-sparc
>>>  result_sparc_new:   hg.openjdk.java.net/jdk7/hotspot-rt/hotspot + this change.
>>>
>>> $ sh compare volano_result_sparc_new volano_result_sparc_fcs
>>> ============================================================================
>>> volano_result_sparc_new
>>>  Benchmark         Samples        Mean     Stdev   Geomean   Weight
>>>  volano                 20    43971.86   6081.44
>>> ============================================================================
>>> volano_result_sparc_fcs
>>>  Benchmark         Samples        Mean     Stdev   %Diff       P   Significant
>>>  volano                 20    46286.67   7288.64    5.26   0.283             *
>>> ============================================================================
>>>
>>>  * - Not Significant: The %Diff for the mean could be noise. If there is
>>>      an actual difference, more samples would be needed to detect it.
>>>
>>>  Also, I inserted some code to trigger a SEGV manually in 
>>> frame::oops_do to verify the output. The output was as expected; 
>>> after verification, the inserted code was removed. If you think this 
>>> version is okay, I will submit a codereview for integration.
>>>
>>>  (To verify this code in the future, the SEGV-ing code could be 
>>> wrapped in a debug flag such as -XX:+CrashGCForDumpingJavaThread: in 
>>> a debug build, when the flag is set, trigger a SEGV in frame::oops_do 
>>> so the output is handled in VMError::report(). But I don't think we 
>>> need this extra code to verify it.)
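>>>
>>>  (If we did keep it, the flag-guarded code could be as small as this 
>>> sketch in frame::oops_do:
>>>
>>>   if (CrashGCForDumpingJavaThread) {
>>>     char* t = NULL;
>>>     *t = 'c';   // write through NULL to force a SEGV while scanning
>>>   }
>>>  )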
>>>
>>>
>>> Thanks
>>> Yumin
>>>
>>> Coleen Phillimore wrote:
>>>>
>>>> Hi Yumin,
>>>>
>>>> I was thinking of exactly the second way, but I couldn't figure out 
>>>> where to put the extra data member that points to the thread being 
>>>> walked.  I didn't look at it for very long, though.  It sounds like 
>>>> you have figured that out.  I think the extra data member, and even 
>>>> setting it, won't matter performance-wise, and the GC people would 
>>>> so appreciate the extra information that it would be worth it.
>>>> The first way seems like it might be more code and more likely to 
>>>> crash during error reporting, which wouldn't be good, so I 
>>>> completely agree with your idea.
>>>> This is great- thanks for taking this!!
>>>>
>>>> Coleen
>>>>
>>>> Yumin Qi wrote:
>>>>> Coleen,
>>>>>
>>>>>  I checked the code and think there are two ways to output the 
>>>>> related JavaThread information (if one is involved when a GC 
>>>>> thread crashes).
>>>>>  When the process crashes, we call report in VMError, but the 
>>>>> failed GC thread has no record of the JavaThread, so we have to 
>>>>> find out whether a JavaThread is involved.
>>>>>
>>>>>  1) When printing out the current (failed) thread's stack, check 
>>>>> the data in the current frame; if a value belongs to the stack of 
>>>>> a JavaThread, we know it "could" be that JavaThread. We need to 
>>>>> check every value in this frame against the stack limits of all 
>>>>> JavaThreads, so it costs a lot of time to iterate over all the 
>>>>> data in one frame, and then over all the frames. If found, keep 
>>>>> this JavaThread, then do the next (see the sketch below).
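>>>>>
>>>>>  (A rough sketch of that scan; "fr" is the frame being examined, 
>>>>> and the loop bounds are illustrative:
>>>>>
>>>>>   for (intptr_t* p = fr->sp(); p < fr->fp(); p++) {
>>>>>     address v = (address)*p;
>>>>>     for (JavaThread* jt = Threads::first(); jt != NULL; jt = jt->next()) {
>>>>>       // stacks grow downward: stack_base() is the high end
>>>>>       if (v <= jt->stack_base() && v > jt->stack_base() - jt->stack_size()) {
>>>>>         // v points into jt's stack, so jt is a candidate
>>>>>       }
>>>>>     }
>>>>>   }
>>>>>  )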
>>>>>
>>>>>  2) JavaThread::oops_do: this function is called by GC threads, 
>>>>> the VM thread, and ConcurrentGCThreads. We can add an extra data 
>>>>> member to Thread (or VMThread, GCTaskThread, GangWorker, 
>>>>> ConcurrentGCThread) to record the JavaThread. After a crash we 
>>>>> will be in the signal handler, which calls VMError::report(...). 
>>>>> If the failed thread is one of the above threads, check its 
>>>>> JavaThread data member; if it is a valid one, proceed (see the 
>>>>> sketch below).
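>>>>>
>>>>>  (A sketch of the extra member; since the thread types above all 
>>>>> derive from NamedThread, that would be one place to put it:
>>>>>
>>>>>   class NamedThread : public Thread {
>>>>>     // ... existing members elided ...
>>>>>     JavaThread* _java_thread_gc;   // JavaThread being scanned, or NULL
>>>>>    public:
>>>>>     void set_java_thread_gc(JavaThread* jt) { _java_thread_gc = jt; }
>>>>>     JavaThread* java_thread_gc() const      { return _java_thread_gc; }
>>>>>   };
>>>>>  )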
>>>>>
>>>>>  Once we have the JavaThread, from either of the above two we can 
>>>>> print out its stack, java methods, and other information.
>>>>>
>>>>>  I prefer the second way; it is simple, but it needs the extra 
>>>>> data member, which may not make others happy. Also, in GC, there 
>>>>> is extra code to record its value. I can test whether there is 
>>>>> much degradation between no-change and change.
>>>>>
>>>>>  The first way costs much more time (though that may not matter: 
>>>>> the process has already died, so who cares about taking more time 
>>>>> to print out more information?), since we need to find whether a 
>>>>> value lies in the valid stack frames of some JavaThread. But if 
>>>>> there are thousands of threads, it will take too much time.
>>>>>
>>>>>  Do you have a different idea of how to implement this 
>>>>> functionality? Please let me know.
>>>>>
>>>>> Thanks
>>>>> Yumin
>>>>>
>>>>>
>>>>> Coleen Phillimore - Sun Microsystems wrote:
>>>>>>
>>>>>> The RFE is 6361589.  You'd be a good person to add this support 
>>>>>> because you do this all the time in the debugger!
>>>>>>
>>>>>> Thanks!!
>>>>>> Coleen
>>>>>>
>>>>>> On 11/12/09 12:44, Yumin Qi wrote:
>>>>>>> Coleen,
>>>>>>>
>>>>>>>  Do you have the bug id for the RFE? I would like to work on it.
>>>>>>>  BTW, the customer got fastdebug to run; hopefully we can get a 
>>>>>>> closer look at the failure, if any.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Yumin
>>>>>>>
>>>>>>> Coleen Phillimore wrote:
>>>>>>>> It would be good to know the bytecodes of the method at bci 39 
>>>>>>>> and see if local (or stack)[5] is supposed to be an oop.
>>>>>>>> CBOE is usually good about running the fastdebug vm; can they get 
>>>>>>>> this to crash with that?    We'd have more information.
>>>>>>>> I don't know the answer to your last question about the frame 
>>>>>>>> containing the pc not being a real frame.
>>>>>>>> I have an RFE to print out, in the hs_err file, the thread stack 
>>>>>>>> of the thread that GC is working on when it crashes.  That would 
>>>>>>>> have saved a lot of debugging (and we'd know the method to run 
>>>>>>>> javap on).  It was lowered in priority because we hadn't seen 
>>>>>>>> this sort of crash for a while.  Yumin, since you do this 
>>>>>>>> manually all the time, maybe you can add code to do this.
>>>>>>>> Also, you might be able to manually figure out the method from 
>>>>>>>> the address; then we can have CBOE run javap on it and see the 
>>>>>>>> bytecodes around 39.  Otherwise, see if they can reproduce this 
>>>>>>>> with fastdebug.
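>>>>>>>>
>>>>>>>> (For example, once the class is known, something like:
>>>>>>>>
>>>>>>>>   javap -c -private TheClass
>>>>>>>>
>>>>>>>> where "TheClass" is whatever method holder we recover from the 
>>>>>>>> address; then read off what local/stack slot 5 should hold 
>>>>>>>> around bci 39.)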
>>>>>>>>
>>>>>>>> thanks,
>>>>>>>> Coleen
>>>>>>>>
>>>>>>>> Yumin Qi wrote:
>>>>>>>>> Hi, Tom and Coleen
>>>>>>>>>
>>>>>>>>>  I am working on a bug filed by CBOE; it runs on solaris-i586 
>>>>>>>>> (32-bit) on Solaris 10 with JDK 6u04.
>>>>>>>>>
>>>>>>>>>  The location of the core file is: 
>>>>>>>>> /net/cores/cores/dir23/71840156/102109_case1
>>>>>>>>>  (use dbx -s dbxrc to read the core file)
>>>>>>>>>  When it crashed, the GC thread was iterating over a 
>>>>>>>>> JavaThread, and an oop on the stack is not a valid oop.
>>>>>>>>> The GC thread is t@6:
>>>>>>>>> current thread: t@6
>>>>>>>>>  [1] __lwp_kill(0x6, 0x6), at 0xfef254d7
>>>>>>>>>  [2] _thr_kill(0x6, 0x6), at 0xfef22c86
>>>>>>>>>  [3] raise(0x6), at 0xfeed13f3
>>>>>>>>>  [4] abort(0xfecb8000, 0xfe24a4f8, 0xfe24a4f8, 0xfeb7d838, 0x1, 0x807f9d0), at 0xfeeb1709
>>>>>>>>>  [5] os::abort(0x1), at 0xfead7440
>>>>>>>>>  [6] VMError::report_and_die(0xfe24a690), at 0xfeb7d838
>>>>>>>>>  [7] JVM_handle_solaris_signal(0xb, 0xfe24a9b4, 0xfe24a7b4, 0x1), at 0xfe71bf0b
>>>>>>>>>  [8] signalHandler(0xb, 0xfe24a9b4, 0xfe24a7b4), at 0xfe71b7c2
>>>>>>>>>  [9] __sighndlr(0xb, 0xfe24a9b4, 0xfe24a7b4, 0xfe71b79c), at 0xfef2489f
>>>>>>>>>  [10] call_user_handler(0xb, 0xfe24a9b4, 0xfe24a7b4), at 0xfef1a75e
>>>>>>>>>  [11] sigacthandler(0xb, 0xfe24a9b4, 0xfe24a7b4, 0xf, 0x0, 0x4), at 0xfef1a8de
>>>>>>>>>  ---- called from signal handler with signal 11 (SIGSEGV) ------
>>>>>>>>>  [12] ParNewGeneration::copy_to_survivor_space_avoiding_promotion_undo(0x8081b70, 0xe322738, 0x8d8505f8, 0x3b88c022, 0x0), at 0xfeae2862
>>>>>>>>>  [13] ParScanClosure::do_oop_work(0xe3227ec, 0x6dbce8dc, 0x0, 0x1), at 0xfeae3454
>>>>>>>>>  [14] ParRootScanWithoutBarrierClosure::do_oop(0xe3227ec, 0x6dbce8dc), at 0xfeae3343
>>>>>>>>>  [15] InterpreterFrameClosure::offset_do(0xfe24ab74, 0x5), at 0xfe6cb893
>>>>>>>>>  [16] InterpreterOopMap::iterate_oop(0xfe24ab88, 0xfe24ab74), at 0xfe6db740
>>>>>>>>> =>[17] frame::oops_interpreted_do(0xfe24ac34, 0xe3227ec, 0xfe24ac4c, 0x1), at 0xfe94e11f
>>>>>>>>>  [18] frame::oops_do_internal(0xfe24ac34, 0xe3227ec, 0xfe24ac4c, 0x1), at 0xfe94e77a
>>>>>>>>>  [19] JavaThread::oops_do(0x9dc1e20, 0xe3227ec), at 0xfe6e4a03
>>>>>>>>>  [20] Threads::possibly_parallel_oops_do(0xe3227ec), at 0xfeb4758f
>>>>>>>>>  [21] SharedHeap::process_strong_roots(0x807cfe8, 0x0, 0x1, 0xe3227ec, 0xe322810), at 0xfeb0a85f
>>>>>>>>>  [22] GenCollectedHeap::gen_process_strong_roots(0x807cfe8, 0x0, 0x1, 0x0, 0x1, 0xe322810, 0xe3227ec), at 0xfe954a49
>>>>>>>>>  [23] ParNewGenTask::work(0x759feb18, 0x4), at 0xfeae1a63
>>>>>>>>>  [24] GangWorker::loop(0x807f9d0), at 0xfeb82272
>>>>>>>>>  [25] GangWorker::run(0x807f9d0), at 0xfeb8215b
>>>>>>>>>  [26] java_start(0x807f9d0), at 0xfead6d62
>>>>>>>>>  [27] _thr_setup(0xfe4b2200), at 0xfef244b7
>>>>>>>>>  [28] _lwp_start(0x6, 0x6, 0xfef4e000, 0xfe24a3fc, 0xfeed13f3, 0x6), at 0xfef247a0
>>>>>>>>>
>>>>>>>>> The JavaThread it was iterating is t@18098
>>>>>>>>> frame info:
>>>>>>>>> *((class frame *) 0xfe24ac34U) = {
>>>>>>>>>    frame::_sp            = 0x6dbce8d8
>>>>>>>>>    frame::_pc            = 0xfb2126ed "\xe9L"
>>>>>>>>>    frame::_cb            = 0xfb202508
>>>>>>>>>    frame::_deopt_state   = not_deoptimized
>>>>>>>>>    frame::_fp            = 0x6dbce908
>>>>>>>>>    frame::_unextended_sp = 0x6dbce8d8
>>>>>>>>>
>>>>>>>>> *(InterpreterOopMap*)0xfe24ab88
>>>>>>>>> *((class InterpreterOopMap *) 0xfe24ab88U) = {
>>>>>>>>>    InterpreterOopMap::_method                = 0xf468c368
>>>>>>>>>    InterpreterOopMap::_bci                   = 39U
>>>>>>>>>    InterpreterOopMap::_mask_size             = 6
>>>>>>>>>    InterpreterOopMap::_expression_stack_size = 1
>>>>>>>>>    InterpreterOopMap::_bit_mask              = (47, 0)
>>>>>>>>>
>>>>>>>>> The offset in the interpreter frame is 5.
>>>>>>>>>
>>>>>>>>> JavaThread is t@18098:
>>>>>>>>> current thread: t@18098
>>>>>>>>>  [1] ___lwp_mutex_timedlock(0x80790d8, 0x0), at 0xfef25547
>>>>>>>>>  [2] _private_lwp_mutex_lock(0x80790d8), at 0xfef1183c
>>>>>>>>>  [3] Mutex::lock_without_safepoint_check(0x8079080), at 0xfe6c22d3
>>>>>>>>>  [4] SafepointSynchronize::block(0x9dc1e20), at 0xfe7081c1
>>>>>>>>>  [5] InterpreterRuntime::monitorexit(0x9dc1e20, 0x6dbce8e0, 0xfb2126ed, 0xa8f05060, 0x8d8505f8, 0x3), at 0xfe6f8164
>>>>>>>>> =>[6] 0xfb212713(0x0, 0xa8f05060, 0xceea7ad8, 0xaa3a70e8, 0x90f36230, 0x9dc1e20), at 0xfb212713
>>>>>>>>>  [7] 0xfb20028f(0x6dbce970, 0x6dbcebc8, 0xa, 0xf468c368, 0xfb208d00, 0x6dbcead8, 0x1, 0x9dc1e20), at 0xfb20028f
>>>>>>>>>  [8] JavaCalls::call_helper(0x6dbcebc4, 0x6dbcea48, 0x6dbcead4, 0x9dc1e20), at 0xfe6f476b
>>>>>>>>>  [9] os::os_exception_wrapper(0xfe6f45c8, 0x6dbcebc4, 0x6dbcea48, 0x6dbcead4, 0x9dc1e20), at 0xfe6f4a13
>>>>>>>>>  [10] JavaCalls::call(0x6dbcebc4, 0x960eb48, 0x6dbcead4, 0x9dc1e20), at 0xfe6f4a4b
>>>>>>>>>  [11] JavaCalls::call_virtual(0x6dbcebc4, 0x960eb3c, 0xfed0daf4, 0xfed0dd58, 0x6dbcead4, 0x9dc1e20), at 0xfe76b069
>>>>>>>>>  [12] JavaCalls::call_virtual(0x6dbcebc4, 0x960eb38, 0x960eb3c, 0xfed0daf4, 0xfed0dd58, 0x9dc1e20), at 0xfe773532
>>>>>>>>>  [13] thread_entry(0x9dc1e20, 0x9dc1e20), at 0xfe784ca6
>>>>>>>>>  [14] JavaThread::thread_main_inner(0x9dc1e20), at 0xfe7811d8
>>>>>>>>>  [15] JavaThread::run(0x9dc1e20), at 0xfe781182
>>>>>>>>>  [16] java_start(0x9dc1e20), at 0xfead6d62
>>>>>>>>>  [17] _thr_setup(0x6e560a00), at 0xfef244b7
>>>>>>>>>  [18] _lwp_start(0x80790d8, 0x0, 0xfecb8000, 0x6dbce858, 0xfe6c22d3, 0x80790d8), at 0xfef247a0
>>>>>>>>>
>>>>>>>>> Frame 6 is the java frame which was being checked by t@6; the frame:
>>>>>>>>> 0xfb2126d3:     testl    %ebx,%ebx
>>>>>>>>> 0xfb2126d5:     je       0xfb21273e     [ 0xfb21273e, .+0x69 ]
>>>>>>>>> 0xfb2126db:     lock cmpxchgl  %ebx,(%ecx)
>>>>>>>>> 0xfb2126df:     je       0xfb21273e     [ 0xfb21273e, .+0x5f ]
>>>>>>>>> 0xfb2126e5:     movl     %ecx,0x00000004(%edx)
>>>>>>>>> 0xfb2126e8:     call     0xfb2126f2     [ 0xfb2126f2, .+0xa ]    ///// this call targets an instruction in this same frame
>>>>>>>>> 0xfb2126ed:     jmp      0xfb21273e     [ 0xfb21273e, .+0x51 ]   ///// the pc in the frame: the return address pushed by the call at 0xfb2126e8
>>>>>>>>> 0xfb2126f2:     pushl    %edx                                    ///// this is the called address
>>>>>>>>> 0xfb2126f3:     leal     0x00000008(%esp),%eax
>>>>>>>>> 0xfb2126f7:     movl     %esi,0xffffffe4(%ebp)
>>>>>>>>> 0xfb2126fa:     movl     %gs:0x000000e0 [ 0xe0 ],%edi
>>>>>>>>> 0xfb212701:     pushl    %edi
>>>>>>>>> 0xfb212702:     movl     %ebp,0x000000ec(%edi)
>>>>>>>>> 0xfb212708:     movl     %eax,0x000000e4(%edi)
>>>>>>>>> 0xfb21270e:     call     monitorexit    [ 0xfe6f802c, .+0x34e591e ]    //// next frame
>>>>>>>>> 0xfb212713:     movl     $0x00000000,0x000000e4(%edi)            //// the real frame pc is here!!!!!
>>>>>>>>> 0xfb21271d:     movl     $0x00000000,0x000000ec(%edi)
>>>>>>>>> 0xfb212727:     addl     $0x00000008,%esp
>>>>>>>>> 0xfb21272a:     cmpl     $0x00000000,0x00000004(%edi)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> There is no deoptimization involved; the pc in the thread's 
>>>>>>>>> JavaFrameAnchor is null, so the pc was calculated as sp[-1], 
>>>>>>>>> which is the value we see here:
>>>>>>>>> 0x6dbce8d4:      0xfb2126ed
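>>>>>>>>>
>>>>>>>>> (That is, roughly this recovery, since last_Java_pc was not set:
>>>>>>>>>
>>>>>>>>>   address pc = (address) sp[-1];  // word just below sp: the return
>>>>>>>>>                                   // address pushed by the call
>>>>>>>>> )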
>>>>>>>>>
>>>>>>>>> But I think the frame containing this pc is not a real frame 
>>>>>>>>> (maybe I am wrong?); the code just calls address 0xfb2126f2, 
>>>>>>>>> which pushes the address of the following instruction onto the 
>>>>>>>>> stack. I don't know how the frame construction takes this as a 
>>>>>>>>> frame. If it is not a real frame, the OopMap calculation based 
>>>>>>>>> on it could be wrong, and the value in the stack slot could be 
>>>>>>>>> invalid. If there is no problem with the frame info here, then 
>>>>>>>>> the value on the stack may be left over from the last GC.
>>>>>>>>>
>>>>>>>>> I searched for the invalid oop 0x8d8505f8 in the java heap; it 
>>>>>>>>> does not exist.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Yumin
>>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>
>

