Possible bug in nmethod reloc verification/scanning in G1

Fri Nov 22 00:35:57 UTC 2013

Christos,

This must be a Graal-emitted method that fails. C1 & C2 take special care to emit enough bytes in the prologue to make space for the jump instruction of the patch. Graal should do the same.

The method in the example:

  static final Object someobject = new Integer(0xdeadbabe);
  final static Object getSomeObject() {
    return someobject;
  }

Looks like this with C2:
  0x000000011185d440: sub $0x18,%rsp
  0x000000011185d447: mov %rbp,0x10(%rsp) ;*synchronization entry
                                                ; - X::getSomeObject at -1 (line 6)

  0x000000011185d44c: movabs $0x74006b240,%rax ; {oop(a 'java/lang/Integer' = -559039810)}
  0x000000011185d456: add $0x10,%rsp
  0x000000011185d45a: pop %rbp
  0x000000011185d45b: test %eax,-0x246a461(%rip) # 0x000000010f3f3000
                                                ; {poll_return}
  0x000000011185d461: retq

Notice the long form of the sub instruction that allocates the frame. That is on purpose, to make sure that the jump will fit. See MacroAssembler::verified_entry() in macroAssembler_x86.cpp.
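
To make the size argument concrete, here is a minimal C++ sketch (not HotSpot code) that checks whether a patch of a given size, written at the verified entry point, would overwrite an embedded oop immediate. The Insn struct and the function names are hypothetical. The C2 layout is read off the listing above; the second layout, with the movabs as the first instruction, is only an assumed stand-in for the failing method.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model of one instruction at the start of a method
// (not a HotSpot type).
struct Insn {
    std::size_t offset;      // byte offset from the verified entry point
    std::size_t length;      // encoded length in bytes
    bool        has_oop_imm; // true if the instruction embeds an oop immediate
    std::size_t imm_offset;  // offset of the immediate within the instruction
};

// Returns true if patching the bytes [0, patch_size) at the verified entry
// point would overwrite any part of an embedded oop immediate.
bool patch_clobbers_oop(const std::vector<Insn>& code, std::size_t patch_size) {
    for (const Insn& i : code) {
        if (!i.has_oop_imm) continue;
        std::size_t imm_begin = i.offset + i.imm_offset;
        if (imm_begin < patch_size) return true; // immediate starts inside the patch
    }
    return false;
}

// Layout read off the C2 listing: 7-byte sub, 5-byte mov, then the movabs.
std::vector<Insn> c2_layout() {
    return {
        {0,  7,  false, 0},  // sub $0x18,%rsp (long form)
        {7,  5,  false, 0},  // mov %rbp,0x10(%rsp)
        {12, 10, true,  2},  // movabs $oop,%rax (immediate after 2 opcode bytes)
    };
}

// Assumed layout for the failing case: the movabs comes first.
std::vector<Insn> oop_first_layout() {
    return {
        {0, 10, true, 2},    // movabs $oop,%rax
    };
}
```

With a 5-byte patch, patch_clobbers_oop(c2_layout(), 5) is false because the 7-byte sub absorbs the whole jump, while patch_clobbers_oop(oop_first_layout(), 5) is true.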

C1, in turn, either generates a stack bang (always!):
  0x0000000110fa5e20: mov %eax,-0x16000(%rsp)
  0x0000000110fa5e27: push %rbp
  0x0000000110fa5e28: sub $0x30,%rsp ;*getstatic someobject
                                                ; - X::getSomeObject at 0 (line 6)

  ;; block B0 [0, 3]

  0x0000000110fa5e2c: movabs $0x74006b428,%rax ; {oop(a 'java/lang/Integer' = -559039810)}
  0x0000000110fa5e36: add $0x30,%rsp
  0x0000000110fa5e3a: pop %rbp
  0x0000000110fa5e3b: test %eax,-0x262fd41(%rip) # 0x000000010e976100
                                                ; {poll_return}
  0x0000000110fa5e41: retq

Or, if you turn stack bangs off, it emits a 5-byte nop (to accommodate the jump):

  0x00000001080a1f20: nopl 0x0(%rax,%rax,1)
  0x00000001080a1f25: push %rbp
  0x00000001080a1f26: sub $0x30,%rsp ;*getstatic someobject
                                                ; - X::getSomeObject at 0 (line 6)

  ;; block B0 [0, 3]

  0x00000001080a1f2a: movabs $0x74006b428,%rax ; {oop(a 'java/lang/Integer' = -559039810)}
  0x00000001080a1f34: add $0x30,%rsp
  0x00000001080a1f38: pop %rbp
  0x00000001080a1f39: test %eax,-0x24dde3f(%rip) # 0x0000000105bc4100
                                                ; {poll_return}
  0x00000001080a1f3f: retq

igor

On Nov 21, 2013, at 12:46 PM, Christos Kotselidis <christos.kotselidis at oracle.com> wrote:

> Thanks for the info. Yes, I initially cc'ed the compiler team, but I guess
> the original email is still pending verification.
> 
> Regards
> 
> Christos
> 
> On 11/21/13 7:31 PM, "Vladimir Kozlov" <vladimir.kozlov at oracle.com> wrote:
> 
>> CC to compiler because it is about compiled nmethods and their state
>> change. So it is interesting for us.
>> 
>> Note, usually such methods are inlined (accessor methods) and big
>> methods have stack bang code at the beginning. Here is the criterion used to
>> generate a stack bang:
>> 
>> bool Compile::need_stack_bang(int frame_size_in_bytes) const {
>>  // Determine if we need to generate a stack overflow check.
>>  // Do it if the method is not a stub function and
>>  // has java calls or has frame size > vm_page_size/8.
>>  return (UseStackBanging && stub_function() == NULL &&
>>          (has_java_calls() || frame_size_in_bytes > os::vm_page_size()>>3));
>> }
>> 
>> In your case the stack bang is not generated, which is why you have an
>> embedded oop at the beginning of the code.
>> We can mark such nmethods so that they are easier to identify when
>> we need to unregister them.
>> 
>> And thank you for catching both problems.
>> 
>> Thanks,
>> Vladimir
>> 
>> On 11/21/13 5:54 AM, Thomas Schatzl wrote:
>>> Hi,
>>> 
>>> On Thu, 2013-11-21 at 14:42 +0100, Christos Kotselidis wrote:
>>>> On 11/21/13 2:34 PM, "Thomas Schatzl" <thomas.schatzl at oracle.com>
>>>> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> On Thu, 2013-11-21 at 12:40 +0100, Christos Kotselidis wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> I came across the following scenario, which I already discussed with
>>>>>> Thomas Schatzl, and would like to share it with you.
>>>>>> There is a method:
>>>>>> 
>>>>>> private static final HotSpotGraalRuntime instance = new HotSpotGraalRuntime();
>>>>>> 
>>>>>> public static HotSpotGraalRuntime runtime() {
>>>>>>         return instance;
>>>>>> }
>>>>> 
>>>>>> The instance field is a constant oop in the code and the method is
>>>>>> installed.
>>>>>> At some stage there is a GC with post verification
>>>>>> (including G1VerifyHeapRegionCodeRoots) that passes.
>>>>>> Before the next GC cycle and during the mutation cycle, another method
>>>>>> is installed, which causes the eviction of our method.
>>>>>> Consequently our method is made not_entrant and is patched with a jmp
>>>>>> instruction at its verified entry point
>>>>>> (nmethod.cpp::make_not_entrant_or_zombie). The patching causes the oop
>>>>>> to be overwritten by the jmp instruction. Furthermore, that oop was
>>>>>> the only oop responsible for attaching this method to its
>>>>>> corresponding HeapRegion.
>>>>>> 
>>>>>> The next GC cycle (pre-verification) comes and the HeapRegionCodeRoot
>>>>>> verification starts (nmethod.cpp::oops_do). There is logic there
>>>>>> (which I believe may also be buggy, as I will explain later) that
>>>>>> checks whether the method is not_entrant. If the method is not_entrant,
>>>>>> the logic correctly assumes that the method has already been patched
>>>>>> and therefore starts scanning the relocs of the method at +5 bytes
>>>>>> (for x86)/+4 bytes (for SPARC) (as the comment states). This
>>>>>> results in not scanning the single oop that was attaching this
>>>>>> specific method to its heap region. The verification then correctly
>>>>>> asserts, saying that "there is a method attached to this specific
>>>>>> heap region with no oops pointing into it".
>>>>> 
>>>>> Thanks for finding this - I already filed a bug for that
>>>>> (https://bugs.openjdk.java.net/browse/JDK-8028736). Please check if it
>>>>> matches your investigation.
>>>>>> 
>>>>>> Concerning the second bug, in the nmethod::oops_do logic, here is the
>>>>>> code that adjusts the reloc starting address after a method has
>>>>>> been patched:
>>>>> 
>>>>>> // If the method is not entrant or zombie then a JMP is plastered over the
>>>>>> // first few bytes.  If an oop in the old code was there, that oop
>>>>>> // should not get GC'd.  Skip the first few bytes of oops on
>>>>>> // not-entrant methods.
>>>>>>   address low_boundary = verified_entry_point();
>>>>>>   if (is_not_entrant()) {
>>>>>>     low_boundary += NativeJump::instruction_size;
>>>>>>     // %%% Note:  On SPARC we patch only a 4-byte trap, not a full NativeJump.
>>>>>>     // (See comment above.)
>>>>>>   }
>>>>>> 
>>>>>> The code above accounts only for the "is not entrant" scenario and
>>>>>> not for the "is zombie" scenario.
>>>>> 
>>>>> nmethod::oops_do will not be called for zombie methods in the default
>>>>> case, so it does not matter. I.e. the default overload of
>>>>> oops_do(Closure* cl, bool zombie) is oops_do(cl, false);
>>>> 
>>>> When unregistering a method we call oops_do with zombie methods allowed
>>>> (G1CollectedHeap::unregister_nmethod). That was causing a problem in my
>>>> case, and adding "|| is_zombie()" fixed it.
>>> 
>>> I think you are right, we will need to fix that too as the oop map is
>>> not changed after overwriting the oop in the first few bytes.
>>> 
>>> Thomas
>>> 
>>> 
> 
>
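
The low_boundary discussion quoted above can be modelled in a few lines. This is a simplified C++ sketch, not the HotSpot source: the State enum, kJumpSize (assumed to be 5, matching the "+5 bytes" quoted for x86) and both function names are hypothetical, and relocations are reduced to plain byte offsets from the verified entry point. The include_zombie flag stands for the "|| is_zombie()" change mentioned in the thread.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the nmethod state and the x86 jump size.
enum class State { in_use, not_entrant, zombie };
constexpr std::size_t kJumpSize = 5; // assumed x86 patch size

// Offset from the verified entry point below which oops are skipped.
// include_zombie models adding "|| is_zombie()" to the condition.
std::size_t low_boundary(State s, bool include_zombie) {
    bool patched = (s == State::not_entrant) ||
                   (include_zombie && s == State::zombie);
    return patched ? kJumpSize : 0;
}

// Returns the oop offsets a scan would visit, given the offset of each
// oop relocation relative to the verified entry point.
std::vector<std::size_t> visited_oops(const std::vector<std::size_t>& reloc_offsets,
                                      State s, bool include_zombie) {
    std::vector<std::size_t> out;
    std::size_t low = low_boundary(s, include_zombie);
    for (std::size_t off : reloc_offsets) {
        if (off >= low) out.push_back(off); // below the boundary: bytes were patched
    }
    return out;
}
```

In this model, an oop relocation at offset 0 is skipped for a not_entrant method, but a zombie method is still scanned at the patched bytes unless include_zombie is true.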