Possible bug in nmethod reloc verification/scanning in G1

Thu Nov 21 18:31:26 UTC 2013

CC to compiler because it is about compiled nmethods and their state 
change. So it is interesting for us.

Note, usually such methods are inlinied (accessor methods) and big 
methods have stack bang code at the beginning. Here is criteria used to 
generate stack bang:

bool Compile::need_stack_bang(int frame_size_in_bytes) const {
   // Determine if we need to generate a stack overflow check.
   // Do it if the method is not a stub function and
   // has java calls or has frame size > vm_page_size/8.
   return (UseStackBanging && stub_function() == NULL &&
           (has_java_calls() || frame_size_in_bytes > 
os::vm_page_size()>>3));
}

In your case stack bang is not generated that is why you have embedded 
oop at the beginning of the code.
We can mark such nmethods so it will be easier to see such nmethod when 
we need unregister them.

And thank you for catching both problems.

Thanks,
Vladimir

On 11/21/13 5:54 AM, Thomas Schatzl wrote:
> Hi,
>
> On Thu, 2013-11-21 at 14:42 +0100, Christos Kotselidis wrote:
>> On 11/21/13 2:34 PM, "Thomas Schatzl" <thomas.schatzl at oracle.com> wrote:
>>
>>> Hi,
>>>
>>> On Thu, 2013-11-21 at 12:40 +0100, Christos Kotselidis wrote:
>>>> Hi,
>>>>
>>>> I came across the following scenario, I already discussed with Thomas
>>>>   Schatzl, that I would like to share with you.
>>>> There is a method:
>>>>
>>>> private static final HotSpotGraalRuntime instance = new
>>>> HotSpotGraalRuntime();
>>>>
>>>> public static HotSpotGraalRuntime runtime() {
>>>>          return instance;
>>>> }
>>>
>>>> The instance field is a constant oop in the code and the method is
>>>> installed.
>>>> At some stage there is a GC with post verification
>>>> (including G1VerifyHeapRegionCodeRoots) that passes.
>>>> Before the next GC cycle and during the mutation cycle, another method
>>>> is being installed which causes the eviction of our method.
>>>> Consequently our method is being made not_entrant and it is being
>>>> patched with a jmp instruction at its verified entry point
>>>> (nmethod.cpp::make_not_entrant_or_zombie). The patching causes the oop
>>>> to be overwritten by the jmp instruction. Furthermore, that oop was
>>>> the only oop that was responsible for attaching this method to its
>>>> correspondent HeapRegion.
>>>>
>>>> The next GC cycle (Pre-verification) comes and the HeapRegionCodeRoot
>>>> verification starts (nmethod.cpp::oops_do). There is a logic there
>>>> (which I also believe it may be buggy as I will explain later) that
>>>> checks whether the method is not_entrant. If the method is not_entrant
>>>> then the logic correctly assumes that the method has been already been
>>>> patched and therefore starts scanning the relocs of the method in +5
>>>> bytes (for x86)/+4 bytes (for SPARC) (as the comment states). This
>>>> results in not scanning the single oop that was attaching this
>>>> specific method to its heap region. The verification then correctly
>>>> asserts by saying that "there is a method attached to this specific
>>>> heap region with no oops pointing into it".
>>>
>>> Thanks for finding this - I already filed a bug for that
>>> (https://bugs.openjdk.java.net/browse/JDK-8028736). Please check if it
>>> matches your investigation.
>>>>
>>>> Concerning the second bug in the method::oops_do logic, here is the
>>>> code for manipulating the reloc starting address after a method has
>>>> been patched:
>>>
>>>> // If the method is not entrant or zombie then a JMP is plastered over
>>>> // the
>>>> // first few bytes.  If an oop in the old code was there, that oop
>>>> // should not get GC'd.  Skip the first few bytes of oops on
>>>> // not-entrant methods.
>>>>    address low_boundary = verified_entry_point();
>>>>    if (is_not_entrant()) {
>>>>      low_boundary += NativeJump::instruction_size;
>>>>      // %%% Note:  On SPARC we patch only a 4-byte trap, not a full
>>>> NativeJump.
>>>>      // (See comment above.)
>>>>    }
>>>>
>>>> The code above accounts only for the "is not entrant" scenario and not
>>>> with the "is zombie" scenario.
>>>
>>> nmethod::do_oops will not be called for zombie methods in the default
>>> case, so it does not matter. I.e. the default overload of
>>> oops_do(Closure* cl, bool zombie) is oops_do(cl, false);
>>
>> When unregistering a method we call the oops_do allowing zombie methods
>> (G1CollectedHeap::unregister_nmethod). That was causing a problem in my
>> case and after adding "|| is_zombie()", it got fixed.
>
> I think you are right, we will need to fix that too as the oop map is
> not changed after overwriting the oop in the first few bytes.
>
> Thomas
>
>