Possible bug in nmethod reloc verification/scanning in G1

Thu Nov 21 20:46:15 UTC 2013

Thanks for the info. Yes, I cc'ed initially the compiler team but I guess
the original email is pending for verification.

Regards

Christos

On 11/21/13 7:31 PM, "Vladimir Kozlov" <vladimir.kozlov at oracle.com> wrote:

>CC to compiler because it is about compiled nmethods and their state
>change. So it is interesting for us.
>
>Note, usually such methods are inlinied (accessor methods) and big
>methods have stack bang code at the beginning. Here is criteria used to
>generate stack bang:
>
>bool Compile::need_stack_bang(int frame_size_in_bytes) const {
>   // Determine if we need to generate a stack overflow check.
>   // Do it if the method is not a stub function and
>   // has java calls or has frame size > vm_page_size/8.
>   return (UseStackBanging && stub_function() == NULL &&
>           (has_java_calls() || frame_size_in_bytes >
>os::vm_page_size()>>3));
>}
>
>In your case stack bang is not generated that is why you have embedded
>oop at the beginning of the code.
>We can mark such nmethods so it will be easier to see such nmethod when
>we need unregister them.
>
>And thank you for catching both problems.
>
>Thanks,
>Vladimir
>
>On 11/21/13 5:54 AM, Thomas Schatzl wrote:
>> Hi,
>>
>> On Thu, 2013-11-21 at 14:42 +0100, Christos Kotselidis wrote:
>>> On 11/21/13 2:34 PM, "Thomas Schatzl" <thomas.schatzl at oracle.com>
>>>wrote:
>>>
>>>> Hi,
>>>>
>>>> On Thu, 2013-11-21 at 12:40 +0100, Christos Kotselidis wrote:
>>>>> Hi,
>>>>>
>>>>> I came across the following scenario, I already discussed with Thomas
>>>>>   Schatzl, that I would like to share with you.
>>>>> There is a method:
>>>>>
>>>>> private static final HotSpotGraalRuntime instance = new
>>>>> HotSpotGraalRuntime();
>>>>>
>>>>> public static HotSpotGraalRuntime runtime() {
>>>>>          return instance;
>>>>> }
>>>>
>>>>> The instance field is a constant oop in the code and the method is
>>>>> installed.
>>>>> At some stage there is a GC with post verification
>>>>> (including G1VerifyHeapRegionCodeRoots) that passes.
>>>>> Before the next GC cycle and during the mutation cycle, another
>>>>>method
>>>>> is being installed which causes the eviction of our method.
>>>>> Consequently our method is being made not_entrant and it is being
>>>>> patched with a jmp instruction at its verified entry point
>>>>> (nmethod.cpp::make_not_entrant_or_zombie). The patching causes the
>>>>>oop
>>>>> to be overwritten by the jmp instruction. Furthermore, that oop was
>>>>> the only oop that was responsible for attaching this method to its
>>>>> correspondent HeapRegion.
>>>>>
>>>>> The next GC cycle (Pre-verification) comes and the HeapRegionCodeRoot
>>>>> verification starts (nmethod.cpp::oops_do). There is a logic there
>>>>> (which I also believe it may be buggy as I will explain later) that
>>>>> checks whether the method is not_entrant. If the method is
>>>>>not_entrant
>>>>> then the logic correctly assumes that the method has been already
>>>>>been
>>>>> patched and therefore starts scanning the relocs of the method in +5
>>>>> bytes (for x86)/+4 bytes (for SPARC) (as the comment states). This
>>>>> results in not scanning the single oop that was attaching this
>>>>> specific method to its heap region. The verification then correctly
>>>>> asserts by saying that "there is a method attached to this specific
>>>>> heap region with no oops pointing into it".
>>>>
>>>> Thanks for finding this - I already filed a bug for that
>>>> (https://bugs.openjdk.java.net/browse/JDK-8028736). Please check if it
>>>> matches your investigation.
>>>>>
>>>>> Concerning the second bug in the method::oops_do logic, here is the
>>>>> code for manipulating the reloc starting address after a method has
>>>>> been patched:
>>>>
>>>>> // If the method is not entrant or zombie then a JMP is plastered
>>>>>over
>>>>> // the
>>>>> // first few bytes.  If an oop in the old code was there, that oop
>>>>> // should not get GC'd.  Skip the first few bytes of oops on
>>>>> // not-entrant methods.
>>>>>    address low_boundary = verified_entry_point();
>>>>>    if (is_not_entrant()) {
>>>>>      low_boundary += NativeJump::instruction_size;
>>>>>      // %%% Note:  On SPARC we patch only a 4-byte trap, not a full
>>>>> NativeJump.
>>>>>      // (See comment above.)
>>>>>    }
>>>>>
>>>>> The code above accounts only for the "is not entrant" scenario and
>>>>>not
>>>>> with the "is zombie" scenario.
>>>>
>>>> nmethod::do_oops will not be called for zombie methods in the default
>>>> case, so it does not matter. I.e. the default overload of
>>>> oops_do(Closure* cl, bool zombie) is oops_do(cl, false);
>>>
>>> When unregistering a method we call the oops_do allowing zombie methods
>>> (G1CollectedHeap::unregister_nmethod). That was causing a problem in my
>>> case and after adding "|| is_zombie()", it got fixed.
>>
>> I think you are right, we will need to fix that too as the oop map is
>> not changed after overwriting the oop in the first few bytes.
>>
>> Thomas
>>
>>