RFR: 8209971: TestOptionsWithRanges.java crashes in CDS mode with G1UpdateBufferSize=1

Jiangli Zhou jiangli.zhou at oracle.com
Thu Sep 6 17:28:19 UTC 2018


Hi Ioi,


On 9/6/18 12:04 AM, Ioi Lam wrote:
>
>
> On 9/5/18 10:10 PM, Jiangli Zhou wrote:
>> Hi Ioi,
>>
>> Thanks for the quick review.
>>
>> On 9/5/18 8:46 PM, Ioi Lam wrote:
>>> Hi Jiangli,
>>>
>>> Thanks for fixing this! I like the simple approach that minimizes 
>>> code change.
>>>
>>> The following comment is misleading. It suggest that mirror objects 
>>> cannot be restored at this stage, but doesn't explain why.
>>>
>>> Also, I don't think we should single out constant pools (the crash 
>>> call stack had nothing to do with constant pool.)
>>>
>>> 2039     // No mirror object is access/restored during
>>> 2040     // 
>>> resolve_wk_klasses_through(WK_KLASS_ENUM_NAME(Object_klass)...).
>>> 2041     // Mirrors are restored after java.lang.Class is loaded.
>>> 2042     //
>>> 2043     // Object_klass()->constants()->restore_unshareable_info() 
>>> restores the
>>> 2044     // constant pool resolved_references array for Object 
>>> klass. It must be
>>> 2045     // done after MetaspaceShared::fixup_mapped_heap_regions().
>> All constant pool resolved_references arrays for shared classes are 
>> archived (like the mirrors). Restoration of the resolved_references 
>> (of the Object klass) array object is done as part of the 
>> Object_klass()->constants()->restore_unshareable_info(). Although it 
>> wasn't shown in the backtrace of this specific crash, it can be an 
>> issue as long it is done before the mapped archive regions are fixed 
>> up. That's why MetaspaceShared::fixup_mapped_heap_regions() needs to 
>> be called before 
>> Object_klass()->constants()->restore_unshareable_info(). Any 
>> operation to the archived java objects can be potentially unsafe if 
>> it is done before MetaspaceShared::fixup_mapped_heap_regions(). So we 
>> want to place the MetaspaceShared::fixup_mapped_heap_regions() call 
>> as early as possible but after SystemDictionary::Object_klass() is 
>> set up. Please let me know if adding some of these explanations would 
>> help make the comments more clear.
>>
> We should have concise comments that clearly describes the issues, and 
> avoid repeating things that's already in the code. So sentences like 
> "Fixup the mapped archive heap regions after resolving Object_klass" 
> doesn't seem that useful.
>
> How about this:
>
>     resolve_wk_klasses_through(WK_KLASS_ENUM_NAME(Object_klass), scan, 
> CHECK);
>     // It's unsafe to access the archived heap regions before they
>     // are fixed up, so we must do the fixup as early as possible before
>     // the archived java objects are accessedby function such as
>     // java_lang_Class::restore_archived_mirror
>     // and ConstantPool::restore_unshareable_info.
>     //
>     // However, MetaspaceShared::fixup_mapped_heap_regions() fills the
>     // leading empty spaces in the archived regions and may use
>     // SystemDictionary::Object_klass(), so we can do this only after
>     // Object_klass is resolved.
>     //
>     // Note: no mirror objects are accessed/restored in the above call.
>     // Mirrors are restored after java.lang.Class is loaded.
>     MetaspaceShared::fixup_mapped_heap_regions();
This looks ok. I'll make adjustment as suggested. To prevent 
Object_klass()->constants()->restore_unshareable_info() accidentally 
being move above MetaspaceShared::fixup_mapped_heap_regions() in the 
future, the following comment should be included explicitly.

2043     // Object_klass()->constants()->restore_unshareable_info() 
restores the
2044     // constant pool resolved_references array for Object klass. It 
must be
2045     // done after MetaspaceShared::fixup_mapped_heap_regions().
>
>
> BTW, while you're fixing this area, I think we need an assert here:
>
> void ConstantPool::restore_unshareable_info(TRAPS) {
>   ..
>   if (SystemDictionary::Object_klass_loaded()) {
>     ...
>   } else {
> +  assert(pool_holder()->name() == vmSymbols::java_lang_Object(), 
> "must be");
>    }
Ok.
>
> That's because you have special handling for the Object klass, but no 
> other klasses. However, the check in 
> ConstantPool::restore_unshareable_info currently doesn't explicitly 
> match this.
>
>     // Initialize the constant pool for the Object_class
> Object_klass()->constants()->restore_unshareable_info(CHECK);
>
>
>> Just to give some background/history information to help understand. 
>> The mapping/fixing-up of the archive heap region was designed and 
>> implemented as part of the shared String support. No archived String 
>> objects were accessed before the region fixup operation. So this was 
>> not an issue. Later on, open archive heap region was introduced and 
>> archiving supports for resolved_reference arrays and mirrors were 
>> implemented. That introduced the hidden issue because some of the 
>> archived mirrors and resolved_references arrays were accessed early 
>> (before region fixup). The issue was undiscovered because the mapped 
>> region start usually was the same as the start of the GC region that 
>> contains the mapped data. There was no gap at the top of the mapped 
>> GC regions in normal cases. We recently started to stress testing 
>> this area heavily with relocations for the default CDS archive and 
>> the issue then surfaced.
>>
>>>
>>> The real reason we have to do this is:
>>>
>>> + Before MetaspaceShared::fixup_mapped_heap_regions() is called, the 
>>> heap is in an inconsistent state. At some G1 regions, the bottom 
>>> part contains all zeros. However, some GC operations (debug-build 
>>> verifications only?) would need to walk G1 regions starting from the 
>>> bottom, and will run into asserts when we get an uninitialized block.
>> Just want to make it clear, only the mapped archived heap regions may 
>> have this issue when the mapped region start is different from the 
>> containing GC region's start. None of the regular GC regions have 
>> such issue.
>>
>>>
>>> + One of such operation is the barrier operations, as shown in the 
>>> crash call stack in the bug report. (Maybe there are others??)
>> Any operation to the archived java objects could have issue if it is 
>> done before archive region fixup. That's why we want to do the 
>> archive region fixup as early as possible.
>>>
>>> For safety, we should put in asserts such that -- before 
>>> MetaspaceShared::fixup_mapped_heap_regions() is done, we must not do 
>>> anything that would trigger barrier operations. For example, we must 
>>> not call oopDesc::obj_field_put, which triggered the barrier 
>>> operations in the crash stack.
>>>
>>> The crash is timing related and happened only when 
>>> G1UpdateBufferSize was set to one. I think we need asserts so we can 
>>> reliably catch any future problems. For example, if someone adds new 
>>> code to modify obj fields in the archived heap before 
>>> fixup_mapped_heap_regions() is called, we should get an assert 
>>> immediately, instead of waiting for some future point that may or 
>>> may not happen.
>>>
>>> I would suggest working with the GC folks for a suitable assert.
>> Not sure if there is a central place within GC code can be used for 
>> the assertion, as any operation could be an issue. We can check with 
>> the GC experts.
>>
>> If we want to be extra cautious, we could map the archive heap 
>> regions without access permission (PROT_NONE) initially. The archive 
>> region fixup code can then reset memory protection and makes it 
>> readable/writable. With that, any operations to the archived java 
>> objects before fixup can trigger signal. This probably should be done 
>> for debugging only build and should be handled separately.
>>
>
> Making the memory read-only initially seems a very good idea. I don't 
> think we should put it off in a future RFE. This bug happened because 
> we don't have a safeguard mechanism, so the bug fix should include 
> such a mechanism.
I briefly looked into it already. I think it requires some code 
restructuring. The scope is larger and beyond the current bug fix.

There is no object restoring before 
MetaspaceShared::fixup_mapped_heap_regions() with the fix. To put a 
safeguard, I can place an assert in the CDS code where it materializes 
an archived object. That's the entry point to GC code.

I'll make changes and update the webrev.

Thanks,
Jiangli

>
> I spend 2 days trying to debug this issue. I would hate to have 
> someone go through this again in the future just because we have filed 
> an RFE but never get to doing it.
>
> Thanks
> - Ioi
>
>
>> Thanks,
>> Jiangli
>>
>>>
>>> Thanks
>>> - Ioi
>>>
>>>
>>> On 9/5/18 7:58 PM, Jiangli Zhou wrote:
>>>> Please review the following fix for JDK-8209971, 
>>>> TestOptionsWithRanges.java crashes in CDS mode with 
>>>> G1UpdateBufferSize=1.
>>>>
>>>>     webrev: http://cr.openjdk.java.net/~jiangli/8209971/webrev.00/
>>>>
>>>>     bug: https://bugs.openjdk.java.net/browse/JDK-8209971
>>>>
>>>> This is a pre-existing issue that's exposed when archived heap 
>>>> object relocation is enabled. At CDS runtime, the archived heap 
>>>> regions are mapped back to the runtime java heap. As the mapped 
>>>> heap data may not occupy the entire runtime GC region(s), the empty 
>>>> spaces within the regions are filled with dummy objects, which is 
>>>> done by a fixup process after mapping the data. The fixup was done 
>>>> after SystemDictionary::initialize(), because fixup needs to access 
>>>> SystemDictionary::Object_klass() (in some cases). However that is 
>>>> too late as archived java objects are accessed when 
>>>> SystemDictionary::initialize() resolves well-known classes.
>>>>
>>>> The solution is to call 
>>>> MetaspaceShared::fixup_mapped_heap_regions() at an earlier time. In 
>>>> the webre, the call is moved into 
>>>> SystemDictionary::resolve_preloaded_classes(). It's called after 
>>>> resolve_wk_klasses_through(WK_KLASS_ENUM_NAME(Object_klass)...), 
>>>> but before any of the archived java objects is accessed. No mirror 
>>>> object is access/restored during 
>>>> resolve_wk_klasses_through(WK_KLASS_ENUM_NAME(Object_klass)...). 
>>>> Archived mirrors are restored after java.lang.Class is loaded.
>>>>
>>>> Tested TestOptionsWithRanges.java locally in CDS mode. Tier1 - 
>>>> tier5 normal runs and tier1 - tier4 default CDS runs are in progress.
>>>>
>>>> Thanks,
>>>>
>>>> Jiangli
>>>>
>>>>
>>>
>>
>



More information about the hotspot-runtime-dev mailing list