RFR: 8209971: TestOptionsWithRanges.java crashes in CDS mode with G1UpdateBufferSize=1

Ioi Lam ioi.lam at oracle.com
Thu Sep 6 07:04:18 UTC 2018



On 9/5/18 10:10 PM, Jiangli Zhou wrote:
> Hi Ioi,
>
> Thanks for the quick review.
>
> On 9/5/18 8:46 PM, Ioi Lam wrote:
>> Hi Jiangli,
>>
>> Thanks for fixing this! I like the simple approach that minimizes 
>> code change.
>>
>> The following comment is misleading. It suggest that mirror objects 
>> cannot be restored at this stage, but doesn't explain why.
>>
>> Also, I don't think we should single out constant pools (the crash 
>> call stack had nothing to do with constant pool.)
>>
>> 2039     // No mirror object is access/restored during
>> 2040     // 
>> resolve_wk_klasses_through(WK_KLASS_ENUM_NAME(Object_klass)...).
>> 2041     // Mirrors are restored after java.lang.Class is loaded.
>> 2042     //
>> 2043     // Object_klass()->constants()->restore_unshareable_info() 
>> restores the
>> 2044     // constant pool resolved_references array for Object klass. 
>> It must be
>> 2045     // done after MetaspaceShared::fixup_mapped_heap_regions().
> All constant pool resolved_references arrays for shared classes are 
> archived (like the mirrors). Restoration of the resolved_references 
> (of the Object klass) array object is done as part of the 
> Object_klass()->constants()->restore_unshareable_info(). Although it 
> wasn't shown in the backtrace of this specific crash, it can be an 
> issue as long it is done before the mapped archive regions are fixed 
> up. That's why MetaspaceShared::fixup_mapped_heap_regions() needs to 
> be called before 
> Object_klass()->constants()->restore_unshareable_info(). Any operation 
> to the archived java objects can be potentially unsafe if it is done 
> before MetaspaceShared::fixup_mapped_heap_regions(). So we want to 
> place the MetaspaceShared::fixup_mapped_heap_regions() call as early 
> as possible but after SystemDictionary::Object_klass() is set up. 
> Please let me know if adding some of these explanations would help 
> make the comments more clear.
>
We should have concise comments that clearly describes the issues, and 
avoid repeating things that's already in the code. So sentences like 
"Fixup the mapped archive heap regions after resolving Object_klass" 
doesn't seem that useful.

How about this:

     resolve_wk_klasses_through(WK_KLASS_ENUM_NAME(Object_klass), scan, 
CHECK);
     // It's unsafe to access the archived heap regions before they
     // are fixed up, so we must do the fixup as early as possible before
     // the archived java objects are accessedby function such as
     // java_lang_Class::restore_archived_mirror
     // and ConstantPool::restore_unshareable_info.
     //
     // However, MetaspaceShared::fixup_mapped_heap_regions() fills the
     // leading empty spaces in the archived regions and may use
     // SystemDictionary::Object_klass(), so we can do this only after
     // Object_klass is resolved.
     //
     // Note: no mirror objects are accessed/restored in the above call.
     // Mirrors are restored after java.lang.Class is loaded.
     MetaspaceShared::fixup_mapped_heap_regions();


BTW, while you're fixing this area, I think we need an assert here:

void ConstantPool::restore_unshareable_info(TRAPS) {
   ..
   if (SystemDictionary::Object_klass_loaded()) {
     ...
   } else {
+  assert(pool_holder()->name() == vmSymbols::java_lang_Object(), "must 
be");
    }

That's because you have special handling for the Object klass, but no 
other klasses. However, the check in 
ConstantPool::restore_unshareable_info currently doesn't explicitly 
match this.

     // Initialize the constant pool for the Object_class
Object_klass()->constants()->restore_unshareable_info(CHECK);


> Just to give some background/history information to help understand. 
> The mapping/fixing-up of the archive heap region was designed and 
> implemented as part of the shared String support. No archived String 
> objects were accessed before the region fixup operation. So this was 
> not an issue. Later on, open archive heap region was introduced and 
> archiving supports for resolved_reference arrays and mirrors were 
> implemented. That introduced the hidden issue because some of the 
> archived mirrors and resolved_references arrays were accessed early 
> (before region fixup). The issue was undiscovered because the mapped 
> region start usually was the same as the start of the GC region that 
> contains the mapped data. There was no gap at the top of the mapped GC 
> regions in normal cases. We recently started to stress testing this 
> area heavily with relocations for the default CDS archive and the 
> issue then surfaced.
>
>>
>> The real reason we have to do this is:
>>
>> + Before MetaspaceShared::fixup_mapped_heap_regions() is called, the 
>> heap is in an inconsistent state. At some G1 regions, the bottom part 
>> contains all zeros. However, some GC operations (debug-build 
>> verifications only?) would need to walk G1 regions starting from the 
>> bottom, and will run into asserts when we get an uninitialized block.
> Just want to make it clear, only the mapped archived heap regions may 
> have this issue when the mapped region start is different from the 
> containing GC region's start. None of the regular GC regions have such 
> issue.
>
>>
>> + One of such operation is the barrier operations, as shown in the 
>> crash call stack in the bug report. (Maybe there are others??)
> Any operation to the archived java objects could have issue if it is 
> done before archive region fixup. That's why we want to do the archive 
> region fixup as early as possible.
>>
>> For safety, we should put in asserts such that -- before 
>> MetaspaceShared::fixup_mapped_heap_regions() is done, we must not do 
>> anything that would trigger barrier operations. For example, we must 
>> not call oopDesc::obj_field_put, which triggered the barrier 
>> operations in the crash stack.
>>
>> The crash is timing related and happened only when G1UpdateBufferSize 
>> was set to one. I think we need asserts so we can reliably catch any 
>> future problems. For example, if someone adds new code to modify obj 
>> fields in the archived heap before fixup_mapped_heap_regions() is 
>> called, we should get an assert immediately, instead of waiting for 
>> some future point that may or may not happen.
>>
>> I would suggest working with the GC folks for a suitable assert.
> Not sure if there is a central place within GC code can be used for 
> the assertion, as any operation could be an issue. We can check with 
> the GC experts.
>
> If we want to be extra cautious, we could map the archive heap regions 
> without access permission (PROT_NONE) initially. The archive region 
> fixup code can then reset memory protection and makes it 
> readable/writable. With that, any operations to the archived java 
> objects before fixup can trigger signal. This probably should be done 
> for debugging only build and should be handled separately.
>

Making the memory read-only initially seems a very good idea. I don't 
think we should put it off in a future RFE. This bug happened because we 
don't have a safeguard mechanism, so the bug fix should include such a 
mechanism.

I spend 2 days trying to debug this issue. I would hate to have someone 
go through this again in the future just because we have filed an RFE 
but never get to doing it.

Thanks
- Ioi


> Thanks,
> Jiangli
>
>>
>> Thanks
>> - Ioi
>>
>>
>> On 9/5/18 7:58 PM, Jiangli Zhou wrote:
>>> Please review the following fix for JDK-8209971, 
>>> TestOptionsWithRanges.java crashes in CDS mode with 
>>> G1UpdateBufferSize=1.
>>>
>>>     webrev: http://cr.openjdk.java.net/~jiangli/8209971/webrev.00/
>>>
>>>     bug: https://bugs.openjdk.java.net/browse/JDK-8209971
>>>
>>> This is a pre-existing issue that's exposed when archived heap 
>>> object relocation is enabled. At CDS runtime, the archived heap 
>>> regions are mapped back to the runtime java heap. As the mapped heap 
>>> data may not occupy the entire runtime GC region(s), the empty 
>>> spaces within the regions are filled with dummy objects, which is 
>>> done by a fixup process after mapping the data. The fixup was done 
>>> after SystemDictionary::initialize(), because fixup needs to access 
>>> SystemDictionary::Object_klass() (in some cases). However that is 
>>> too late as archived java objects are accessed when 
>>> SystemDictionary::initialize() resolves well-known classes.
>>>
>>> The solution is to call MetaspaceShared::fixup_mapped_heap_regions() 
>>> at an earlier time. In the webre, the call is moved into 
>>> SystemDictionary::resolve_preloaded_classes(). It's called after 
>>> resolve_wk_klasses_through(WK_KLASS_ENUM_NAME(Object_klass)...), but 
>>> before any of the archived java objects is accessed. No mirror 
>>> object is access/restored during 
>>> resolve_wk_klasses_through(WK_KLASS_ENUM_NAME(Object_klass)...). 
>>> Archived mirrors are restored after java.lang.Class is loaded.
>>>
>>> Tested TestOptionsWithRanges.java locally in CDS mode. Tier1 - tier5 
>>> normal runs and tier1 - tier4 default CDS runs are in progress.
>>>
>>> Thanks,
>>>
>>> Jiangli
>>>
>>>
>>
>



More information about the hotspot-runtime-dev mailing list