RFR (preliminary): 8202772: "NMT erroneously assumes thread stack boundaries to be page aligned"

Thu Jun 7 14:52:19 UTC 2018

> 
> 3129     // During shutdown, some memory goes away without properly notifying NMT,
> 3130     // E.g. ConcurrentGCThread/WatcherThread can exit without deleting thread object.
> 3131     // Bailout and return as not committed for now.
> 3132     if (mincore_return_value == -1 && errno == ENOMEM) {
> 3133       return false;
> 3134     }
> 
> Here, if the reserved region spans multiple stripes, we would cancel
> the mincore loop the moment we encounter a completely empty
> (non-resident) page stripe? Is this intended? I think that would
> lead to partly committed regions being reported erroneously as not
> having any committed parts.
> 
> 3137     // Process this stripe
> 3138     for (int vecIdx = 0; vecIdx < pages_to_query; vecIdx ++) {
> 3139       if ((vec[vecIdx] & 0x01) == 0) { // not committed
> 3140         // End of current contiguous region
> 3141         if (committed_start != NULL) {
> 3142           break;
> 3143         }
> 3144       } else { // committed
> 
> Here, at line 3142, to my understanding we have found a complete
> contiguous range of resident pages. That we should report, right?
> So we should return.
> 
> However, we don't, since the break at 3142 will just break out of the
> inner for loop at 3138, not the outer mincore-stripe-loop at 3121.
> Which means for multiple stripes, we would now continue with the next
> page stripe and lose the information about the contiguous memory
> range encountered.
> 
> I might be wrong... what do you think?
Yes, you are right! I filed:

https://bugs.openjdk.java.net/browse/JDK-8204557
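
To illustrate the control-flow problem discussed above, here is a minimal
sketch. It is not the HotSpot code: first_committed_range and its parameters
are hypothetical names, and a plain vector stands in for the per-page bytes
that mincore() would fill (low bit set means resident). The point is only
that a flag is needed to leave the outer stripe loop once the end of the
first contiguous resident range has been seen.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper, not the HotSpot implementation.
// 'resident' simulates the mincore() result for the whole region, one byte
// per page. 'stripe_pages' is the number of pages queried per stripe and is
// assumed to be greater than zero.
// Returns {first_page, page_count} of the first contiguous resident range,
// or {0, 0} if no page is resident.
std::pair<std::size_t, std::size_t>
first_committed_range(const std::vector<unsigned char>& resident,
                      std::size_t stripe_pages) {
  bool found = false;   // have we seen the start of a resident range?
  bool done = false;    // have we seen its end?
  std::size_t start = 0;
  std::size_t count = 0;

  // Outer loop: one iteration per stripe. The '!done' condition is what a
  // plain 'break' in the inner loop cannot provide.
  for (std::size_t base = 0; base < resident.size() && !done;
       base += stripe_pages) {
    const std::size_t pages_to_query =
        std::min(stripe_pages, resident.size() - base);

    // Inner loop: process this stripe.
    for (std::size_t i = 0; i < pages_to_query; i++) {
      if ((resident[base + i] & 0x01) == 0) {  // not resident
        if (found) {
          done = true;  // end of the contiguous range: stop both loops
          break;
        }
      } else {                                 // resident
        if (!found) {
          found = true;
          start = base + i;
        }
        count++;
      }
    }
  }
  return {start, count};
}
```

With the done flag, a range that ends inside one stripe is reported as is,
while a range that continues across a stripe boundary is still extended into
the next stripe.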

Thanks,

-Zhengyu


> 
> ..Thomas
>>
>> Thanks,
>>
>> -Zhengyu
>>
>>
>>>
>>> My first attempt at fixing this (see above webrev) was to feed NMT a
>>> corrected version of the thread stack size - just the page-aligned
>>> inner portion of the stack - that way we lose a bit of fidelity in NMT
>>> thread stack accounting, but at least we do not crash. That makes
>>> runtime errors go away, but there is a gtest which stubbornly refuses
>>> to heal.
>>>
>>> See CommittedVirtualMemoryTest
>>> (test/hotspot/gtest/runtime/test_committed_virtualmemory.cpp): I admit
>>> I do not fully understand this test. It seems to record the current
>>> thread's stack base and size - ok - and then query the virtual regions
>>> as perceived by NMT, expecting that the stack top is at the end of a
>>> committed region. But even without the matter of unaligned stack base,
>>> could it not be that virtual regions in NMT are fused, e.g. if
>>> multiple thread stacks are placed next to each other? So, I am not
>>> sure the test is correct.
>>>
>>> Would be nice if someone with more NMT knowledge could comment.
>>>
>>> --
>>>
>>> Please note: Since I do most of my development on Linux, I modified
>>> the stack base in the preliminary patch a bit to emulate the same
>>> error on Linux I get on AIX. Because AIX is a terrible platform to
>>> debug on :)
>>>
>>> Note that the VM usually is fine with unaligned stack bases - NMT is
>>> the only part I know of which has problems with that.
>>>
>>> --
>>>
>>> Thanks a lot,
>>>
>>> Best Regards, Thomas
>>>
>>
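
For reference, the "page-aligned inner portion of the stack" mentioned in the
original mail could be computed roughly as below. This is a sketch under
stated assumptions, not the code from the webrev: StackRange and
page_aligned_inner are hypothetical names, the stack is assumed to grow
downward from stack_base, and page_size is assumed to be a power of two.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical result type: lowest address and size of the aligned range.
struct StackRange {
  std::uintptr_t low;
  std::size_t size;
};

// Hypothetical helper: shrink [stack_base - stack_size, stack_base) inward
// so that both ends fall on page boundaries.
// Assumes page_size is a power of two and stack_size <= stack_base.
StackRange page_aligned_inner(std::uintptr_t stack_base,
                              std::size_t stack_size,
                              std::size_t page_size) {
  const std::uintptr_t mask = page_size - 1;
  const std::uintptr_t low = stack_base - stack_size;
  const std::uintptr_t aligned_low = (low + mask) & ~mask;   // round up
  const std::uintptr_t aligned_high = stack_base & ~mask;    // round down
  if (aligned_high <= aligned_low) {
    return {aligned_low, 0};  // no complete page inside the stack
  }
  return {aligned_low, static_cast<std::size_t>(aligned_high - aligned_low)};
}
```

The partial pages at either end are dropped, which is the loss of fidelity
in thread stack accounting that the mail describes.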