RFR(XS) 8191369: NMT: Enhance thread stack tracking

Mon Mar 5 17:51:29 UTC 2018

Hi Zhengyu,

On Mon, Mar 5, 2018 at 6:13 PM, Zhengyu Gu <zgu at redhat.com> wrote:

> Hi Thomas,
>
>
> On 03/05/2018 11:21 AM, Thomas Stüfe wrote:
>
>> Hi Zhengyu,
>>
>> about the os_windows.cpp part:
>>
>> Are you sure this works as intended? I was not able to test it yet. It
>> may work but makes some assumptions: that the stack top you are calculating
>> is actually the end of a heap region; that you can just pass any address as
>> the first parameter to VirtualQuery (the documentation is a bit ambiguous
>> here); even that all committed pages must be consecutive, which does not
>> have to be the case (we had cases on HyperV where there were holes in the
>> committed thread stack area).
>>
>> I probably would have reused the coding in os::current_stack_base(),
>> because we know that works:
>>
>> - call VirtualQuery with an address on the stack to get
>> MEMORY_BASIC_INFORMATION::AllocationBase. We could just use stack bottom
>> here as input.
>> - then walk all regions upward and sum up the sizes of the committed
>> ones, until we hit stack top - which supposedly should coincide with the
>> end of a heap region but who knows.
>>
>
> The stack range we have here, is obtained from os::current_stack_base()
> and os::current_stack_size() during thread initialization, exactly the way
> as you mentioned above, I don't see why we can not trust it here.
>

Okay, it is fine then. Should problems show up we can also fix it later.

>
> I don't have any experiment on HyperV. If I recall correctly, on Windows,
> the stack has to be touched page by page, if you skip a page in the middle,
> you get SEGV. This may be different under Hypervisor? so we should walk
> whole range, instead of stopping at the first uncommitted region?
>
>
What we were seeing was that pages in the middle of the used stack (so,
above sp) got uncommitted when the host was under heavy load. And yes, we
crashed there. We never really solved this, but until now assume this was a
hypervisor bug - that it removes comitted pages from a guest when being
under memory pressure. When we used VirtualQuery on those missing pages,
they were indeed shown as uncomitted - in this case your tracker would stop
counting at this point, even though committed pages may follow.

My original thought was "well, if we just iterate and count all committed
regions - disregarding uncommitted holes - we are simply more correct in
the terms of this API, which wants to count committed space".   But maybe I
am overthinking, and this is just an OS bug and should not be handled in
the VM.

Thanks, Thomas

> Thanks,
>
> -Zhengyu
>
>
>
>
>
>> Best Regards, Thomas
>>
>>
>>
>>
>>
>> On Fri, Feb 23, 2018 at 5:52 PM, <coleen.phillimore at oracle.com <mailto:
>> coleen.phillimore at oracle.com>> wrote:
>>
>>
>>     This looks good but I hope someone else can review the os_linux and
>>     os_windows parts.
>>
>>     I will sponsor this once the reviews are complete.
>>
>>     thanks,
>>     Coleen
>>
>>
>>     On 2/21/18 10:49 AM, Zhengyu Gu wrote:
>>
>>         Hi,
>>
>>         I rebased the patch after JDK-8196405[1]. Stefan's patch fixed
>>         virtual memory tracking issues I tried to address in early patch.
>>
>>
>>         In additions, I removed os::pd_committed_stack_size() methods as
>>         Coleen suggested. Also added new gtest.
>>
>>         Bug: https://bugs.openjdk.java.net/browse/JDK-8191369
>>         <https://bugs.openjdk.java.net/browse/JDK-8191369>
>>         Webrev: http://cr.openjdk.java.net/~zgu/8191369/webrev.03/
>>         <http://cr.openjdk.java.net/~zgu/8191369/webrev.03/>
>>
>>         Test:
>>            hotspot_runtime, test-hotspot-gtest on Linux 64 with
>>         fastdebug and release builds.
>>
>>
>>         [1] http://cr.openjdk.java.net/~stefank/8196405/webrev.02/
>>         <http://cr.openjdk.java.net/~stefank/8196405/webrev.02/>
>>
>>         Thanks,
>>
>>         -Zhengyu
>>
>>
>>         On 02/12/2018 09:33 AM, Zhengyu Gu wrote:
>>
>>
>>
>>                 No, what you have is better.   I dislike the delegation
>>                 to the duplicate function signature in the pd_* layer.
>>               Can you make the os::committed_stack_size() functions in
>>                 the platforms where this isn't supported just be empy
>>                 functions?
>>
>>             Okay.
>>
>>
>>                 If there are platforms that don't have the ability to
>>                 track committed regions of the mtThread stack, don't you
>>                 still need the concept of all_committed?
>>
>>             No, it exists for convenience. However, stack tracking code
>>             does need adjustment.
>>
>>             Thanks,
>>
>>             -Zhengyu
>>
>>
>>
>>                 thanks,
>>                 Coleen
>>
>>
>>                     Thanks,
>>
>>                     -Zhengyu
>>
>>
>>
>>
>>
>>                         Thanks,
>>                         Coleen
>>
>>                         On 1/11/18 4:05 PM, Zhengyu Gu wrote:
>>
>>                             Hi,
>>
>>                             Can I get second (R)eviewer for this? and I
>>                             also need a sponsor.
>>
>>                             This is JDK11 bug.
>>
>>                             Webrev:
>>                             http://cr.openjdk.java.net/~zg
>> u/8191369/webrev.02/
>>                             <http://cr.openjdk.java.net/~z
>> gu/8191369/webrev.02/>
>>                             Bug: Bug:
>>                             https://bugs.openjdk.java.net/
>> browse/JDK-8191369
>>                             <https://bugs.openjdk.java.net
>> /browse/JDK-8191369>
>>
>>                             Thanks,
>>
>>                             -Zhengyu
>>
>>
>>
>>
>>                             On 11/20/2017 01:46 PM, Zhengyu Gu wrote:
>>
>>                                 Hi Andrew,
>>
>>                                 Thanks for the review.
>>
>>                                         I'm not sure about the logic of
>>                                         the test when you hit the the
>>                                         ENOMEM/uncommited case.
>>
>>                                         +    mincore_return_value =
>>                                         mincore(test_addr, page_sz, vec);
>>                                         +
>>                                         +    if (mincore_return_value ==
>>                                         -1 || (committed_only && (vec[0] &
>>                                         0x01) == 0)) {
>>                                         +      // Page is not
>>                                         mapped/committed go up
>>                                         +      // to find first
>>                                         mapped/committed page
>>                                         +      if (errno != EAGAIN ||
>>                                         (committed_only && (vec[0] &
>>                                         0x01) == 0)) {
>>                                         +
>>                assert(mincore_return_value !=
>>                                         -1 || errno == ENOMEM,
>>                                         "Unexpected mincore errno");
>>
>>                                         Should that not be
>>
>>                                             +      if
>>                                         ((mincore_return_value != -1 &&
>>                                         errno != EAGAIN) ||
>>                                         (committed_only && (vec[0] &
>>                                         0x01) == 0)) {
>>
>>
>>                                     Sorry, that correction should have
>> been:
>>
>>                                         +      if ((mincore_return_value
>>                                     == -1 && errno != EAGAIN) ||
>>                                         (committed_only && (vec[0] &
>>                                     0x01) == 0)) {
>>
>>
>>                                 Thanks for pointing out. Updated
>>                                 accordingly:
>>
>>                                 Webrev:
>>                                 http://cr.openjdk.java.net/~zg
>> u/8191369/webrev.02/
>>                                 <http://cr.openjdk.java.net/~z
>> gu/8191369/webrev.02/>
>>
>>                                 -Zhengyu
>>
>>
>>                                         A successful call to mincore is
>>                                         not guaranteed to reset errno and
>> it
>>                                         might by chance already have
>>                                         value EAGAIN before the call to
>>                                         mincore. If
>>                                         we happened to hit a mapped,
>>                                         committed page first time then
>>                                         the vec bit
>>                                         will be 1 and the code will loop
>>                                         and retest the same address.
>>                                         Unlikely
>>                                         but possible?
>>
>>
>>                                         The adjustment to the edge case
>>                                         logic now looks correct.
>>
>>                                             The test case for verifying
>>                                             fixes for above two issues
>>                                             can be found
>>                                             here:
>>
>> http://cr.openjdk.java.net/~zgu/8191369/test_mmap.c
>>                                             <
>> http://cr.openjdk.java.net/~zgu/8191369/test_mmap.c>
>>
>>
>>                                         Yes, that verifies the edge case
>>                                         handling e.g. I ran it with 256
>>                                         pages
>>                                         and with 127/8/9 mapped and it
>>                                         gave the right results.
>>
>>                                             As mentioned in early email,
>>                                             new patch also contains
>>                                             implementation for
>>                                             Windows.
>>
>>                                             Updated Webrev:
>>
>> http://cr.openjdk.java.net/~zgu/8191369/webrev.01/
>>                                             <
>> http://cr.openjdk.java.net/~zgu/8191369/webrev.01/>
>>
>>                                             Test:
>>
>>                                                 Reran hotspot_tier1
>>                                             tests on Windows x64 and
>>                                             Linux x64 (fastdebug and
>>                                             release)
>>
>>                                         I have no idea about the
>>                                         validity of the Windows fix but
>>                                         the Linux one
>>                                         looks good modulo that if
>> condition.
>>
>>                                         regards,
>>
>>
>>                                         Andrew Dinn
>>                                         -----------
>>                                         Senior Principal Software Engineer
>>                                         Red Hat UK Ltd
>>                                         Registered in England and Wales
>>                                         under Company Registration No.
>>                                         03798903
>>                                         Directors: Michael Cunningham,
>>                                         Michael ("Mike") O'Neill, Eric
>>                                         Shander
>>
>>
>>
>>
>>
>>
>>