Reply: Re: A solution to detect the footprint of Java thread stack

Wed Dec 16 03:30:54 UTC 2020

Hi Tomas,

Thank you for your comment. 

> I may misunderstand the proposal, but is this a one-shot solution? So the first time we reach the warning zone we get a warning, zone gets disabled, never gets enabled again?

We disable only the pages between the accessed address and the warning zone base, not the whole warning zone. The disabled pages are never enabled again.
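
The partial disabling described above can be sketched as plain address arithmetic. This is only an illustration, not code from the patch: the names UnprotectRange and range_to_unprotect are hypothetical, and the sketch assumes that the stack grows toward lower addresses, that "warning zone base" means the zone's high-address boundary (the edge a growing stack reaches first), and that the zone boundaries are page-aligned.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type: half-open address range [begin, end).
struct UnprotectRange {
  std::uintptr_t begin;
  std::uintptr_t end;
};

// Assumption: the stack grows toward lower addresses, so the warning zone is
// [zone_low, zone_base) and zone_base is the edge the thread reaches first.
// page_size must be a power of two; zone_low is assumed to be page-aligned.
inline UnprotectRange range_to_unprotect(std::uintptr_t fault_addr,
                                         std::uintptr_t zone_low,
                                         std::uintptr_t zone_base,
                                         std::uintptr_t page_size) {
  assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
  assert(zone_low <= fault_addr && fault_addr < zone_base);
  // Round the faulting address down to the start of its page.
  const std::uintptr_t page_start = fault_addr & ~(page_size - 1);
  // Pages from the faulting page up to the zone base become accessible.
  // [zone_low, page_start) stays protected and can still raise a later warning.
  return UnprotectRange{page_start, zone_base};
}
```

Under these assumptions, a fault at 0x7123 in a zone spanning [0x4000, 0x8000) with 4K pages unprotects [0x7000, 0x8000) and leaves [0x4000, 0x7000) protected, so a deeper access later would produce another warning.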

> Looking from outside at all threads, some will have their warning zone disabled, some not. We have the information that a thread reached the warning level, but not when and what it did then, no? It could have been a short spike in the deep past, never to reoccur, or it could be a permanent high stack usage. I think especially for long-running worker threads which have many different workloads this may limit its usefulness.

That is a helpful point. Warning messages should contain not only the thread id, current footprint, and stack size but also the thread's stack trace, so that users have more context. A temporary short spike is indeed possible. My original intention was that the user observes the warning messages from recent iterations of the application and then judges whether to adjust -Xss for the next iteration.

Thanks,
Yang Yi
------------------------------------------------------------------
From: Thomas Stüfe <thomas.stuefe at gmail.com>
Date: December 15, 2020 15:24:49
To: Yang Yi <qingfeng.yy at alibaba-inc.com>
Cc: HotSpot Open Source Developers <hotspot-dev at openjdk.java.net>; hotspot-jfr-dev <hotspot-jfr-dev at openjdk.java.net>
Subject: Re: A solution to detect the footprint of Java thread stack

Hi Yang Yi,

it sounds interesting, but I am not yet sure how useful this will be in the proposed form.

I may misunderstand the proposal, but is this a one-shot solution? So the first time we reach the warning zone we get a warning, zone gets disabled, never gets enabled again?

We have that warning in UL and JFR, but we do not have any context. Looking from outside at all threads, some will have their warning zone disabled, some not. We have the information that a thread reached the warning level, but not when and what it did then, no? It could have been a short spike in the deep past, never to reoccur, or it could be a permanent high stack usage. I think especially for long-running worker threads which have many different workloads this may limit its usefulness.

If it is only to measure stack sizes, would a solution in JFR which iterates the stacks of threads periodically and measures the real used stack (as NMT does it, using mincore()) not be better? That would be somewhat simpler to implement, would not be a one-shot solution, would not be bound to one warning level, would get more precise information, and would not rely on page granularity.
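
The mincore()-based measurement mentioned above can be sketched as follows. This is a minimal Linux-only illustration, not NMT's or JFR's actual code: the function names count_resident_pages and touched_pages_demo are hypothetical, and an anonymous mmap() region stands in for a thread stack. Residency only approximates usage, since pages can be swapped out and some environments report every page as resident.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#include <sys/mman.h>
#include <unistd.h>

// Count how many pages in [addr, addr + len) are resident in memory.
// addr must be page-aligned. Returns -1 if mincore() fails.
inline long count_resident_pages(void* addr, std::size_t len) {
  const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
  const std::size_t pages = (len + page - 1) / page;
  std::vector<unsigned char> vec(pages);
  if (mincore(addr, len, vec.data()) != 0) return -1;
  long resident = 0;
  for (unsigned char v : vec) resident += (v & 1);  // bit 0 set: page resident
  return resident;
}

// Stand-in for a thread stack: map total_pages, write to the highest
// `touched` pages (as a downward-growing stack would), then measure.
// Returns the resident page count, or -1 on failure.
inline long touched_pages_demo(std::size_t total_pages, std::size_t touched) {
  assert(touched <= total_pages);
  const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
  const std::size_t len = total_pages * page;
  void* mem = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (mem == MAP_FAILED) return -1;
  char* top = static_cast<char*>(mem) + len;
  for (std::size_t i = 1; i <= touched; ++i) {
    volatile char* p = top - i * page;  // first byte of the i-th page from the top
    *p = 1;
  }
  const long resident = count_resident_pages(mem, len);
  munmap(mem, len);
  return resident;
}
```

Calling touched_pages_demo(16, 3) should report at least the 3 touched pages as resident. A periodic sampler along these lines would multiply the resident count by the page size to estimate each thread's used stack.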

About that, the warning zone may increase the stack size on platforms with larger base page sizes (e.g. PPC), where we have 64K pages.

Cheers, Thomas

On Mon, Dec 14, 2020 at 6:03 AM 杨易(青风) <qingfeng.yy at alibaba-inc.com> wrote:
Hi all,
 We occasionally get stack overflow errors when the application has deeply nested calls. The usual solution is to increase -Xss, but nothing tells us in advance that the thread stack space is running out. This patch attempts to address this issue. It introduces a new stack zone (warning_zone) for the HotSpot VM, which can be used to detect whether the footprint ratio of the thread stack has reached the watermark set by the user.

 The stack layout with this patch is as follows:
 stack_base                                                                                    stack_end
 ---------------------------------------------------------------------------------------------------------------------------
 |        |   shadow zone |                   warning zone(added)                   | reserved zone | yellow zone | red zone |
 ---------------------------------------------------------------------------------------------------------------------------
          ^rsp                   |<-StackWarningRatio % of available stack->

 Users can set -XX:StackWarningRatio=0 (the default) to turn this feature off; there is then no warning zone and the stack layout is the same as usual.
 If the user sets -XX:StackWarningRatio to 50, the VM protects 50% of the available stack space (let all = available+reserved+yellow+red) as the stack warning zone. When the footprint of the available stack space reaches 50%, that is, when the application accesses the stack warning zone, a SIGSEGV signal is raised; the signal handler then reports this access in some way (e.g. by sending a JFR event or logging) and resumes the current execution. As a prototype, this feature is implemented only on x86_posix; it can be implemented on more platforms as needed.
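
 The sizing rule above can be illustrated with a small calculation. This is a sketch under stated assumptions, not the patch's code: the function name warning_zone_bytes is hypothetical, the ratio is taken to be the protected percentage of the available stack space, and the result is assumed to be rounded up to whole pages because memory protection works at page granularity.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: size in bytes of the warning zone for a given ratio.
// Assumption: the zone is rounded up to a whole number of pages.
// page_size must be a power of two; ratio_percent must be in [0, 100].
inline std::size_t warning_zone_bytes(std::size_t available_bytes,
                                      unsigned ratio_percent,
                                      std::size_t page_size) {
  assert(ratio_percent <= 100);
  assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
  if (ratio_percent == 0) return 0;  // feature disabled: no warning zone
  const std::size_t raw = available_bytes * ratio_percent / 100;
  // Round up to the next page boundary.
  return (raw + page_size - 1) & ~(page_size - 1);
}
```

 Under these assumptions, 1 MiB of available stack with a ratio of 50 and 4K pages gives a 512 KiB zone. With only 100 KiB available, the same ratio gives 52 KiB on 4K pages but a full 64 KiB on 64K pages, which shows how the page size affects the precision of the watermark.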

 Could anyone give some suggestions? If you think this is a useful feature, I will file an issue and support it on all platforms. Looking forward to your comments.

 Patch: https://github.com/kelthuzadx/jdk/commit/a145b2ebb4ac568e3bb090cbe3e7f091e1dc69ea
 Diff: https://github.com/kelthuzadx/jdk/compare/37dc675..a145b2e

 Best,
 Yang Yi