RFR: 8225035: Thread stack size issue caused by large TLS size

Jiangli Zhou jianglizhou at google.com
Sat Jun 15 17:24:16 UTC 2019


Hi David,

Sorry for the delay. Thanks a lot for the continued discussions in the
thread and addressing some of the questions came up in the mailing list.
Really appreciate it!

On Wed, Jun 12, 2019 at 12:27 AM David Holmes <david.holmes at oracle.com>
wrote:

> On 12/06/2019 1:55 pm, Jiangli Zhou wrote:
> > Hi David,
> >
> > On Tue, Jun 11, 2019 at 7:25 PM David Holmes <david.holmes at oracle.com>
> wrote:
> >>
> >> Hi Jiangli,
> >>
> >> On 12/06/2019 3:51 am, Jiangli Zhou wrote:
> >>> Hi David,
> >>>
> >>> On Tue, Jun 11, 2019 at 12:56 AM David Holmes <david.holmes at oracle.com>
> wrote:
> >>>>
> >>>> Hi Jiangli,
> >>>>
> >>>> Some initial feedback ...
> >>>>
> >>>> On 11/06/2019 8:29 am, Jiangli Zhou wrote:
> >>>>> Hi David and Florian,
> >>>>>
> >>>>> This the a follow up of the discussion regarding the TLS (Thread
> Local
> >>>>> Storage) issue:
> >>>>>
> http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2019-May/034459.html
> .
> >>>>> The following is an updated patch with your suggestions incorporated:
> >>>>>
> >>>>> 1) Use __pthread_get_minstack
> >>>>> 2) By default only adjust the thread stack size if the
> >>>>> minstack-to-stacksize ratio is >= 10%
> >>>>> 3) Introduce a new command-line option (Linux only),
> >>>>> AdjustWithMinStackAt, which can be used to change the default 10%
> >>>>> ratio for triggering stack size adjustment
> >>>>
> >>>> I don't understand why the flag range is 0.0 to 50.0. If 0.1 = 10%
> then
> >>>> 50 = 500%, which means the minstack is 5x greater than the actual
> >>>> requested stack! The flag has to be limited between 0 and 100% AFAICS
> >>>> else the requested stack size is too small to accommodate the "min
> >>>> stack" and the attempt to create the thread will presumably fail. ??
> >>>
> >>> The minstack size is added on top of the current calculated size and
> >>> the resulting value is passed to pthread_attr_setstacksize before the
> >>> thread creation. We will be able to create a thread with stack size
> >>> that's large enough. I initially thought about using [0%, 100%] range,
> >>> which is 0 to 1 as well. However, that doesn't apply to the cases
> >>> where the minstack size is greater then the requested size. Not sure
> >>> how rare those cases are in reality, it is fairly easy to create a
> >>> test case with a minstack > request_size. In the added jtreg test, the
> >>> minstack is 540k with the TLS values.
> >>>
> >>> I picked 50.0, but I'm not too certain about using it as the upper
> >>> bound. We probably want to use a larger value to cover more cases.
> >>
> >> My point is that if TLS > requested stack then without enabling this
> >> workaround the thread creation will presumably fail - right? And it
> >> makes no sense to me to say "use the workaround if TLS is 5x greater
> >> than normal stack but don't if its only 4x greater" as both cases are
> >> equivalent from the perspective of "can the thread run" (answer: no!).
> >>
> >> I think this only needs to be a simple percentage flag: if TLS-size is >
> >> X% of requested stack size then add TLS-size. So 0% would always enable
> >> it and 100% would disable it until TLS-size was equal to requested stack
> >> size. I don't think it make sense to consider the case where TLS >
> >> requested stack size.
> >
> > My reasoning was based on the ratio of
> > minstack_to_stack-size-without-minstack-adjustment. We would have to
> > consider the cases where the minstack > the requested size (without
> > adjustment), which is > 100%. It's a bit awkward, I admit.
>
> But it's also > 10% so will be handled by default. All that supporting >
> 100% would achieve is a means to disable the workaround (up to a limit),
> which would simply result in failure.
>

Yes, I was considering the cases for disabling the workaround. Mainly for
testing the current patch.

I agree, with a large TLS size (larger than the requested stack size),
disabling it is not meaningful.


>
> > Using the ratio of minstack_to_total-size, we can use the simple
> > percentage flag as you suggested, with 0% would always enable it and
> > 100% would disable it. Sounds good.
>
> 100% doesn't disable it, it's the maximum limit for enabling.
>

100% would diable it unless the requested stack size is 0. However, my
point about disabling the workaround is moot.


>
> Do we need a way to disable it? Do we need a way for someone to say "I
> never expect excessive TLS so never adjust my stacks and just let things
> fail if the TLS size is too big?" ??
>
> >>>>
> >>>>     827 static void pthread_get_minstack_init() {
> >>>>     828   // The code below uses the static symbol if available.
> >>>>     829   // If not, it will use dlsym to search for the symbol at
> runtime.
> >>>>     830   p_pthread_get_minstack = &__pthread_get_minstack;
> >>>>
> >>>> I don't understand this part. If at runtime the "static" symbol
> doesn't
> >>>> exist, won't this generate an error from the loader and prevent the VM
> >>>> from starting?
> >>>
> >>> I see Florian has provided detailed information on this question
> (thanks!).
> >>>
> >>>>
> >>>> Style nit: p_pthread_get_minstack
> >>>>
> >>>> Can you drop the "p" prefix and just use _pthread_get_minstack similar
> >>>> to other dynamically determined functions please.
> >>>
> >>> Done.
> >>>
> >>>>
> >>>> +     min_stack_size = (*p_pthread_get_minstack)(attr);
> >>>>
> >>>> When you have a function pointer you don't need to dereference it to
> >>>> invoke it, you can just do:
> >>>>
> >>>>      min_stack_size = p_pthread_get_minstack(attr);
> >>>
> >>> Dropped dereferencing.
> >>>
> >>>>
> >>>> +   if (p_pthread_get_minstack) {
> >>>>
> >>>> Style nit: Please avoid implicit booleans and check != NULL.
> >>>
> >>> Done.
> >>>
> >>>>
> >>>> I don't quite see what the test is doing - why does the Java code run
> >>>> "echo" ??
> >>>
> >>> The TLS issue can be easily demonstrated by starting a process using
> >>> ProcessBuilder. It could be any valid command that can be passed to
> >>> ProcessBuilder. "echo" is inherited from the original test case
> >>> attached to JDK-8130425.
> >>
> >> Sorry I'm missing something here. What has an external process got to do
> >> with TLS stealing stack from a newly created thread ??
> >
> > Starting a ProcessBuilder causes the process reaper thread being
> > created. The reaper thread can fail with StackOverflow in one of the
> > failure mode  (as the originally reported issue) with certain TLS
> > sizes, because it has small stack size. With JDK 13, the
> > REAPER_DEFAULT_STACKSIZE is 128K. With JDK 8, it is 32K. Using the
> > process reaper thread is easy to demonstrate the problem.
>
> Ah! Thanks - yes I'd forgotten about that.
>

That reminded me that I should add detailed comments in the test regarding
ProcessBuilder and reaper thread. That would be beneficial in the future.
Thanks!


> >>
> >>>> I expected to see different values of AdjustWithMinStackAt
> >>>> being used and threads then checking whether their stacks have been
> >>>> adjusted or not; and in some cases to get errors. ??
> >>>
> >>> It's a bit tricky, as we may not know the exact failure behavior
> >>> caused by insufficient stack space in different environments...
> >>
> >> Nevertheless I expect that a test in this area would be combining
> >> different amounts of TLS with different values of the flag to check
> >> whether thread's can be created and run. At the moment I don't get what
> >> this test is actually testing - what constitutes a pass and what
> >> constitutes a fail?
> >
> > In the added jtreg test, it checks the output from the started
> > process. The test passes when the output matches with the expected
> > string (that means the process is started successfully).
> >
> > The set up of the test is capable of reporting failures by verifying
> > the output from the started process. The part that I'm not too sure is
> > what happens if it's a hang (or other strange failure mode that I'm
> > currently not aware of). I guess the test would fail with a timeout if
> > it's a hang.
>
> The tests I was envisaging actually require a means to say "don't do the
> workaround" so that we get thread creation failure if, for example,
> TLS-size > requested-stack-size.
>

That's actually the reason why the updated patch used [0, 50] range (I
manually set the ratio to test the patch). For the test case with 128K int
__thread variables, setting a ratio that's > 4 (400%) would disable the
stack size adjustment and demonstrate the failure.


> Maybe we need a Whitebox test, or gtest, to actually verify the
> allocated stack size has been adjusted by the TLS-size? Otherwise I
> don't see how we're actually verifying things are working as expected.
>
> Definitely tricky.
>

Maybe adding a logging message and verify that in the test?

Best regards,

Jiangli


>
> Thanks,
> David
>
> > Best regard,
> > Jiangli
> >
> >>
> >> Thanks,
> >> David
> >>
> >>>>
> >>>>> CSR: https://bugs.openjdk.java.net/browse/JDK-8225498
> >>>>
> >>>> I'm still thinking about the right name, type and range for the flag,
> >>>> which will impact the CSR.
> >>>
> >>> Thanks! Please let me know your suggestions, I'll be more than happy
> >>> to incorporate them.
> >>>
> >>> Best regards,
> >>> Jiangli
> >>>
> >>>>
> >>>> Thanks,
> >>>> David
> >>>> -----
> >>>>
> >>>>> webrev: http://cr.openjdk.java.net/~jiangli/8225035/webrev.00/
> >>>>> bug: https://bugs.openjdk.java.net/browse/JDK-8225035
> >>>>>
> >>>>> Best regards,
> >>>>>
> >>>>> Jiangli
> >>>>>
>


More information about the hotspot-runtime-dev mailing list