RFR: 8225035: Thread stack size issue caused by large TLS size

David Holmes david.holmes at oracle.com
Wed Jun 12 07:27:07 UTC 2019


On 12/06/2019 1:55 pm, Jiangli Zhou wrote:
> Hi David,
> 
> On Tue, Jun 11, 2019 at 7:25 PM David Holmes <david.holmes at oracle.com> wrote:
>>
>> Hi Jiangli,
>>
>> On 12/06/2019 3:51 am, Jiangli Zhou wrote:
>>> Hi David,
>>>
>>> On Tue, Jun 11, 2019 at 12:56 AM David Holmes <david.holmes at oracle.com> wrote:
>>>>
>>>> Hi Jiangli,
>>>>
>>>> Some initial feedback ...
>>>>
>>>> On 11/06/2019 8:29 am, Jiangli Zhou wrote:
>>>>> Hi David and Florian,
>>>>>
>>>>> This the a follow up of the discussion regarding the TLS (Thread Local
>>>>> Storage) issue:
>>>>> http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2019-May/034459.html.
>>>>> The following is an updated patch with your suggestions incorporated:
>>>>>
>>>>> 1) Use __pthread_get_minstack
>>>>> 2) By default only adjust the thread stack size if the
>>>>> minstack-to-stacksize ratio is >= 10%
>>>>> 3) Introduce a new command-line option (Linux only),
>>>>> AdjustWithMinStackAt, which can be used to change the default 10%
>>>>> ratio for triggering stack size adjustment
>>>>
>>>> I don't understand why the flag range is 0.0 to 50.0. If 0.1 = 10% then
>>>> 50 = 500%, which means the minstack is 5x greater than the actual
>>>> requested stack! The flag has to be limited between 0 and 100% AFAICS
>>>> else the requested stack size is too small to accommodate the "min
>>>> stack" and the attempt to create the thread will presumably fail. ??
>>>
>>> The minstack size is added on top of the current calculated size and
>>> the resulting value is passed to pthread_attr_setstacksize before the
>>> thread creation. We will be able to create a thread with stack size
>>> that's large enough. I initially thought about using [0%, 100%] range,
>>> which is 0 to 1 as well. However, that doesn't apply to the cases
>>> where the minstack size is greater then the requested size. Not sure
>>> how rare those cases are in reality, it is fairly easy to create a
>>> test case with a minstack > request_size. In the added jtreg test, the
>>> minstack is 540k with the TLS values.
>>>
>>> I picked 50.0, but I'm not too certain about using it as the upper
>>> bound. We probably want to use a larger value to cover more cases.
>>
>> My point is that if TLS > requested stack then without enabling this
>> workaround the thread creation will presumably fail - right? And it
>> makes no sense to me to say "use the workaround if TLS is 5x greater
>> than normal stack but don't if its only 4x greater" as both cases are
>> equivalent from the perspective of "can the thread run" (answer: no!).
>>
>> I think this only needs to be a simple percentage flag: if TLS-size is >
>> X% of requested stack size then add TLS-size. So 0% would always enable
>> it and 100% would disable it until TLS-size was equal to requested stack
>> size. I don't think it make sense to consider the case where TLS >
>> requested stack size.
> 
> My reasoning was based on the ratio of
> minstack_to_stack-size-without-minstack-adjustment. We would have to
> consider the cases where the minstack > the requested size (without
> adjustment), which is > 100%. It's a bit awkward, I admit.

But it's also > 10% so will be handled by default. All that supporting > 
100% would achieve is a means to disable the workaround (up to a limit), 
which would simply result in failure.

> Using the ratio of minstack_to_total-size, we can use the simple
> percentage flag as you suggested, with 0% would always enable it and
> 100% would disable it. Sounds good.

100% doesn't disable it, it's the maximum limit for enabling.

Do we need a way to disable it? Do we need a way for someone to say "I 
never expect excessive TLS so never adjust my stacks and just let things 
fail if the TLS size is too big?" ??

>>>>
>>>>     827 static void pthread_get_minstack_init() {
>>>>     828   // The code below uses the static symbol if available.
>>>>     829   // If not, it will use dlsym to search for the symbol at runtime.
>>>>     830   p_pthread_get_minstack = &__pthread_get_minstack;
>>>>
>>>> I don't understand this part. If at runtime the "static" symbol doesn't
>>>> exist, won't this generate an error from the loader and prevent the VM
>>>> from starting?
>>>
>>> I see Florian has provided detailed information on this question (thanks!).
>>>
>>>>
>>>> Style nit: p_pthread_get_minstack
>>>>
>>>> Can you drop the "p" prefix and just use _pthread_get_minstack similar
>>>> to other dynamically determined functions please.
>>>
>>> Done.
>>>
>>>>
>>>> +     min_stack_size = (*p_pthread_get_minstack)(attr);
>>>>
>>>> When you have a function pointer you don't need to dereference it to
>>>> invoke it, you can just do:
>>>>
>>>>      min_stack_size = p_pthread_get_minstack(attr);
>>>
>>> Dropped dereferencing.
>>>
>>>>
>>>> +   if (p_pthread_get_minstack) {
>>>>
>>>> Style nit: Please avoid implicit booleans and check != NULL.
>>>
>>> Done.
>>>
>>>>
>>>> I don't quite see what the test is doing - why does the Java code run
>>>> "echo" ??
>>>
>>> The TLS issue can be easily demonstrated by starting a process using
>>> ProcessBuilder. It could be any valid command that can be passed to
>>> ProcessBuilder. "echo" is inherited from the original test case
>>> attached to JDK-8130425.
>>
>> Sorry I'm missing something here. What has an external process got to do
>> with TLS stealing stack from a newly created thread ??
> 
> Starting a ProcessBuilder causes the process reaper thread being
> created. The reaper thread can fail with StackOverflow in one of the
> failure mode  (as the originally reported issue) with certain TLS
> sizes, because it has small stack size. With JDK 13, the
> REAPER_DEFAULT_STACKSIZE is 128K. With JDK 8, it is 32K. Using the
> process reaper thread is easy to demonstrate the problem.

Ah! Thanks - yes I'd forgotten about that.

>>
>>>> I expected to see different values of AdjustWithMinStackAt
>>>> being used and threads then checking whether their stacks have been
>>>> adjusted or not; and in some cases to get errors. ??
>>>
>>> It's a bit tricky, as we may not know the exact failure behavior
>>> caused by insufficient stack space in different environments...
>>
>> Nevertheless I expect that a test in this area would be combining
>> different amounts of TLS with different values of the flag to check
>> whether thread's can be created and run. At the moment I don't get what
>> this test is actually testing - what constitutes a pass and what
>> constitutes a fail?
> 
> In the added jtreg test, it checks the output from the started
> process. The test passes when the output matches with the expected
> string (that means the process is started successfully).
> 
> The set up of the test is capable of reporting failures by verifying
> the output from the started process. The part that I'm not too sure is
> what happens if it's a hang (or other strange failure mode that I'm
> currently not aware of). I guess the test would fail with a timeout if
> it's a hang.

The tests I was envisaging actually require a means to say "don't do the 
workaround" so that we get thread creation failure if, for example, 
TLS-size > requested-stack-size.

Maybe we need a Whitebox test, or gtest, to actually verify the 
allocated stack size has been adjusted by the TLS-size? Otherwise I 
don't see how we're actually verifying things are working as expected.

Definitely tricky.

Thanks,
David

> Best regard,
> Jiangli
> 
>>
>> Thanks,
>> David
>>
>>>>
>>>>> CSR: https://bugs.openjdk.java.net/browse/JDK-8225498
>>>>
>>>> I'm still thinking about the right name, type and range for the flag,
>>>> which will impact the CSR.
>>>
>>> Thanks! Please let me know your suggestions, I'll be more than happy
>>> to incorporate them.
>>>
>>> Best regards,
>>> Jiangli
>>>
>>>>
>>>> Thanks,
>>>> David
>>>> -----
>>>>
>>>>> webrev: http://cr.openjdk.java.net/~jiangli/8225035/webrev.00/
>>>>> bug: https://bugs.openjdk.java.net/browse/JDK-8225035
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Jiangli
>>>>>


More information about the hotspot-runtime-dev mailing list