Exponentially delay subsequent native thread creation in case of EAGAIN
David Holmes
david.holmes at oracle.com
Tue Apr 15 03:07:30 UTC 2025
Hi Yannik,
On 15/04/2025 2:22 am, Yannik Stradmann wrote:
> Hello everyone,
>
> I'd like to propose a change to hotspot's error handling when spawning native threads in os::create_thread().
>
> Currently, if EAGAIN is encountered, we retry three times back-to-back.
>
> During recent years, I've experienced instabilities on certain systems, where back-to-back (re-)requests of native threads kept hitting the depleted resource pool and, eventually, failed.
>
> I therefore propose to introduce an exponential backoff when hitting EAGAIN during native thread creation. Hotspot will thereby be more kind to an already depleted resource, reduce stress on the kernel and become more robust on systems under high load.
>
> For reference, I am attaching a patch against os_linux.cpp, which has been running in production on a mid-scale Jenkins cluster over the past three years. If you approve the modification, I'm happy to create a pull request that includes the other platforms (where applicable).
> The current choice of constants is arbitrary and I'd welcome any suggestions here.
This is not an unreasonable idea. But it is very hard to evaluate the
effectiveness of such a change. Do you have any actual data on how many
retries were needed before thread creation succeeded?
When the retries were added in:
https://bugs.openjdk.org/browse/JDK-8268773
there was some discussion across a number of bug reports and two PRs
about the potential usefulness of even doing a basic retry as the error
condition was considered unlikely to be self-correcting. But as per
that original change, adding a delay between retries does no harm other
than delaying the ultimate reporting of an error, so it may be okay to
put in place if it will do some good.
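To put a rough number on that delay in error reporting, here is a minimal sketch (a hypothetical helper, not HotSpot code) that replays the sleep schedule of the proposed loop, assuming every pthread_create() call returns EAGAIN. The names backoff_schedule and total_delay_us are made up for illustration; the constants used in the note below are the ones from the attached patch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of HotSpot: returns the sleep durations
// (in microseconds) that a retry loop shaped like the one in the patch
// would use if every thread creation attempt failed with EAGAIN.
std::vector<std::uint64_t> backoff_schedule(int limit,
                                            std::uint64_t initial_delay_us,
                                            std::uint64_t max_delay_us) {
  assert(initial_delay_us > 0 && initial_delay_us <= max_delay_us);
  std::vector<std::uint64_t> sleeps;
  std::uint64_t delay = initial_delay_us;
  while (limit-- > 0) {          // one sleep per permitted retry
    sleeps.push_back(delay);
    delay *= 2;                  // exponential growth
    if (delay > max_delay_us) {  // clamp to the configured ceiling
      delay = max_delay_us;
    }
  }
  return sleeps;
}

// Sums the schedule to get the worst-case extra latency before the
// failure is finally reported.
std::uint64_t total_delay_us(const std::vector<std::uint64_t>& sleeps) {
  std::uint64_t sum = 0;
  for (std::uint64_t s : sleeps) {
    sum += s;
  }
  return sum;
}
```

With the patch's constants (limit 5, initial delay 1'000us, cap 1'000'000us) this gives sleeps of 1, 2, 4, 8 and 16 ms, so a persistent EAGAIN would be reported roughly 31 ms later than with no sleeps at all, and the 1 s cap is never reached.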
I've filed an enhancement request on your behalf:
https://bugs.openjdk.org/browse/JDK-8354560
> Please note that this is my first time contributing to OpenJDK, please excuse potential unfamiliarities with the process.
Please see the following:
https://openjdk.org/guide/
Thanks,
David
>
> Yannik
>
>
> diff --git a/src/hotspot/os/linux/os_linux.cpp b/src/hotspot/os/linux/os_linux.cpp
> index 4e26797cd5b..2858fbba247 100644
> --- a/src/hotspot/os/linux/os_linux.cpp
> +++ b/src/hotspot/os/linux/os_linux.cpp
> @@ -1064,10 +1064,28 @@ bool os::create_thread(Thread* thread, ThreadType thr_type,
>      ResourceMark rm;
>      pthread_t tid;
>      int ret = 0;
> -    int limit = 3;
> -    do {
> +    int limit = 5;
> +    useconds_t delay = 1'000;
> +    constexpr useconds_t max_delay = 1'000'000;
> +
> +    while (true) {
>        ret = pthread_create(&tid, &attr, (void* (*)(void*)) thread_native_entry, thread);
> -    } while (ret == EAGAIN && limit-- > 0);
> +
> +      if (ret != EAGAIN) {
> +        break;
> +      }
> +
> +      if (limit-- <= 0) {
> +        break;
> +      }
> +
> +      log_warning(os, thread)("Failed to start native thread (%s), retrying after %dus.", os::errno_name(ret), delay);
> +      ::usleep(delay);
> +      delay *= 2;
> +      if (delay > max_delay) {
> +        delay = max_delay;
> +      }
> +    }
>
>      char buf[64];
>      if (ret == 0) {