RFR: JDK-8274320: os::fork_and_exec() should be using posix_spawn

Thu Oct 7 06:25:06 UTC 2021

On Wed, 6 Oct 2021 12:48:40 GMT, David Holmes <david.holmes at oracle.com> wrote:

> >   
> > Therefore I think we don't lose anything by moving to posix_spawn(). But we gain reliability in high-footprint scenarios.
> 
> Sorry but I have to disagree. fork() is async-signal-safe, but if an at-fork handler is not, then all bets are off - that is fine, it is the best we can do. But posix_spawn makes no claim to any kind of async-signal safety so we very much do lose something by switching to it IMO.
> 
> David
> -----

Hi David,

I looked a bit closer, since I wanted to figure out whether the async-signal-unsafety of calls in error reporting actually matters. The problem is that these functions are not re-entrant, right? But we already have an intricate mechanism in place to guard against re-entrance errors: our secondary signal handling.

So the first time we enter error handling, we mark this thread as the reporting one and install the secondary signal handler. All subsequent invocations open a new frame, and we skip the error reporting steps that caused the last error. So for re-entrance problems like this:

- main()
- signal !
- posix_spawn()
- signal !
- posix_spawn()

we are covered: most steps are guarded against re-execution by the step-counting logic inside `VMError::report()` and by boolean flags inside `VMError::report_and_die()` ("log_done" and such).
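
To illustrate the step-counting idea, here is a minimal sketch. It is not the HotSpot code: `Reporter`, `run_report` and the `faulty_step` parameter are hypothetical names, and the secondary crash is simulated by an early return instead of a real signal. The only point is that the counter is advanced before a step body runs, so a re-entrant call resumes after the step that faulted.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the reporter state (not the real VMError class).
struct Reporter {
  int current_step = 0;            // number of the last step that was started
  std::vector<std::string> log;    // every step that was entered, in order

  // Runs steps 1..3. Returns false if a step "crashed" (simulated), true when
  // all remaining steps have run. The counter is bumped *before* the step
  // body, so a later call never re-enters the step that faulted.
  bool report(int faulty_step) {
    while (current_step < 3) {
      int step = ++current_step;
      log.push_back("step " + std::to_string(step));
      if (step == faulty_step) {
        return false;              // simulated secondary crash inside this step
      }
    }
    return true;
  }
};

// Hypothetical driver: mimics the secondary signal handler re-invoking the
// reporter until it completes.
std::vector<std::string> run_report(int faulty_step) {
  Reporter r;
  while (!r.report(faulty_step)) {
    // re-enter; the faulting step is skipped because current_step moved on
  }
  return r.log;
}
```

With `run_report(2)`, step 2 is entered once, fails, and the second invocation continues with step 3; no step is entered twice.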

(Note that this mechanism seems not to be well understood and is bit-rotting: recent additions such as `Jfr::on_vm_shutdown` miss this logic, so that step would be executed over and over again. That is a separate issue and should be fixed.)

As for `OnError`, it is guarded against multiple executions via `skip_OnError`:

  static bool skip_OnError = false;
  if (!skip_OnError && OnError && OnError[0]) {
    skip_OnError = true;

So it won't be re-executed if a secondary signal happens inside error handling itself.

The only caveat here is that it does not guard us against problems if the non-reentrant function we call in signal handling is already active further up the stack when the signal arrives:

- main()
- posix_spawn()
- signal !
- posix_spawn()

But for this to happen, the signal must originate from posix_spawn itself, and be a synchronous error signal which causes us to invoke error handling. So, posix_spawn() needs to be crashy in the first place. I'd argue that the chances for this to happen are very slim, unless the libc itself is broken.

-----

fork() is async-signal-safe only if no atfork handlers are used. We cannot know that, since we share the process with other entities, including system libraries themselves. I even dimly remember reading that glibc itself uses atfork handlers for internal cleanup, but I cannot come up with proof. In any case, using atfork handlers is a common technique used by libraries to close mutexes on fork. So the current fork() has never been completely async-signal-safe either.
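
As a small illustration of why we cannot rule this out: any code in the process can register a handler with `pthread_atfork()`, and that handler then runs inside every later `fork()`. The sketch below only counts invocations; `on_prepare`, `g_prepare_calls` and `fork_with_handler` are hypothetical names, and a real library handler would typically take locks instead.

```cpp
#include <cassert>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical counter showing that the handler ran in the forking process.
static int g_prepare_calls = 0;

// "prepare" handler: runs in the parent, inside fork(), before the child exists.
static void on_prepare() { ++g_prepare_calls; }

// Registers the handler, forks once, reaps the child, and returns how often
// the prepare handler has run so far (or -1 on error).
int fork_with_handler() {
  if (pthread_atfork(on_prepare, nullptr, nullptr) != 0) {
    return -1;
  }
  pid_t pid = fork();
  if (pid < 0) {
    return -1;
  }
  if (pid == 0) {
    _exit(0);                      // child: leave immediately
  }
  int status = 0;
  if (waitpid(pid, &status, 0) < 0) {
    return -1;
  }
  return g_prepare_calls;
}
```

Whatever such a handler does (locking, allocation, logging) becomes part of what `fork()` does in that process, and the caller of `fork()` has no say in it.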

-----

posix_spawn has the charm that it allows us to circumvent a very common problem with forking in low-memory situations. Like vfork(), but with less risk involved. We analyzed this ([1]) when @dmlloyd proposed to replace vfork with posix_spawn in Runtime.exec(). He convinced me that this is a good idea. posix_spawn(), at least on glibc and musl libc, uses `clone(CLONE_VM | CLONE_VFORK)` and mitigates the vfork problems by starting the child off on its own stack.
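
For readers unfamiliar with the API, a minimal sketch of running a shell command via `posix_spawn()` and waiting for it could look like the following. This is an illustration only, not the code in the PR; `spawn_and_wait` is a hypothetical helper, and the sketch assumes `/bin/sh` exists.

```cpp
#include <cassert>
#include <cerrno>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern "C" char** environ;

// Hypothetical helper: runs `cmd` through "/bin/sh -c" and returns the child's
// exit status, or -1 if spawning or waiting failed or the child was killed.
int spawn_and_wait(const char* cmd) {
  const char* argv[] = {"sh", "-c", cmd, nullptr};
  pid_t pid;
  // posix_spawn() returns an error number directly instead of setting errno.
  int rc = posix_spawn(&pid, "/bin/sh", nullptr, nullptr,
                       const_cast<char* const*>(argv), environ);
  if (rc != 0) {
    return -1;
  }
  int status = 0;
  while (waitpid(pid, &status, 0) < 0) {
    if (errno != EINTR) {          // retry only if interrupted by a signal
      return -1;
    }
  }
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, `spawn_and_wait("exit 7")` reports the shell's exit status. No file actions or spawn attributes are set here; whether the real change needs any is outside the scope of this sketch.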

So, in exchange for a theoretical problem which I think is very narrow, we'd get reliability in common situations (VM with high footprint). I think that tradeoff makes sense.

Cheers, Thomas

[1] http://mail.openjdk.java.net/pipermail/core-libs-dev/2018-October/056158.html

-------------

PR: https://git.openjdk.java.net/jdk/pull/5698