RFR: JDK-8274320: os::fork_and_exec() should be using posix_spawn

Thu Oct 7 07:32:48 UTC 2021

Hi Thomas,

On 7/10/2021 4:25 pm, Thomas Stuefe wrote:
> On Wed, 6 Oct 2021 12:48:40 GMT, David Holmes <david.holmes at oracle.com> wrote:
> 
>>>    
>>> Therefore I think we don't lose anything by moving to posix_spawn(). But we gain reliability in high-footprint scenarios.
>>
>> Sorry but I have to disagree. fork() is async-signal-safe, but if an at-fork handler is not, then all bets are off - that is fine, it is the best we can do. But posix_spawn makes no claim to any kind of async-signal safety so we very much do lose something by switching to it IMO.
>>
>> David -----
> 
> Hi David,
> 
> I looked a bit closer, since I wanted to figure out whether the async-unsafeness of calls in error reporting actually matters. Because the problem is that these functions are not re-entrant, right? But we already have an intricate mechanism in place to guard against re-entrance errors, with our secondary signal handling.

It's not the reentrancy problem I'm concerned with but simply the 
ability to "safely" call fork() from within a signal-handling context 
because it is marked as async-signal-safe (not withstanding that an 
at-fork handler may not be).

My worry is that we may hit cases where posix_spawn just causes the VM 
to hang.

I know we already take many risks in the error reporting code with 
regard to the ability to actually execute stuff from a signal handler, 
but I still feel it necessary to raise this every time something new in 
error reporting is proposed. This is stuff that only goes wrong live in 
the field after we have shipped a release.

I understand the desire to have a more reliable forking mechanism with 
respect to memory management, but that has to be weighed against other 
reliability factors.

I'm not aware of our existing use of fork() being flagged as causing 
major problems in that regard. So in my mind this change increases the 
risk of a hang, whilst "fixing" a problem that hasn't AFAIK really been 
raised as a problem.

I'm happy to hear other opinions on this.

Cheers,
David
-----

> So the first time we enter error handling, we mark this thread as the reporting one and install the secondary signal handler. All subsequent invocations open a new frame, and we skip the error reporting steps that caused the last error. So for re-entrance problems like this:
> 
> - main()
> - signal !
> - posix_spawn()
> - signal !
> - posix_spawn()
> 
> we are covered: in VMError::report_and_die(), most steps are guarded against re-execution by the step-counting-logic inside `VMError::report()` and by boolean flags inside `VMError::report_and_die()` ("log_done" and such).
> 
> (Note that this mechanism seems not well understood and bitrots: recent addition like the `Jfr::on_vm_shutdown` miss this logic. That step would be executed over and over again. That is a separate issue and should be fixed.)
> 
> Wrt to `OnError`, it is guarded against multiple executions via `skip_OnError`:
> 
> 
>    static bool skip_OnError = false;
>    if (!skip_OnError && OnError && OnError[0]) {
>      skip_OnError = true;
> 
> 
> So it won't be re-executed if a secondary signal happens inside error handling itself.
> 
> The only caveat here is that it does not guard us against problems if the non-reentrant function we call in signal handling is already atop of us on the stack:
> 
> - main()
> - posix_spawn()
> - signal !
> - posix_spawn()
> 
> But for this to happen, the signal must originate from posix_spawn itself, and be a synchronous error signal which causes us to invoke error handling. So, posix_spawn() needs to be crashy in the first place. I'd argue that the chances for this to happen are very slim, unless the libc itself is broken.
> 
> -----
> 
> fork() is async signal safe only if no atfork handlers are used. We don't know that since we share the process with other entities, including system libraries themselves. I even dimly remember reading that the glibc itself using atfork handlers for internal cleanup, but cannot come up with a prove. But using atfork handlers is a common technique used by libraries to close mutexes on fork. So the current fork() never has been completely async signal safe either.
> 
> -----
> 
> posix_spawn has the charm that it allows us to circumvent a very common problem with forking in low-memory situations. Like vfork(), but with less risk involed. We analyzed this ([1]) when @dmlloyd proposed to exchange vfork against posix_spawn in Runtime.exec(). He convinced me that this is a good idea. posix_spawn(), at least on glibc and muslc, uses `clone(CLONE_VM | CLONE_VFORK)` and mitigates the vfork-problems by starting the child off in an own stack.
> 
> So, we in exchange for a theoretical problem which I think is very narrow, we'd get reliability in common situations (VM with high footprint). I think that tradeoff makes sense.
> 
> Cheers, Thomas
> 
> [1] http://mail.openjdk.java.net/pipermail/core-libs-dev/2018-October/056158.html
> 
> -------------
> 
> PR: https://git.openjdk.java.net/jdk/pull/5698
>