Hi Florian,

our mails crossed... I think I am fine now with posix_spawn(), provided we do enough testing. But I'll answer your questions inline.

On Mon, Oct 22, 2018 at 9:00 PM Florian Weimer <fweimer@redhat.com> wrote:
* Thomas Stüfe:
So far I have not read a single technical reason in this thread why vfork needs to be abandoned now, apart from it being obsolete. If you read my initial thread from September, you know that I think we have understood vfork's shortcomings very well, and that our (SAP's) proposed patch shows that they can be dealt with. In our port, our vfork+exec*2 has been solid for many years, without any issues.
The main problem for vfork in application code is that you need to disable *all* signals, even signals used by the implementation. If a signal handler runs by accident while the vfork is active, memory corruption is practically guaranteed. The only way to disable the signals is with a direct system call; sigprocmask/pthread_sigmask do not work.
Does your implementation do this?
I understand. No, admittedly not. But we squeeze the vulnerable time window to the minimum possible: if (vfork() == 0) exec(..); which was a large step forward from the stock OpenJDK solution. While it is not completely bulletproof, I have not seen a single instance of an error in all these years (I understand those errors would be very intermittent and difficult to attribute to vfork+signalling, so we may have missed some).
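To make the "minimum window" idea concrete, here is a minimal sketch in C. It is not the actual SAP patch; run_with_vfork is a hypothetical helper name, and the wait/exit-status handling is only there so the example can be exercised on its own. As admitted above, it does not block signals.

```c
#define _DEFAULT_SOURCE          /* exposes vfork() in glibc headers */
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (illustration only): start `path` via vfork() and
 * exec immediately. Returns the child's exit status, or -1 on failure.
 * Signals are NOT blocked here, so the window discussed above remains. */
int run_with_vfork(const char *path, char *const argv[], char *const envp[])
{
    pid_t pid = vfork();
    if (pid < 0)
        return -1;                /* vfork itself failed */
    if (pid == 0) {
        /* Child shares the parent's memory: only exec or _exit here. */
        execve(path, argv, envp);
        _exit(127);               /* exec failed; must not return */
    }
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The child does nothing between vfork() and execve(), which is what keeps the window small; it does not close it.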
The current posix_spawn() implementation was added to glibc with glibc 2.24. So, what was the state of posix_spawn() before that version? Is it safe to use, does it do the right thing, or will we encounter regressions?
It uses fork by default. It can be told to use vfork, via POSIX_SPAWN_USEVFORK, but then it is buggy. For generic JDK code, this seems hardly appropriate.
Are you sure about this? The code I saw in glibc < 2.24 would use vfork if both attributes and file actions were NULL, which should be the case with OpenJDK and jspawnhelper. fork() would be bad and a reason not to use posix_spawn().
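For reference, the call shape I mean looks roughly like the sketch below. run_with_posix_spawn is a hypothetical helper name, not OpenJDK code. Whether the library uses fork, vfork or clone internally depends on the glibc version and cannot be seen from the caller's side.

```c
#include <spawn.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/* Hypothetical helper (illustration only): spawn `path` with neither
 * file actions nor attributes. Returns the child's exit status, or -1
 * if spawning or waiting failed. */
int run_with_posix_spawn(const char *path, char *const argv[])
{
    pid_t pid;
    /* Both file_actions and attrp are NULL. */
    int err = posix_spawn(&pid, path, NULL, NULL, argv, environ);
    if (err != 0)
        return -1;                /* posix_spawn returns an error number */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Note that a failed exec may show up either as a nonzero return value from posix_spawn() or as a child that exits with status 127, depending on the implementation.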
My Ubuntu 16.04 box runs glibc 2.23. Arguably, Ubuntu 16.04 is quite a common distro. I have to check our machines at work, but I am very sure that our zoo of SLES and RHEL servers does not run glibc >= 2.24 everywhere, especially on the more exotic architectures.
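One way to check which glibc a given machine actually runs is the glibc-specific gnu_get_libc_version() call. The sketch below is only an illustration; glibc_at_least is a hypothetical helper name, and it assumes the version string starts with "major.minor".

```c
#include <gnu/libc-version.h>    /* glibc-specific header */
#include <stdio.h>

/* Hypothetical helper (illustration only): returns 1 if the running
 * glibc is at least major.minor, 0 if it is older, and -1 if the
 * version string cannot be parsed. */
int glibc_at_least(int major, int minor)
{
    int have_major = 0, have_minor = 0;
    if (sscanf(gnu_get_libc_version(), "%d.%d",
               &have_major, &have_minor) != 2)
        return -1;
    if (have_major != major)
        return have_major > major;
    return have_minor >= minor;
}
```

Calling glibc_at_least(2, 24) would then tell whether the machine has the newer posix_spawn() implementation, with the caveat that distributions may backport changes without bumping the version.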
In glibc, the vfork-based optimization does not bring in any new ABIs, so it is in theory backportable. The main risk is that the vfork optimization landed in glibc 2.24, and the PID cache was removed in glibc 2.25. vfork with the PID cache was really iffy, but I would not recommend backporting the PID cache removal. But Debian 9/stretch uses glibc 2.24, and I think that shows that the vfork optimization with the PID cache should be safe enough. (Of course you need to remove the assert that fires if the vfork does not actually stop the parent process and is implemented as a fork; the glibc implementation still works, but with somewhat degraded error checking.)
How far back would you want to see this changed? Debian jessie and Red Hat Enterprise Linux 6 would be rather unlikely. If you want to target those, your only chance is to essentially duplicate the glibc implementation in OpenJDK.
As I wrote before, if I understand the code in glibc between 2.4 and 2.24 correctly, I think it uses vfork() and that should be fine by me: posix_spawn() using vfork(), with no attributes/file actions and in conjunction with the jspawnhelper, is almost exactly the same as the proposed vfork() + exec*2 patch: posix_spawn() will exec() immediately after the vfork(), then, in jspawnhelper, we set up the new process and exec() again. So I am fine with that, provided I have understood all that stuff correctly and not made an error in reasoning somewhere.

Cheers, Thomas
Thanks, Florian