RFR 9: 8077350 Process API Updates Implementation Review

Fri Apr 17 17:05:32 UTC 2015

Hi Thomas,

On 4/16/2015 3:01 PM, Thomas Stüfe wrote:
> Hi Roger,
>
> thank you for your answer!
>
> The reason I take an interest is not just theoretical. We (SAP) use 
> our JVM for our test infrastructure and we had exactly the problem 
> allChildren() is designed to solve: killing a process tree related to 
> a specific test (similar to jtreg tests) in case of errors or hangs. 
> We have test machines running large workloads of tests in parallel and 
> we reach pid wraparound - depending on the OS - quite fast.
>
> We solved this by adding process groups to Process.java and we are 
> very happy with this solution. We are able to quickly kill a whole 
> process tree, cleanly and completely, without ambiguity or risk to 
> other tests. Of course we had to add this support as a "sideways hack" 
> in order to not change the official Process.java interface. Therefore 
> I was hoping that with JEP 102, we would get official support for 
> process groups. Unfortunately, it seems the decision has already been 
> made and we are too late in the discussion :(
It would be interesting to see a description of what you added to/around
the API.
Process groups were left out for simplicity and to avoid interfering with
processes spawned by native libraries. If that complexity can be
understood, process groups/jobs could fulfill a need in a scalable system.

At this point, I'd like to deal with it as a separate request for 
enhancement.

>
> see my other comments inline.
>
> On Sat, Apr 11, 2015 at 8:55 PM, Roger Riggs <Roger.Riggs at oracle.com 
> <mailto:Roger.Riggs at oracle.com>> wrote:
>
>     Hi Thomas,
>
>     Thanks for the comments.
>
>     On 4/11/2015 8:31 AM, Thomas Stüfe wrote:
>>     Hi Roger,
>>
>>     I have a question about getChildren() and getAllChildren().
>>
>>     I assume the point of those functions is to implement point 4 of
>>     JEP 102 ("The ability to deal with process trees, in particular
>>     some means to destroy a process tree."), by returning a
>>     collection of PIDs which are the children of the process and then
>>     killing them?
>     Earlier versions included a killProcess tree method but it was
>     recommended to leave
>     the exact algorithm to kill processes to the caller.
>>
>>     However, I am not sure that this can be implemented in a safe
>>     way, at least on UNIX, because - as Martin already pointed out -
>>     of PID recycling. I do not see how you can prevent allChildren()
>>     from returning PIDs which may be already reaped and recycled when
>>     you use them later. How do you prevent that?
>     Unless there is an extended time between getting the children and
>     destroying them the pids will still be valid.
>
>
> Why? Child process may be getting reaped the instant you are done 
> reading it from /proc, and pid may have been recycled by the OS right 
> away and already pointing to another process when allChildren() 
> returns. If a process lives about as long as it takes the system to 
> reach a pid wraparound to the same pid value, its pid could be 
> recycled right after it is reaped, or? Sure, the longer you wait, the 
> higher the chance of this to happen, but it may happen right away.
>
> As Martin said, we have had those races in the kill() code for a long 
> time, but children()/allChildren() could make those errors more 
> probable, because now more processes are involved. Especially if you 
> use allChildren() to kill a deep process tree. And there is nothing in 
> the javadoc warning the user about this scenario. From time to time 
> you would just happen to kill an unrelated process. Those problems 
> are hard to debug.
>
>     The technique of caching the start time can prevent that case;
>     though it has AFAIK not been a problem.
>
>
> How would that work? User should, before issuing the kill, compare 
> start time of process to kill with cached start time?
See Peter's email; he described it more thoroughly than I have in
previous emails.
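
For illustration only, the start-time check could be sketched roughly as
follows. This assumes the ProcessHandle API as it eventually shipped in
Java 9 (descendants(), info().startInstant(), destroy()), whose naming
differs from the allChildren() draft discussed in this thread. The class
StartTimeGuard and its methods are hypothetical names, not part of the JDK.

```java
import java.time.Instant;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical helper: remember each descendant's start time when the tree is
// listed, and refuse to signal a PID whose start time has changed since then.
class StartTimeGuard {

    /** A process handle paired with the start time seen at snapshot time. */
    static final class Snapshot {
        final ProcessHandle handle;
        final Optional<Instant> start;

        Snapshot(ProcessHandle handle) {
            this.handle = handle;
            this.start = handle.info().startInstant();
        }

        /** True only if the process is alive and reports the cached start time. */
        boolean stillSameProcess() {
            return start.isPresent()
                    && handle.isAlive()
                    && start.equals(handle.info().startInstant());
        }
    }

    /** Lists all descendants of root together with their current start times. */
    static List<Snapshot> snapshotDescendants(ProcessHandle root) {
        return root.descendants().map(Snapshot::new).collect(Collectors.toList());
    }

    /** Requests termination of verified snapshots; returns how many requests succeeded. */
    static int destroyVerified(List<Snapshot> snapshots) {
        int requested = 0;
        for (Snapshot s : snapshots) {
            // Skip PIDs that may have been recycled or that report no start time.
            if (s.stillSameProcess() && s.handle.destroy()) {
                requested++;
            }
        }
        return requested;
    }
}
```

A small window remains between the comparison and the destroy() call, so
this narrows the race rather than eliminating it. Processes for which the
OS reports no start time are skipped.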
>
>>     Note even if your coding is bulletproof, that allChildren() will
>>     also return PIDs of sub processes which are completely unrelated
>>     to you and Process.java - they could have been forked by some
>>     third party native code which just happens to run in parallel in
>>     the same process. There, you have no control over when it gets
>>     reaped. It might already have been reaped by the time
>>     allChildren() returns, and now the same PID got recycled as
>>     another, unrelated process.
>     Of course, the best case is for an application to spawn and manage
>     its own processes
>     and handle their proper termination.
>     The use cases for children/allChildren are focused on
>     supervisory/executive functions
>     that monitor a running system and can cleanup even in the case of
>     unexpected failures.
>
>     All management of processes is subject to OS limitations; if the
>     PID were from a completely
>     different process tree, the ordinary destroy/info functions would
>     not be available
>     unless the process was running as a privileged OS user (same as
>     any other native application).
>
>
> Could you explain this please? If both trees run under the same user, 
> why should I not be able to kill a process from a different tree?
I was considering the case of a different user. Only the OS access
controls apply, so if it was the same user the processes could be
controlled.
The ProcessHandle (PH) API does not provide more or less access than the OS.

Thanks, Roger

>>     If I am right, it would not be sufficient to state "There is no
>>     guarantee that a process is alive." - it may be alive but by now
>>     be a different process altogether. This makes "allChildren()"
>>     useless for many cases, because the returned information may
>>     already be obsolete the moment the function returns.
>     The caching of startTime can remove the ambiguity.
>
>
>>
>>     Of course I may be missing something here?
>>
>>     But if I got all that right and the sole purpose of allChildren()
>>     is to be able to kill them (or otherwise signal them), why not
>>     use process groups? Process groups would be the traditional way
>>     on POSIX platforms to handle process trees, and they are also
>>     available on Windows in the form of Job Objects.
>>
>>     Using process groups to signal sub process trees would be safe,
>>     would not rely on PID identity, and would be more efficient. Also
>>     way less coding. Also, it would be an old, established pattern -
>>     process groups have been around for a long time. Also, using
>>     process groups it is possible to break away from a group, so a
>>     program below you which wants to run as a daemon can do so by
>>     removing itself from the process group and thus escaping your kill.
>>
>>     On Windows we have Job objects, and I think there are enough
>>     similarities to POSIX process groups to abstract them into
>>     something platform independent.
>     Earlier discussions of process termination and exit value reaping
>     considered
>     using process groups but it became evident that the Java runtime
>     needed to
>     be very careful to not interfere with processes that might be
>     spawned and
>     controlled by native libraries and that process groups would only
>     increase
>     complexity and the interactions.
>
>
>     Thanks, Roger
>
>
> Thanks! Thomas
>