RFR 9: 8077350 Process API Updates Implementation Review

Fri Apr 17 06:40:31 UTC 2015

> On 16 apr 2015, at 21:01, Thomas Stüfe <thomas.stuefe at gmail.com> wrote:
> 
> Hi Roger,
> 
> thank you for your answer!
> 
> The reason I take an interest is not just theoretical. We (SAP) use our JVM
> for our test infrastructure and we had exactly the problem allChildren() is
> designed to solve: killing a process tree related to a specific test
> (similar to jtreg tests) in case of errors or hangs. We have test machines
> running large workloads of tests in parallel and we reach pid wraparound -
> depending on the OS - quite fast.
> 
> We solved this by adding process groups to Process.java and we are very
> happy with this solution. We are able to quickly kill a whole process tree,
> cleanly and completely, without ambiguity or risk to other tests. Of course
> we had to add this support as a "sideways hack" in order to not change the
> official Process.java interface. Therefore I was hoping that with JEP 102,
> we would get official support for process groups. Unfortunately, it seems
> the decision has already been made and we are too late in the discussion :(

Interestingly, we are hoping to use allChildren() to kill process trees in jtreg, which is exactly the use case you are describing. I haven’t tested the current approach in allChildren(), but your experience suggests that it will not be a perfect fit for that use case. In a previous test framework I was involved in, we also used process groups for this, with good results. This raises the question: if the current approach isn’t useful for our own testing purposes, when is it useful?

Thanks,
/Staffan

> 
> see my other comments inline.
> 
> On Sat, Apr 11, 2015 at 8:55 PM, Roger Riggs <Roger.Riggs at oracle.com> wrote:
> 
>> Hi Thomas,
>> 
>> Thanks for the comments.
>> 
>> On 4/11/2015 8:31 AM, Thomas Stüfe wrote:
>> 
>> Hi Roger,
>> 
>> I have a question about getChildren() and getAllChildren().
>> 
>> I assume the point of those functions is to implement point 4 of JEP 102
>> ("The ability to deal with process trees, in particular some means to
>> destroy a process tree."), by returning a collection of PIDs which are the
>> children of the process and then killing them?
>> 
>> Earlier versions included a killProcess tree method but it was recommended
>> to leave
>> the exact algorithm to kill processes to the caller.
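
To illustrate what leaving the kill algorithm to the caller could look like, here is a minimal sketch. It assumes the ProcessHandle API as it later shipped in Java 9, where the draft allChildren() discussed in this thread is named descendants(). The class KillTree and its method names are hypothetical. The sketch adds no protection of its own against PID reuse.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of the JDK.
class KillTree {
    /** Requests forcible termination of every handle; returns how many requests were accepted. */
    static int destroyAll(Stream<ProcessHandle> handles) {
        // Collect first so the listing is complete before any process is signalled.
        List<ProcessHandle> snapshot = handles.collect(Collectors.toList());
        int requested = 0;
        for (ProcessHandle h : snapshot) {
            if (h.destroyForcibly()) {
                requested++;
            }
        }
        return requested;
    }

    /**
     * Requests termination of all descendants of root, then of root itself.
     * Note: destroyForcibly() throws IllegalStateException if root is the current process.
     */
    static int destroyTree(ProcessHandle root) {
        int requested = destroyAll(root.descendants());
        if (root.destroyForcibly()) {
            requested++;
        }
        return requested;
    }
}
```

For a Process started through ProcessBuilder, a caller might invoke this as KillTree.destroyTree(process.toHandle()).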
>> 
>> 
>> However, I am not sure that this can be implemented in a safe way, at
>> least on UNIX, because - as Martin already pointed out - of PID recycling.
>> I do not see how you can prevent allChildren() from returning PIDs which
>> may already have been reaped and recycled when you use them later. How do you
>> prevent that?
>> 
>> Unless there is an extended time between getting the children and
>> destroying them the pids will still be valid.
>> 
> 
> Why? The child process may be reaped the instant you are done reading it
> from /proc, and its pid may be recycled by the OS right away and already
> point to another process when allChildren() returns. If a process lives
> about as long as it takes the system to wrap around to the same pid value,
> its pid could be recycled right after it is reaped, couldn't it? Sure, the
> longer you wait, the higher the chance of this happening, but it may happen
> right away.
> 
> As Martin said, we have had those races in the kill() code for a long time,
> but children()/allChildren() could make those errors more probable, because
> now more processes are involved, especially if you use allChildren() to kill
> a deep process tree. And there is nothing in the javadoc warning the user
> about this scenario. From time to time you would just happen to kill an
> unrelated process. Those problems are hard to debug.
> 
>> The technique of caching the start time can prevent that case; though it
>> has AFAIK not been a problem.
>> 
> 
> How would that work? Should the user, before issuing the kill, compare the
> start time of the process to kill with the cached start time?
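
One way to read the question above is as a check-then-act sequence on the caller's side. The following is a minimal sketch under assumptions: it uses the ProcessHandle API as it later shipped in Java 9 (descendants() in place of the draft allChildren(), and Info.startInstant() for the start time), and the class GuardedKill with its methods is hypothetical. The comparison narrows the window for hitting a recycled PID but does not close it, since the PID could still be reused between the check and the destroy() call.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of the JDK.
class GuardedKill {
    /** A PID together with the start time observed when the tree was listed. */
    static final class Snapshot {
        final long pid;
        final Optional<Instant> start;

        Snapshot(long pid, Optional<Instant> start) {
            this.pid = pid;
            this.start = start;
        }
    }

    /** Records PID and start time for every descendant of root. */
    static List<Snapshot> snapshot(ProcessHandle root) {
        List<Snapshot> result = new ArrayList<>();
        root.descendants().forEach(h ->
                result.add(new Snapshot(h.pid(), h.info().startInstant())));
        return result;
    }

    /** True only if a start time was cached and the current one equals it. */
    static boolean sameProcess(Optional<Instant> cached, Optional<Instant> current) {
        return cached.isPresent() && cached.equals(current);
    }

    /** Looks the PID up again and requests termination only if the start time still matches. */
    static boolean destroyIfSame(Snapshot s) {
        Optional<ProcessHandle> now = ProcessHandle.of(s.pid);
        if (!now.isPresent()) {
            return false; // process is already gone
        }
        ProcessHandle h = now.get();
        if (!sameProcess(s.start, h.info().startInstant())) {
            return false; // unknown or different start time: treat as a different process
        }
        return h.destroy();
    }
}
```

The sketch is deliberately conservative: when the start time is unavailable (for example because of OS permissions), it refuses to kill rather than guess.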
> 
>> Note that even if your code is bulletproof, allChildren() will also
>> return PIDs of sub processes which are completely unrelated to you and
>> Process.java - they could have been forked by some third party native code
>> which just happens to run in parallel in the same process. There, you have
>> no control over when a process gets reaped. It might already have been
>> reaped by the time allChildren() returns, with the same PID now recycled
>> for another, unrelated process.
>> 
>> Of course, the best case is for an application to spawn and manage its own
>> processes and handle their proper termination.
>> The use cases for children/allChildren are focused on supervisory/executive
>> functions that monitor a running system and can clean up even in the case of
>> unexpected failures.
>> 
>> All management of processes is subject to OS limitations; if the PID were
>> from a completely different process tree, the ordinary destroy/info
>> functions would not be available unless the process was running as a
>> privileged OS user (same as any other native application).
>> 
> 
> Could you explain this please? If both trees run under the same user, why
> should I not be able to kill a process from a different tree?
> 
>> If I am right, it would not be sufficient to state "There is no guarantee
>> that a process is alive." - it may be alive but by now be a different
>> process altogether. This makes "allChildren()" useless for many cases,
>> because the returned information may already be obsolete the moment the
>> function returns.
>> 
>> The caching of startTime can remove the ambiguity.
>> 
> 
>> 
>> Of course I may be missing something here?
>> 
>> But if I got all that right and the sole purpose of allChildren() is to
>> be able to kill them (or otherwise signal them), why not use process
>> groups? Process groups would be the traditional way on POSIX platforms to
>> handle process trees, and they are also available on Windows in the form of
>> Job Objects.
>> 
>> Using process groups to signal sub process trees would be safe, would
>> not rely on PID identity, and would be more efficient. Also way less
>> coding. Also, it would be an old, established pattern - process groups have
>> been around for a long time. Also, using process groups it is possible to
>> break away from a group, so a program below you which wants to run as a
>> daemon can do so by removing itself from the process group and thus escaping
>> your kill.
>> 
>> On Windows we have Job objects, and I think there are enough
>> similarities to POSIX process groups to abstract them into something
>> platform independent.
>> 
>> Earlier discussions of process termination and exit value reaping
>> considered using process groups but it became evident that the Java runtime
>> needed to be very careful to not interfere with processes that might be
>> spawned and controlled by native libraries and that process groups would
>> only increase complexity and the interactions.
>> 
> 
>> Thanks, Roger
>> 
>> 
> Thanks! Thomas