RFR: 7903217: jtreg could try killing descendants of stuck test, before timing out the test [v4]
Thomas Stuefe
stuefe at openjdk.org
Mon Nov 28 18:16:10 UTC 2022
On Mon, 22 Aug 2022 20:54:36 GMT, Gerard Ziemski <gziemski at openjdk.org> wrote:
>> This is an enhancement that aims to improve the robustness of the testing by attempting to quit any child processes (that are possibly stuck and are blocking the parent process from terminating) before timing out the target parent process.
>>
>> Aborting a process will flush its stdout/stderr streams, which will hopefully get captured in the test's log and provide additional clues as to why a test was timing out.
>>
>> This enhancement was locally tested with a handcrafted test that itself launched a child process that would get stuck on purpose and worked as intended.
>>
>> Hopefully, this will help debug issues such as [JDK-8286345](https://bugs.openjdk.org/browse/JDK-8286345)
>
> Gerard Ziemski has updated the pull request incrementally with one additional commit since the last revision:
>
> return exit code if we had to cancel any child processes
Hi Gerard,
> > Hi Gerard,
> > What you are trying to do is useful and appreciated; I have often wished for more info myself. But I'm also unsure whether handling this at the jtreg level is the best approach. I see both sides here and will therefore stay out of that discussion.
> > But another thing, in order for this to be useful, we would need thread dumps from hanging children too, if the children happen to be JVMs. Just hoping that abort(3) will nudge the children enough to vomit some output will not often work. E.g. if jcmd hangs, it is usually innocent: it waits for an answer from the attachee, and that one is stuck. It would be perfectly able to react to a thread dump and tell me as much.
> > So, before killing them, send each of them a SIGQUIT to get thread dumps and give them a bit of time to respond. And that raises more questions. If you do this, especially wholesale for all children, you could absolutely flood the jtr files with thread dumps from children, and analyzing them gets really confusing.
> > Not sure what a good solution could be. Let's see what others think.
> > Cheers, Thomas
>
> Thank you Thomas for your feedback.
>
> I really like your idea to SIGQUIT the child processes before force quitting them. And we can kill those intelligently to try and avoid leaving behind zombies.
>
> There is no progress on many (all?) of the related bugs that I listed earlier. My hope is that logging more info from all processes involved would lead to a breakthrough that would allow the analysis to unblock. More info is the key in those issues I believe, so if there is an increase in some noise, then I think it is a fair price to pay.
>
> Not sure what you mean by the "flooding the jtr" issue - do you mean that overly long jtr log files get trimmed?
Yes, but also them being just plain indecipherable.
> I really dislike when that happens. Unsure when we decided to do this, but nowadays disk space should be plentiful to handle full log files without cutting them down, I would hope.
I'm a bit skeptical about finding one error that explains all. You may have to dive down into each one and do a targeted analysis.
wrt [JDK-8286345](https://bugs.openjdk.org/browse/JDK-8286345), I would be curious to see the log files. Was the core from jcmd or from the main program? I would also add some more checks to the OutputAnalyzer for the jcmd process (it should have exited normally, etc.) and maybe do a diagnostic report via output.reportDiagnosticSummary();
Note that ThreadedMallocTestType.java also relies on there being no other allocations with the mtTest category. That should be true, but is still a bit fragile.
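To make the SIGQUIT-then-kill idea from earlier in the thread concrete, here is a rough sketch of what it could look like. This is not what the PR implements. The class `DescendantDumper` and its methods are hypothetical names, it assumes a POSIX `kill` binary on the PATH, and it only uses the standard `ProcessHandle` API. Note that SIGQUIT makes a HotSpot JVM print a thread dump, but a non-JVM child will normally just terminate on it.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;

// Hypothetical helper, not part of jtreg.
public class DescendantDumper {

    // Builds the external command used to deliver SIGQUIT.
    // Assumption: a POSIX "kill" utility is available on the PATH.
    static List<String> sigquitCommand(long pid) {
        return List.of("kill", "-QUIT", Long.toString(pid));
    }

    // Sends SIGQUIT to every live descendant of root, waits graceMillis
    // so JVM children can write their thread dumps, then forcibly kills
    // whatever is still alive. Returns the number of processes for which
    // forcible termination was successfully requested.
    static int dumpThenKill(ProcessHandle root, long graceMillis)
            throws IOException, InterruptedException {
        // Snapshot the descendants first, so the "kill" helper processes
        // started below are never part of the list.
        List<ProcessHandle> descendants =
                root.descendants().collect(Collectors.toList());

        for (ProcessHandle ph : descendants) {
            if (ph.isAlive()) {
                new ProcessBuilder(sigquitCommand(ph.pid()))
                        .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                        .redirectError(ProcessBuilder.Redirect.DISCARD)
                        .start()
                        .waitFor(5, TimeUnit.SECONDS);
            }
        }

        if (!descendants.isEmpty()) {
            Thread.sleep(graceMillis); // time for the dumps to be written
        }

        int killed = 0;
        for (ProcessHandle ph : descendants) {
            if (ph.isAlive() && ph.destroyForcibly()) {
                killed++;
            }
        }
        return killed;
    }
}
```

A timeout handler could call something like `DescendantDumper.dumpThenKill(testProcess.toHandle(), 2000)` before timing out the parent, where `testProcess` stands for whatever `Process` object the harness holds. The grace period and the decision to signal all descendants wholesale are exactly the parts that would need tuning to keep the jtr files readable.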
Cheers, Thomas
-------------
PR: https://git.openjdk.org/jtreg/pull/97