RFR [8024521] (process) Async close issues with Process InputStream

Martin Buchholz martinrb at google.com
Tue Oct 15 06:29:43 UTC 2013


Hi

I'm the author of this code.
I'm perfectly willing to believe the close/drain code is buggy.
It is difficult to get right.
I'm looking forward to the updated webrev.

Often the best way to handle a stress test is to make it run quickly by
default, but provide a system property to make it run as long as desired,
for manual ad-hoc testing.
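Such a knob might look roughly like the sketch below; the property name test.duration.ms, the 2-second default, and the iteration cap are illustrative, not taken from any actual test:

```java
// Sketch: a stress test that runs quickly by default but honors a system
// property for long manual ad-hoc runs. All names here are illustrative.
public class StressDuration {
    static long durationMillis() {
        // -Dtest.duration.ms=600000 would give a ten-minute soak run
        return Long.getLong("test.duration.ms", 2_000L);
    }

    public static void main(String[] args) {
        long deadline = System.currentTimeMillis() + durationMillis();
        int iterations = 0;
        while (System.currentTimeMillis() < deadline) {
            iterations++;                     // one stress iteration goes here
            if (iterations >= 100_000) break; // keep this sketch bounded
        }
        System.out.println("ran " + iterations + " iterations");
    }
}
```

Run as `java StressDuration` for the quick default, or with `-Dtest.duration.ms=600000` for a long soak.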


On Mon, Oct 14, 2013 at 9:04 AM, Ivan Gerasimov
<ivan.gerasimov at oracle.com> wrote:


> Thanks Paul!
>
> Yes, I like your solution better.
> I'll test it and send the updated webrev shortly.
>
> Sincerely yours,
> Ivan
>
>
>
> On 14.10.2013 15:33, Paul Sandoz wrote:
>
>> On Oct 14, 2013, at 11:36 AM, Ivan Gerasimov <ivan.gerasimov at oracle.com>
>> wrote:
>>
>>> Hi Paul,
>>>
>>> Thank you for looking at the webrev!
>>>
>>> On 14.10.2013 12:50, Paul Sandoz wrote:
>>>
>>>> Hi Ivan,
>>>>
>>>> Why do you require closeLock? Is it not sufficient to leave processExited
>>>> as is and just add:
>>>>
>>>> public synchronized void close() throws IOException {
>>>>      super.close();
>>>> }
>>>>
>>> I want to limit the impact on the code; that's why I synchronize close()
>>> only with processExited().
>>> If I made close() synchronized with all of the stream's methods, then I
>>> think it could introduce some unpredictable drawbacks.
>>>
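The dedicated-lock arrangement Ivan describes might look roughly like this; the class and lock names are illustrative and this is not the actual ProcessPipeInputStream code:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch: close() and processExited() synchronize on a dedicated lock, so
// close() is serialized against the drain but never blocks behind the
// stream's ordinary synchronized read methods.
class ProcessPipeSketch extends BufferedInputStream {
    private final Object closeLock = new Object();

    ProcessPipeSketch(InputStream in) {
        super(in);
    }

    @Override
    public void close() throws IOException {
        synchronized (closeLock) {
            super.close();
        }
    }

    void processExited() {
        synchronized (closeLock) {
            // drain-and-swap logic would run here, mutually exclusive
            // with close() but not with read()/available()
        }
    }

    public static void main(String[] args) throws IOException {
        ProcessPipeSketch s =
            new ProcessPipeSketch(new ByteArrayInputStream(new byte[] {42}));
        System.out.println(s.read());  // ordinary reads are unaffected
        s.processExited();
        s.close();
        System.out.println("closed");
    }
}
```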
>> Yeah, the close will block until the drain is complete, which might
>> also have some unpredictable drawback.
>>
>> A concern is there was already some code in place to check for async
>> close, which, if detected, throws away all the data and sets the underlying
>> filtered input stream to null. So all that effort draining is wasted.
>>
>> Another solution is to keep checking in the loop of drainInputStream for
>> an async close and fail-fast:
>>
>>          private InputStream drainInputStream(InputStream in)
>>                  throws IOException {
>>              int n = 0;
>>              int j;
>>              byte[] a = null;
>>              while ((j = in.available()) > 0) {
>>                  if (buf == null) // asynchronous close()?
>>                      return null; // stop reading and discard
>>                  a = (a == null) ? new byte[j] : Arrays.copyOf(a, n + j);
>>                  n += in.read(a, n, j);
>>              }
>>
>>              if (buf == null) // asynchronous close()?
>>                  return null; // discard
>>              else if (a == null)
>>                  return ProcessBuilder.NullInputStream.INSTANCE;
>>              else
>>                  return new ByteArrayInputStream(n == a.length ? a :
>>                          Arrays.copyOf(a, n));
>>          }
>>
>>          /** Called by the process reaper thread when the process exits. */
>>          synchronized void processExited() {
>>              // Most BufferedInputStream methods are synchronized, but
>>              // close() is not, and so we have to handle concurrent
>>              // racing close().
>>              try {
>>                  InputStream in = this.in;
>>                  if (in != null) {
>>                      InputStream stragglers = drainInputStream(in);
>>                      in.close();
>>                      this.in = stragglers;
>>                  }
>>              } catch (IOException ignored) {
>>                  // probably an asynchronous close().
>>              }
>>          }
>>
>> This still has the potential to look at the availability of the current
>> input before an async close, and to read from a different input after an
>> async close, but it will break out on the next loop check if there are
>> more bytes available to read; otherwise it will be caught by the final
>> check, so no munged data will be exposed.
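Pulled out of context, the fail-fast drain can be sketched standalone, with the asynchronous close() modeled as a volatile flag standing in for the buf == null check; the class and field names here are illustrative, not the actual UNIXProcess code:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Sketch of the fail-fast drain loop: bail out and discard as soon as an
// asynchronous close() is observed, instead of draining a possibly-reused fd.
class DrainSketch {
    volatile boolean closed;                 // stands in for (buf == null)

    byte[] drain(InputStream in) throws IOException {
        int n = 0;
        int j;
        byte[] a = null;
        while ((j = in.available()) > 0) {
            if (closed)
                return null;                 // stop reading and discard
            a = (a == null) ? new byte[j] : Arrays.copyOf(a, n + j);
            n += in.read(a, n, j);
        }
        if (closed)
            return null;                     // discard
        return (a == null) ? new byte[0]
                           : (n == a.length ? a : Arrays.copyOf(a, n));
    }

    public static void main(String[] args) throws IOException {
        DrainSketch s = new DrainSketch();
        byte[] data = "straggler".getBytes();
        System.out.println(new String(s.drain(new ByteArrayInputStream(data))));
        s.closed = true;                     // simulate an asynchronous close()
        System.out.println(s.drain(new ByteArrayInputStream(data)) == null);
    }
}
```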
>>
>>
>>
>>>> Plus there was already some logic checking if close was asynchronously
>>>> called (see BufferedInputStream.close):
>>>>
>>>>   372                         if (buf == null) // asynchronous close()?
>>>>   373                             this.in = null;
>>>>
>>>> and this would no longer be required
>>>>
>>> Yes, you're right and these two lines can be removed.
>>> I wouldn't remove the surrounding try {} catch (IOException) though, as
>>> the stream can still be closed asynchronously.
>>>
>>>> I am somewhat struggling to understand the test (not too familiar with
>>>> this area). It requires a more expert eye in this area than mine.
>>>>
>>>> IIUC you are trying to induce the output process fd of ExecLoop to
>>>> "flip" to another fd opened by OpenLoop.
>>>>
>>> Yes.
>>> I actually adapted a testcase provided by the reporter of the problem.
>>> I only changed the test in such a way that it can be run in automated
>>> mode.
>>>
>>> Below is the exact comment from the original testcase.
>>> It may be a good idea to include it in the regtest.
>>>
>>> /* The following code is a test case that demonstrates a close race in
>>> * ProcessPipeInputStream.
>>> *
>>> * A snippet of the offending code within ProcessPipeInputStream:
>>> *       // Called by the process reaper thread when the process exits.
>>> *       synchronized void processExited() {
>>> *           // Most BufferedInputStream methods are synchronized, but close()
>>> *           // is not, and so we have to handle concurrent racing close().
>>> *           try {
>>> *               InputStream in = this.in;
>>> *               if (in != null) {
>>> *                   byte[] stragglers = drainInputStream(in);
>>> *                   in.close();
>>> *
>>> * The comment even talks about the race. However, the code is not adequately
>>> * protecting against it. drainInputStream() (available() in particular) is
>>> * not threadsafe when a close happens in parallel. What's worse is that a
>>> * subsequent open() is likely to reassign the closed fd, so drainInputStream
>>> * ends up draining the wrong fd! This can cause OOMs, data corruption on
>>> * other fds, and other very bad things.
>>> *
>>> * Notes:
>>> *  - Test creates a large sparse file in /tmp/bigsparsefile
>>> *  - Test uses "/proc/meminfo" for simple input.
>>> *  - Invoke using "-Xmx128m" to ensure the file size of 1G is greater than
>>> *    the heap size
>>> *  - Test will run forever; if successful there will not be any OOM
>>> *    exceptions (on my test system they occur about 1 per minute)
>>> *  - You need multiple cores (I use 2 x Xeon E5620 = 8 cores, 16 threads).
>>> *
>>> * The basic premise of the test is to create a 1GB sparse file, then have
>>> * lots of threads opening and closing this file. In a loop within a separate
>>> * thread, use UNIXProcess to run a simple command. We read the output of the
>>> * command, make sure the process has exited, and then close the input stream.
>>> * When the bug happens, processExited() will use FIONREAD to figure out how
>>> * much data to drain, BUT the fd it is checking will be closed and reopened
>>> * (now referring to /tmp/bigsparsefile), so it will see the size as 1GB,
>>> * attempt to allocate a buffer of that size, and then OOM.
>>> */
>>>
>>>
>>>
>>>> The test concerns me since it is gonna use up loads of resources and
>>>> proceed for 10 minutes until deciding beyond reasonable doubt that there is
>>>> no issue.
>>>>
>>> When I ran the test across different machines (with the unpatched jdk),
>>> it reported from one to five races during 10 minutes.
>>> We can surely decrease the timeout at the price of a higher probability
>>> of a false positive result.
>>>
>> Unit-like tests should really run in under a minute, plus should
>> ideally be frugal in their resource consumption otherwise it could starve
>> other tests running in parallel. If we cannot create a more deterministic
>> test (not sure it is possible) then perhaps this test belongs to SQE and
>> not in the JDK tests?
>>
>> Paul.
>>
>
>



More information about the core-libs-dev mailing list