RFR: 8205330 InitialDirContext ctor sometimes throws NPE if the server has sent a disconnection

Chris Hegarty chris.hegarty at oracle.com
Mon Sep 10 15:32:29 UTC 2018


Vyom,

You are correct, the change that I proposed is minimal
and some fragility still remains.

I talked with Daniel, walked through the code one more
time, and I am now convinced that your original proposal
is correct ( since the pooling is at the level of
LdapClient, rather than at the level of Connection ).

Can I also suggest:
  1) Remove the null assignments, rather than commenting
     them out.
  2) Make Connection `conn` final.
  3) Remove all null checks for conn ( since it can no
     longer be null ) - see the sketch below.
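
Roughly something like this ( a simplified, untested sketch with
stand-in types, not the actual LdapClient code ):

    // stand-in for com.sun.jndi.ldap.Connection
    class Connection {
        int readTimeout;
        void cleanup(Object cause, boolean notifyParent) { /* close streams, etc. */ }
    }

    interface Pool {
        void removePooledConnection(LdapClient client);
    }

    class LdapClient {
        private final Connection conn;   // (2) assigned once, never nulled

        LdapClient(Connection conn) {
            this.conn = conn;
        }

        void forceClose(boolean cleanPool, Pool pool) {
            if (cleanPool) {
                pool.removePooledConnection(this);
            }
            conn.cleanup(null, false);   // (1) no `conn = null` afterwards
            // (3) callers need no `if (conn != null)` checks any more
        }
    }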

-Chris.

On 10/09/18 15:53, vyom tewari wrote:
> 
> 
> On Monday 10 September 2018 05:30 PM, Chris Hegarty wrote:
>> Vyom,
>>
>> The NPE originates from the finally block of the LdapClient
>> `authenticate` method. In the finally block there is an attempt
>> to restore the "read" timeout, since it was changed earlier in
>> `authenticate` to reflect the connect timeout value ( the
>> subsequent operation(s) are considered part of the connect phase ).
>>
>> An unsolicited message may close the connection during
>> authenticate, which it does here, and LdapClient.ldapBind throws
>> "javax.naming.CommunicationException: Request: 1 cancelled". The
>> finally block is then executed and the previous exception is
>> effectively thrown away when the NPE is encountered.
>>
>> If we agree that such behaviour is reasonable ( and I think it
>> most likely is ), then the finally block needs to be more
>> defensive - check that conn is not null. It makes no sense to
>> attempt to reset the read timeout if conn is null.
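>>
>> Schematically, the failing sequence is ( a simplified sketch of
>> `authenticate`, not the exact code ):
>>
>>     int readTimeout = conn.readTimeout;
>>     conn.readTimeout = connectTimeout;   // connect phase begins
>>     try {
>>         LdapResult res = ldapBind(/* bind request */);
>>         // server disconnects: ldapBind throws CommunicationException,
>>         // and forceClose nulls `conn` asynchronously meanwhile
>>         return res;
>>     } finally {
>>         conn.readTimeout = readTimeout;  // NPE if conn is now null,
>>     }                                    // masking the original exception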
>>
>> With the following change, then the CommunicationException will
>> be thrown from `authenticate`. I think this is the behaviour we
>> want, no?
>>
> Yes, this will definitely solve this NPE issue, but we may hit an 
> NPE in other places as well, because we are setting 'LdapClient.conn' 
> to null asynchronously.
> Vyom
>>
>> --- a/src/java.naming/share/classes/com/sun/jndi/ldap/LdapClient.java
>> +++ b/src/java.naming/share/classes/com/sun/jndi/ldap/LdapClient.java
>> @@ -297,7 +297,8 @@
>>              conn.setV3(isLdapv3);
>>              return res;
>>          } finally {
>> -            conn.readTimeout = readTimeout;
>> +            if (conn != null)
>> +                conn.readTimeout = readTimeout;
>>          }
>>      }
>>
>>
>> -Chris.
>>
>>
>> On 07/09/18 17:47, Daniel Fuchs wrote:
>>> Hi Vyom,
>>>
>>> Please see inline... I've elided some parts...
>>>
>>> On 07/09/2018 10:11, vyom tewari wrote:
>>>> On Tuesday 04 September 2018 08:13 PM, Daniel Fuchs wrote:
>>>>> So what we see here is actually a race between an open
>>>>> connection being retrieved from the pool, and the server
>>>>> deciding to close that connection.
>>>>> The connection/client is retrieved from the pool, then
>>>>> unfortunately closed by the server after having been selected
>>>>> for the next operation.
>>>>>
>>>> I checked the code and I did not find any problem. Both "get" and 
>>>> "removePooledConnection" are "synchronized", so there should not be 
>>>> any problem ( concurrency issue ) here.
>>>
>>> I'm referring to this pattern (in forceClose):
>>>
>>>              conn.cleanup(null, false);
>>>              conn = null;
>>>
>>>              if (cleanPool) {
>>>                  pcb.removePooledConnection(this);
>>>              }
>>>
>>> The connection is cleaned up, then nulled, and only then removed
>>> from the pool. I'm questioning whether the first thing to do would
>>> be to remove the connection from the pool, to reduce the time window
>>> in which the client could be retrieved from the pool by another
>>> thread while forceClose is in progress.
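>>>
>>> I.e. something like this ( untested, just to illustrate the
>>> reordering; whether to null conn at all is a separate question ):
>>>
>>>              if (cleanPool) {
>>>                  // take the client out of the pool first, so that no
>>>                  // other thread can select it while it is being torn down
>>>                  pcb.removePooledConnection(this);
>>>              }
>>>              conn.cleanup(null, false);
>>>              conn = null;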
>>>
>>> I am not sure what side effect this could have - as I haven't
>>> studied the code that retrieves the client from the pool,
>>> but I'm wondering whether that's worth exploring.
>>>
>>>> As you said, the server is closing the same connection which was 
>>>> retrieved from the pool, which is correct, but this should not throw 
>>>> an unintentional NPE.
>>>
>>> Agreed.
>>>
>>>> In LdapClient, we explicitly set "LdapClient.conn" to null 
>>>> asynchronously after we remove the connection from the pool, and 
>>>> that is what creates the NPE.
>>>>
>>>> LdapClient does not set `LdapClient.conn` to null if the connection 
>>>> is not pooled.
>>>
>>> Really? That's not what I see... Am I missing something?
>>>
>>>  >> [...]
>>>>> If so will leaving the 'conn' field assigned ensure that
>>>>> the retry happens, or will new InitialDirContext() fail
>>>>> with some other exception because the connection has
>>>>> been closed? Or does the code simply assume that asynchronous
>>>>> closing of the connection by the server can only happen while
>>>>> it's sitting idle in the pool?
>>>>>
>>>>   I don't know if retrying is feasible in 'InitialDirContext', but if 
>>>> we leave 'LdapClient.conn' assigned then we will get a 
>>>> "javax.naming.CommunicationException", which tells us that the 
>>>> underlying connection is closed. I am not 100% sure if this is the 
>>>> right approach, but if we set 'LdapClient.conn' to null asynchronously 
>>>> then we will hit the NPE.
>>>
>>> OK - that's good to know. I agree that CommunicationException is
>>> better than NPE. If so then not nulling the connection
>>> is at least an improvement.
>>> But this would deserve a comment in the code, I think.
>>> More below...
>>>
>>>>> I wonder if some other mechanism could be used to reduce
>>>>> and hopefully eliminate the time window in which the race
>>>>> could appear. For instance taking the connection out of
>>>>> the pool before closing it, instead of closing it before
>>>>> taking it out of the pool, etc...
>>>>>
>>>> Changing the order will not help. I think closing the same connection 
>>>> is not a problem, but setting `LdapClient.conn` to null asynchronously 
>>>> is causing the NPE.
>>>
>>> Yes - I agree that setting conn to null is problematic.
>>> I wonder why it's set to null. Maybe in an attempt to
>>> make it eligible for GC earlier?
>>>
>>> If that's not the issue, then I would suggest making it
>>> final (AFAICS it's only set to non-null once) and cleaning
>>> up the code that tests for conn == null.
>>>
>>> If GC is the issue, then the best you can do is probably
>>> only set it to null when the client is not pooled.
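>>>
>>> Roughly ( a sketch, assuming a hypothetical isPooled() helper
>>> that does not exist in the current code ):
>>>
>>>              conn.cleanup(null, false);
>>>              if (!isPooled()) {
>>>                  conn = null;   // help GC only when no pool can hand
>>>              }                  // this client out again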
>>>
>>> And if we could reduce the time window in which a thread
>>> could get a stale client from the pool, that might be good
>>> too?
>>>
>>> Is there any way we could somehow add a test that reproduces
>>> the NPE?
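>>>
>>> One rough idea ( untested; the fake server, the hard-coded bind
>>> response bytes, and the abrupt close standing in for a proper
>>> Notice of Disconnection are all assumptions, and the loop may
>>> not hit the race window deterministically ):
>>>
>>>     import java.io.OutputStream;
>>>     import java.net.ServerSocket;
>>>     import java.net.Socket;
>>>     import java.util.Hashtable;
>>>     import javax.naming.CommunicationException;
>>>     import javax.naming.Context;
>>>     import javax.naming.directory.InitialDirContext;
>>>
>>>     public class UnsolicitedDisconnectNPE {
>>>         // minimal BER-encoded LDAPv3 bind success response, messageID 1
>>>         static final byte[] BIND_RESPONSE = {
>>>             0x30, 0x0c, 0x02, 0x01, 0x01, 0x61, 0x07,
>>>             0x0a, 0x01, 0x00, 0x04, 0x00, 0x04, 0x00
>>>         };
>>>
>>>         public static void main(String[] args) throws Exception {
>>>             try (ServerSocket server = new ServerSocket(0)) {
>>>                 Thread t = new Thread(() -> serve(server));
>>>                 t.setDaemon(true);
>>>                 t.start();
>>>
>>>                 Hashtable<String, Object> env = new Hashtable<>();
>>>                 env.put(Context.INITIAL_CONTEXT_FACTORY,
>>>                         "com.sun.jndi.ldap.LdapCtxFactory");
>>>                 env.put(Context.PROVIDER_URL,
>>>                         "ldap://localhost:" + server.getLocalPort());
>>>                 env.put("com.sun.jndi.ldap.connect.pool", "true");
>>>
>>>                 // repeatedly bind, hoping to land in the window between
>>>                 // taking the client from the pool and the disconnect
>>>                 for (int i = 0; i < 100; i++) {
>>>                     try {
>>>                         new InitialDirContext(env).close();
>>>                     } catch (CommunicationException expected) {
>>>                         // acceptable outcome
>>>                     }   // an escaping NullPointerException is the bug
>>>                 }
>>>             }
>>>         }
>>>
>>>         static void serve(ServerSocket server) {
>>>             try {
>>>                 while (true) {
>>>                     try (Socket s = server.accept()) {
>>>                         s.getInputStream().read(new byte[1024]); // bind req
>>>                         OutputStream out = s.getOutputStream();
>>>                         out.write(BIND_RESPONSE);
>>>                         out.flush();
>>>                     } // closing here simulates the server disconnecting
>>>                 }
>>>             } catch (Exception done) { }
>>>         }
>>>     }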
>>>
>>> best regards,
>>>
>>> -- daniel
>>>
>>> [...]
>>>
>>>>>>> On 24/08/18 11:35, vyom tewari wrote:
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> Please review this simple fix below
>>>>>>>>
>>>>>>>> webrev: 
>>>>>>>> http://cr.openjdk.java.net/~vtewari/8205330/webrev0.0/index.html
>>>>>>>>
>>>>>>>> bugid: https://bugs.openjdk.java.net/browse/JDK-8205330
>>>>>>>>
>>>>>>>> This fix will resolve the race in LdapClient where we are 
>>>>>>>> explicitly setting LdapClient.conn to "null".
>>> [...]
> 

