RFR:8205330 InitialDirContext ctor sometimes throws NPE if the server has sent a disconnection

Mon Sep 10 12:00:10 UTC 2018

Vyom,

The NPE is originating from the finally block of the LdapClient
`authenticate` method. In the finally block there is an attempt
to restore the "read" timeout, since it was changed earlier in
`authenticate`, to reflect the connect timeout value, since the
subsequent operation(s) are considered part of the connect phase.

An unsolicited message may close the connection during
authenticate, which is does and LdapClient.ldapBind throws
"javax.naming.CommunicationException: Request: 1 cancelled". The
finally block is then executed and the previous exception is
effectively throw away when the NPE is encountered.

If we agree that such behaviour is reasonable ( and I think it
most likely is ), then the finally block needs to be more
defensive - check that conn is not null. It makes no sense to
attempt to reset the read timeout if conn is null.

With the following change, then the CommunicationException will
be thrown from `authenticate`. I think this is the behaviour we
want, no?

--- a/src/java.naming/share/classes/com/sun/jndi/ldap/LdapClient.java
+++ b/src/java.naming/share/classes/com/sun/jndi/ldap/LdapClient.java
@@ -297,7 +297,8 @@
              conn.setV3(isLdapv3);
              return res;
          } finally {
-            conn.readTimeout = readTimeout;
+            if (conn != null)
+                conn.readTimeout = readTimeout;
          }
      }


-Chris.


On 07/09/18 17:47, Daniel Fuchs wrote:
> Hi Vyom,
> 
> Please see inline... I've elided some parts...
> 
> On 07/09/2018 10:11, vyom tewari wrote:
>> On Tuesday 04 September 2018 08:13 PM, Daniel Fuchs wrote:
>>> So what we see here is actually a race between an open
>>> connection being retrieved from the pool, and the server
>>> deciding to close that connection.
>>> The connection/client is retrieved from the pool, then
>>> unfortunately closed by the server after having been selected
>>> for the next operation.
>>>
>> I checked  the code and i did not found any problem. Both "get" and 
>> "removePooledConnection" are "synchronized" so there should not be any 
>> problem(concurrency issue) here.
> 
> I'm referring to this pattern (in forceClose):
> 
>              conn.cleanup(null, false);
>              conn = null;
> 
>              if (cleanPool) {
>                  pcb.removePooledConnection(this);
>              }
> 
> The connection is cleaned up, then nulled, then only removed
> from the pool. I'm questioning whether the first thing to do would
> be to remove the connection from the pool, to reduce the time window
> in which the client could be retrieved from the pool by another
> thread while forceClose is in process.
> 
> I am not sure what side effect this could have - as I haven't
> studied the code that retrieves the client from the pool,
> but I'm wondering whether that's worth exploring.
> 
>> As you said server is closing the same connection which is retrieved 
>> from pool which  is correct but this should not throw unintentional NPE.
> 
> Agreed.
> 
>> In LdapClient, we set the "Ldapclient.conn" to explicitly null 
>> asynchronously after we remove the connection from pool and which is 
>> creating the NPE.
>>
>> LdapClient does not set `Ldapclient.conn` to null if connection is not 
>> pooled.
> 
> Really? That's not what I see... Am I missing something?
> 
>  >> [...]
>>> If so will leaving the 'conn' field assigned ensure that
>>> the retry happens, or will new InitialDirContext() fail
>>> with some other exception because the connection has
>>> been closed? Or does the code simply assume that asynchronous
>>> closing of the connection by the server can only happen while
>>> it's sitting idle in the pool?
>>>
>>   I don't know if retrying is feasible in 'IntialDirContext' but if we 
>> leave 'LdapClient.conn' assigned then we will get 
>> "javax.naming.CommunicationException" which tells that, underline 
>> connection is closed. I am not 100% sure if this is right approach but 
>> if we set 'LdapClient.conn' to null asynchronously then we will hit NPE.
> 
> OK - that's good to know. I agree that CommunicationException is
> better than NPE. If so then not nulling the connection
> is at least an improvement.
> But this would deserve a comment in the code, I think.
> More below...
> 
>>> I wonder if some other mechanism could be used to reduce
>>> and hopefully eliminate the time window in which the race
>>> could appear. For instance taking the connection out of
>>> the pool before closing it, instead of closing it before
>>> taking it out of the pool, etc...
>>>
>> Changing order will not help, i think closing the  same connection is 
>> not a problem  but setting `LdapClient.conn'  to null asynchronously 
>> is causing NPE.
> 
> Yes - I agree that setting conn to null is problematic.
> I wonder why it's set to null. Maybe in an attempt to
> make it eligible for GC earlier?
> 
> If that's not the issue, then I would suggest making it
> final (AFAICS it's only set to non-null once) and cleaning
> up the code that tests for conn == null.
> 
> If GC is the issue, then the best you can do is probably
> only set it to null when the client is not pooled.
> 
> And if we could reduce the time window in which a thread
> could get a stale client from the pool that might be good
> too?
> 
> Is there anyway we could somehow add a test that reproduce
> the NPE?
> 
> best regards,
> 
> -- daniel
> 
> [...]
> 
>>>>> On 24/08/18 11:35, vyom tewari wrote:
>>>>>> Hi All,
>>>>>>
>>>>>> Please review this simple fix below
>>>>>>
>>>>>> webrev: 
>>>>>> http://cr.openjdk.java.net/~vtewari/8205330/webrev0.0/index.html
>>>>>>
>>>>>> bugid: https://bugs.openjdk.java.net/browse/JDK-8205330
>>>>>>
>>>>>> This fix will resolve the race in LdapClient  where we are 
>>>>>> explicitly making "null" to LdapClient.conn.
> [...]