RFR:8205330 InitialDirContext ctor sometimes throws NPE if the server has sent a disconnection

Mon Sep 10 14:53:10 UTC 2018


On Monday 10 September 2018 05:30 PM, Chris Hegarty wrote:
> Vyom,
>
> The NPE is originating from the finally block of the LdapClient
> `authenticate` method. In the finally block there is an attempt
> to restore the "read" timeout, since it was changed earlier in
> `authenticate`, to reflect the connect timeout value, since the
> subsequent operation(s) are considered part of the connect phase.
>
> An unsolicited message may close the connection during
> authenticate, which is does and LdapClient.ldapBind throws
> "javax.naming.CommunicationException: Request: 1 cancelled". The
> finally block is then executed and the previous exception is
> effectively throw away when the NPE is encountered.
>
> If we agree that such behaviour is reasonable ( and I think it
> most likely is ), then the finally block needs to be more
> defensive - check that conn is not null. It makes no sense to
> attempt to reset the read timeout if conn is null.
>
> With the following change, then the CommunicationException will
> be thrown from `authenticate`. I think this is the behaviour we
> want, no?
>
Yes , this will definitely solve this NPE issue but we may hit NPE other 
places as well because we are setting 'LdapClient.com' to null 
asynchronously.
Vyom
>
> --- a/src/java.naming/share/classes/com/sun/jndi/ldap/LdapClient.java
> +++ b/src/java.naming/share/classes/com/sun/jndi/ldap/LdapClient.java
> @@ -297,7 +297,8 @@
>              conn.setV3(isLdapv3);
>              return res;
>          } finally {
> -            conn.readTimeout = readTimeout;
> +            if (conn != null)
> +                conn.readTimeout = readTimeout;
>          }
>      }
>
>
> -Chris.
>
>
> On 07/09/18 17:47, Daniel Fuchs wrote:
>> Hi Vyom,
>>
>> Please see inline... I've elided some parts...
>>
>> On 07/09/2018 10:11, vyom tewari wrote:
>>> On Tuesday 04 September 2018 08:13 PM, Daniel Fuchs wrote:
>>>> So what we see here is actually a race between an open
>>>> connection being retrieved from the pool, and the server
>>>> deciding to close that connection.
>>>> The connection/client is retrieved from the pool, then
>>>> unfortunately closed by the server after having been selected
>>>> for the next operation.
>>>>
>>> I checked  the code and i did not found any problem. Both "get" and 
>>> "removePooledConnection" are "synchronized" so there should not be 
>>> any problem(concurrency issue) here.
>>
>> I'm referring to this pattern (in forceClose):
>>
>>              conn.cleanup(null, false);
>>              conn = null;
>>
>>              if (cleanPool) {
>>                  pcb.removePooledConnection(this);
>>              }
>>
>> The connection is cleaned up, then nulled, then only removed
>> from the pool. I'm questioning whether the first thing to do would
>> be to remove the connection from the pool, to reduce the time window
>> in which the client could be retrieved from the pool by another
>> thread while forceClose is in process.
>>
>> I am not sure what side effect this could have - as I haven't
>> studied the code that retrieves the client from the pool,
>> but I'm wondering whether that's worth exploring.
>>
>>> As you said server is closing the same connection which is retrieved 
>>> from pool which  is correct but this should not throw unintentional 
>>> NPE.
>>
>> Agreed.
>>
>>> In LdapClient, we set the "Ldapclient.conn" to explicitly null 
>>> asynchronously after we remove the connection from pool and which is 
>>> creating the NPE.
>>>
>>> LdapClient does not set `Ldapclient.conn` to null if connection is 
>>> not pooled.
>>
>> Really? That's not what I see... Am I missing something?
>>
>>  >> [...]
>>>> If so will leaving the 'conn' field assigned ensure that
>>>> the retry happens, or will new InitialDirContext() fail
>>>> with some other exception because the connection has
>>>> been closed? Or does the code simply assume that asynchronous
>>>> closing of the connection by the server can only happen while
>>>> it's sitting idle in the pool?
>>>>
>>>   I don't know if retrying is feasible in 'IntialDirContext' but if 
>>> we leave 'LdapClient.conn' assigned then we will get 
>>> "javax.naming.CommunicationException" which tells that, underline 
>>> connection is closed. I am not 100% sure if this is right approach 
>>> but if we set 'LdapClient.conn' to null asynchronously then we will 
>>> hit NPE.
>>
>> OK - that's good to know. I agree that CommunicationException is
>> better than NPE. If so then not nulling the connection
>> is at least an improvement.
>> But this would deserve a comment in the code, I think.
>> More below...
>>
>>>> I wonder if some other mechanism could be used to reduce
>>>> and hopefully eliminate the time window in which the race
>>>> could appear. For instance taking the connection out of
>>>> the pool before closing it, instead of closing it before
>>>> taking it out of the pool, etc...
>>>>
>>> Changing order will not help, i think closing the  same connection 
>>> is not a problem  but setting `LdapClient.conn'  to null 
>>> asynchronously is causing NPE.
>>
>> Yes - I agree that setting conn to null is problematic.
>> I wonder why it's set to null. Maybe in an attempt to
>> make it eligible for GC earlier?
>>
>> If that's not the issue, then I would suggest making it
>> final (AFAICS it's only set to non-null once) and cleaning
>> up the code that tests for conn == null.
>>
>> If GC is the issue, then the best you can do is probably
>> only set it to null when the client is not pooled.
>>
>> And if we could reduce the time window in which a thread
>> could get a stale client from the pool that might be good
>> too?
>>
>> Is there anyway we could somehow add a test that reproduce
>> the NPE?
>>
>> best regards,
>>
>> -- daniel
>>
>> [...]
>>
>>>>>> On 24/08/18 11:35, vyom tewari wrote:
>>>>>>> Hi All,
>>>>>>>
>>>>>>> Please review this simple fix below
>>>>>>>
>>>>>>> webrev: 
>>>>>>> http://cr.openjdk.java.net/~vtewari/8205330/webrev0.0/index.html
>>>>>>>
>>>>>>> bugid: https://bugs.openjdk.java.net/browse/JDK-8205330
>>>>>>>
>>>>>>> This fix will resolve the race in LdapClient  where we are 
>>>>>>> explicitly making "null" to LdapClient.conn.
>> [...]