[Update]: JEP 249 (OCSP Stapling for TLS)

Thu Jul 2 17:25:30 UTC 2015

On 7/2/2015 9:43 AM, Xuelei Fan wrote:
> On 7/2/2015 10:26 PM, Jamil Nimeh wrote:
>> On 07/02/2015 05:05 AM, Xuelei Fan wrote:
>>> sun/security/ssl/ServerHandshaker.java
>>> ======================================
>>> OCSP stapling only used for certificate-based server authentication at
>>> present.  I was wondering,  may be better to make a check before wrap
>>> the ServerHello OCSP extension and CertificateStatus message that
>>> Certificate message would be actually used.
>>>
>>> -------------------
>>>    860    for (CertStatusReqItemV2 csriv2 :
>>>    861        statReqExtV2.getRequestItems()) {
>>>    862        // Favor OCSP_MULTI over OCSP
>>>    863        statReqType = csriv2.getType();
>>>    864        statReqData = csriv2.getRequest();
>>>    865        if (statReqType == StatusRequestType.OCSP_MULTI) {
>>>    866            break;
>>>    867        }
>>>    868     }
>>>
>>> Client's preference (favorite choice first) should be respected. For
>>> type OCSP, line 863-864 prefer the last item rather than the client
>>> preference.  And line 863-864 does not consider whether the
>>> OCSPStatusRequest is suitable for this server or not, just select the
>>> last one.  As may result in ocsp stapling failure if the selected item
>>> is not for the right responder.
>> We talked about this back when the original design was done.  Back then
>> I had gone with client-ordering as the selection criteria and the
>> feedback I got was to favor ocsp_multi for those clients that can do it,
>> since it gives a more complete set of revocation information.  While the
>> client is ordering it's preference, I don't see in the spec any area
>> that says the server has to strictly adhere to the client preference.
>> If we want to favor ocsp_multi as a server we should be able to do that
>> legally.  As for the discussion where we talked about this, I don't
>> recall if it was a conversation or in email.  I'll have to check.  If
>> folks feel strongly about strictly honoring client ordering I can make
>> the change.
>>
> I think we are not talking about the same thing.
>
> Per RFC 6961:
>
>     The items in the list of CertificateStatusRequestItemV2 entries are
>     ordered according to the client's preference (favorite choice first).
>
> Considering the request list:
>     ocsp-1, ocsp-2, ocsp_multi-1, ocsp_multi-2
>
> I think we would want to select ocsp_multi-1, it's the expected
> behavior.  We prefer ocsp_multi.  It's OK.
Oh!  I see what you mean about not honoring the preference now.  The 
current implementation in this particular case would select 
ocsp_multi-1, since we break from the loop on first detection of ocsp_multi.
> Considering one more request list:
>     ocsp-1, ocsp-2
>
> I think the current implementation would select ocsp-2, rather than
> ocsp-1.  That's one of my concerns.  This update should be simple.
In this case you are correct, we would select ocsp-2.  So that's 
definitely inconsistent behavior given your previous example.  I will 
fix this to select the first instance of each type, but allow ocsp_multi 
to take precedence over ocsp.
>
> Let's consider one more example, the server cert is issued by Verisign
> Class 3.  The request list looks like:
>
>     ocsp_multi-1 (for Entrust OCSP responder),
>     ocsp_multi-2 (for Verisign),
>     ocsp_multi-3 (for Verisign Class 3),
>     ocsp_multi-4 (for GeoTrust)
>
> We are expected to use ocsp_multi-3 for the server cert.  I think your
> implementation would use ocsp_multi-1 (for Entrust) as the OCSP
> responder for Verisign Class 3 server cert.  That's another one of my
> concerns.
In a case like this, I think the distinguishing factor between those 
ocsp_multi ItemV2s is one or more ResponderIds.  Since we're not 
supporting server-side ResponderId selection in this first release I 
don't think we can distinguish between these.  I don't know how much of 
a real-world case this is, though when we do support ResponderId 
selection we would definitely need to handle this properly.
>
>
>> The current code snippet doesn't strictly prefer the last item.  If the
>> first item is OCSP_MULTI, then it will stop right there, but your point
>> is still true that we're not going on client ordering (as explained above).
>>> Line 865-867 does not consider whether the OCSPStatusRequest is suitable
>>> for this server or not, either.
>>>
>>> In general, client knows nothing about server certificates.  So client
>>> would offer responders it trusted.  The responders is not necessary
>>> suitable for the target server.  The server side should select a
>>> suitable one.  It is not necessary the first one or the last one.  See
>>> also section 2.2, RFC 6961:
>>>
>>>     "Servers that support a client's selection of responders using the
>>>      ResponderID field could implement this selection by matching the
>>>      responder ID values from the client's list with the ResponderIDs of
>>>      known OCSP responders, either by using a binary compare of the values
>>>      or a hash calculation and compare method."
>>>
>>> If the update of your implementation is pretty significant or it is
>>> unclean how we should move forward at present, I'm OK if you want to
>>> remove the support of client specified responders.
>> As implied from the above paragraph servers aren't required to support
>> client ResponderId selection.  I don't have support for this feature
>> yet, neither from the client or the server.  The OCSPStatusRequest has
>> the hooks in it to populate ResponderIds because it was fairly easy to
>> do.  I would like to leave that code in, since providing that capability
>> to the client is something I want to do in the future, as well as handle
>> the server selection based on Responder Id.
> See above.
>
>>> sun/security/ssl/StatusResponseManager.java
>>> ===========================================
>>> line 207-213:
>>>
>>> // It is assumed that the caller has ordered the certs in the chain
>>> // from end-entity to trust anchor.  In order to create the CertId
>>> // necessary for OCSP requests the subject and issuer certs must
>>> // be present.
>>> - if (chain.length < 2) {
>>> + if (chain.length == 0) {
>>>       return Collections.emptyMap();
>>> }
>>>
>>> This assumption is not actually correct in general.  The trust anchor is
>>> not necessarily be included in the Certificate message.  See page 48,
>>> section 7.4.2, RFC 5246.
>> I believe this chain comes in from the KeyManager via
>> ServerHandshaker.setupPrivateKeyAndChain(), so it should be a complete
>> cert chain in the correct order (or as complete as the server can do, I
>> suppose).  The Certificate message is not a factor here.
> It may be a complete cert chain.  And a complete cert chain may not
> contain the trust anchor.  It depends on the key manager implementation.
>   Our implementation may provider the trust anchor, but others may not.
Fair enough.  In order to be able to craft an OCSP request for a cert 
you need the issuer key (otherwise you cannot make a CertId). So are you 
suggesting we try to build a certpath based on the last cert found in 
the "certs" field (set as part of setupPrivateKeyAndChain?)  Assuming it 
is not a trust anchor, I would think some kind of certpath would be 
returned and we could hopefully get the issuer cert from that.
>
>>> ----------------------------
>>>    436         List<ResponderId> responderIds;
>>>
>>> You don't actually use the client specified responders, do you?  It's OK
>>> to me if we don't support none-empty OCSPStatusRequest.  If we do not
>>> want to do that, I would suggest to remove the related implementation
>>> code for that.  As would simplify the job, I think.
>> No, not today.
> It's OK to me.
>
>> In the future we will.
> IMHO, I don't find the necessary to support this feature in the near
> future.  Maybe we can do it when we have some free cycles.
>
>> I don't believe Apache does
>> ResponderId selection either.  Not sure about other implementations.  I
>> know Apache will handle extension forwarding, which we also do (and can
>> opt not to do  with a property).  I don't want to remove the code
>> though.  Those are the hooks for the feature in the future.  Same for
>> the OCSPStatusRequest...it has a constructor that allows population of
>> extensions and ResponderIds. I'd like to leave this in, even if we're
>> not populating those fields from the client today.
> I would suggest remove the source code.  You can have the history in a
> webrev.  In an official product, we'd better not having unused code.
>
>>> -----------------------------
>>>    252    // Set a bunch of threads to go do the fetching
>>>    253    List<Future<StatusInfo>> resultList =
>>>    254        threadMgr.invokeAll(requestList, delay, unit);
>>>
>>> The current implementation of OCSP response getter, looks like for every
>>> cert in the chain, a fetch thread would be created to get the
>>> corresponding OCSP response.  In the fetch thread, if the response is
>>> cached, use it; otherwise, connect to OCSP server and fetch it.
>>>
>>> Performance would be a very big concern.  A connection may kick start
>>> many threads.  Fetch threads may be a bottleneck of the server. In
>>> general, a server may be expected to serve 10K+ connections, but may
>>> only be able to hit around 10K- threads.  It's likely to face DoS
>>> challenges if every ClientHello trigger 1+ fetch threads.
>> If a Server has to hit that many connections simultaneously, it's
>> probably already doing multi-threading to accomplish that goal, wouldn't
>> you think?  If it cannot hit 10K threads to begin with, it seems to me
>> like some system-level tuning needs to be done.
> We may be not on the same page again.  I was talking a threshold
> benchmark: how much clients a server can serve at the same time.
No, I understood you.
>
> Let's see an example, suppose the cert chain contains 3 certs, each
> connection takes one thread.  The system is able to hit 10K thread.
> Before your implementation, the system can accept 10K clients at the
> same time.  With the implementation, because each cert may open a
> thread, the system can only accept 2.5K clients.  Very bad performance
> impact.
>
>> Second, a new thread
>> isn't created for each client connection.  This is a thread pool, so a
>> core number of threads are created when the SRM is created, new ones are
>> made as needed, and old ones are reused whenever their task is finished
>> and goes into an idle state.  So we not always making a new thread.
>>
> I understand, but the threshold are get impacted.  Not much impact on
> spare system, but bad performance impact on busy system.
>
>> Further, time to bring the response back to the caller from the cache is
>> very fast, between 2-5 msec.  So there would most likely be some thread
>> reuse on a very busy server.  That very busy server is not going to be a
>> single-core/single-thread machine; it will be capable of servicing
>> multiple threads at the same time, so that should help aggregate
>> throughput.
>>
> Not enough.  10K client may request the connection at the same time, and
> the management of a large butch of threads (10K) is a big load for every
> system.  That's why AIO programming popular today.  We support AIO
> programming via SSLEngine, need to consider the requirements.
Hmmm...I see what you're saying.  I'm going to have to think on this one 
a bit and see what can be done in the SRM.
>
>> It of course takes significantly longer to bring the response back from
>> the network.  I didn't get exact numbers, but in an environment where
>> the responders are on the same machine as the server I could get them
>> back in about 150msec.  So in extremely rough terms it's network IO time
>> + 150msec for each fetch.
>>> Another performance concern is that when there is no cache, the client
>>> would have to wait for the server to fetch the response from OCSP
>>> servers.  It's easy to run into client timeout issue.
>> As clients do with other implementations.  If the server doesn't have
>> the response, it has to get it and the client has to wait for it.  I see
>> no problem here.
> No problem here.  The problem is we may want to improve the performance
> if we could.
>
>> The connection delay isn't technically the time taken
>> to retrieve all responses, it is the time taken to retrieve all
>> responses or the server fetch timeout, whichever is smaller.  A server
>> can opt to wait 500 msec, 1000msec, or whatever the admin thinks is
>> appropriate given the distance of the responders and the client load.
>> If the timer is exceeded, the server will include any responses it has
>> and will not include the ones it is still retrieving.  If it cannot get
>> any, then it doesn't even assert status_requst[_v2].
>>
> 10 seconds is the OCSP service availability per Baseline Requirements of
> CA/Browser Forum.  If 200 msec is used, it means the OCSP stapling of
> the server may be not reliable as the OCSP server may not response yet
> when timeout.
200msec was just an example.  It's the admin's call what they want to 
set for their environment.  If there's a cache hit, even 200msec is 
plenty of time for the responses to be returned.  If it's a cache miss 
then the fetches will be initiated and eventually those responses will 
be received and cached, but for that connection at least no responses 
will be available by the time the server has to move on with the 
handshaking process.  If admins want to hold up a connection longer they 
can alter that property.

I don't know that there's a fool-proof way to get around that.  Even at 
start of day, there would be a window of time where the server is 
waiting for initial responses and clients could be connecting and no 
responses would be available.

Just out of curiosity, the CA/Browser foum's recommendation of 10 
seconds is between a client and OCSP responder...
>
>> What's more, even if the server has to get 3 responses, but only has one
>> by the timeout threshold the remaining two fetches will continue in the
>> background and complete, even if the connection has gone on. This way a
>> future connection coming in has a greater chance of getting a cache hit.
> It's a kind of unreliable for the present client.  We may want to make a
> improvement if possible.
>
> Thanks,
> Xuelei
>
>>> I may suggest to use a single thread to get responses from OCSP servers.
>>>    And any TLS connection would only get responses from cache.  The
>>> scenarios may looks like:
>>> 1. Per every SSLContext instance, create a scheduled single thread to
>>> fetch the response from OCSP servers, and cache them for each server
>>> certificate chains.  The server certificates can be got from key manager.
>>> 2. For every client request, look for the response from the cache.
>>> 3. if responder is not cached and if we want to support none-empty
>>> OCSPStatusRequest, fetch the response from OCSP servers in the same
>>> thread, and cache the result.  We may also want to add the task to the
>>> scheduled thread in #1.  Not create new thread for every connection is
>>> important for SSLEngine in some circumstances.
>>>
>>> #3 is also not performance friendly.  I think none-empty
>>> OCSPStatusRequest is not common in practice.
>> I'll have to think about your proposal a bit.  It's a significant change
>> from what I'm doing, but it is an interesting alternative.
>>
>> As stated above, we don't create new threads for each connection.
>> Threads are reused when they are idle/available.
>>
>> You are correct, non-empty OCSPStatusRequests are rare.  I've never seen
>> a client do one.  I've only looked at browsers, but none of them
>> populate the request.
>>> I will stop here for the first round code review.  Looking forward for
>>> the next round webrev.
>>>
>>> Thanks,
>>> Xuelei
>>>
>>>
>>> On 6/27/2015 11:06 PM, Jamil Nimeh wrote:
>>>> Hello all, I've posted an updated webrev based on comments I've received
>>>> so far:
>>>>
>>>> http://cr.openjdk.java.net/~jnimeh/reviews/8046321/webrev.1
>>>>
>>>> Thanks,
>>>> --Jamil
>>>>
>>>> On 06/18/2015 05:27 PM, Jamil Nimeh wrote:
>>>>> Hello all,
>>>>>
>>>>> I have a first cut at the OCSP stapling webrev posted for your review:
>>>>>
>>>>> JEP:https://bugs.openjdk.java.net/browse/JDK-8046321
>>>>> Webrev:http://cr.openjdk.java.net/~jnimeh/reviews/8046321/webrev.0/
>>>>>
>>>>> A couple items to note:
>>>>>
>>>>>     * I'm in the process of updating the JEP with some more details.  I
>>>>>       should be done with these changes by tonight (PDT).
>>>>>     * Missing are some of the TLS end-to-end tests.  These tests have
>>>>>       been coded and run outside the jtreg framework, but for some
>>>>>       reason things hang in jtreg.  I've included some of the supporting
>>>>>       classes that these tests will use (CertificateBuilder.java and
>>>>>       SimpleOCSPResponder.java) so folks could review those if they're
>>>>>       interested.  I will update the webrev and notify the list as soon
>>>>>       as I've got the tests working in jtreg.
>>>>>
>>>>> Thanks to everyone who has helped along the way.
>>>>>
>>>>> --Jamil
>>>>>
>>>>>