RFR: 8259886 : Improve SSL session cache performance and scalability [v2]
Xue-Lei Andrew Fan
xuelei at openjdk.java.net
Thu Feb 4 19:40:42 UTC 2021
On Thu, 4 Feb 2021 19:17:21 GMT, djelinski <github.com+30433125+djelinski at openjdk.org> wrote:
>>> the benchmark performance improvement is a trade-off between CPU and memory, by keeping expired entries while putting a new entry in the cache
>>
>> Not exactly. The memory use is capped by cache size. The patch is a trade off between the cache's hit/miss ratio and CPU; we will get faster cache access at the cost of more frequent cache misses.
>>
>> All calls to `put()` remove expired items from the front of the queue, and never perform a full scan. `get()` calls shuffle the queue, moving the accessed item to the back. Compare this to original code where `put()` only removed expired items when the cache overflowed, and scanned the entire cache.
>> Let me give some examples.
>> **Example 1**: insertions at a fast pace leading to cache overflows and no expirations. Here the new implementation improves performance. Consider a cache with size=4, timeout=10, and the following sequence of events:
>> T=1, put(1);
>> T=2, put(2);
>> T=3, put(3);
>> T=4, put(4);
>> Cache contents after these calls (same in old and new scenario). Queue order: least recently accessed items on the left, most recently accessed on the right. _K_ denotes cache key, _exp_ denotes entry expiration time and is equal to insertion time _T_ plus timeout:
>>
>> |K=1, exp=11|K=2, exp=12|K=3, exp=13|K=4, exp=14|
>>
>> If we now add another item to the queue, it will overflow. Here's where the implementations behave differently, but the outcome is identical: old one scans the entire list for expired entries; new one improves performance by ending the search for expired entries after encountering the first non-expired entry (which is the first entry in the above example). The end result is the same in both cases - oldest (least recently accessed) item is dropped:
>> T=5, put(5)
>>
>> |K=2, exp=12|K=3, exp=13|K=4, exp=14|K=5, exp=15|
>>
>> **Example 2**: insertions at a moderate pace, with interleaved reads. Here the new implementation improves performance, but at a possible cost of wasting cache capacity on expired entries. Consider a cache with size=4, timeout=7, and the following sequence of events:
>> T=1, put(1);
>> T=3, put(3);
>> T=5, put(5);
>> T=7, put(7);
>> T=7, get(1);
>> Cache contents after these calls:
>>
>> |K=3, exp=10|K=5, exp=12|K=7, exp=14|K=1, exp=8|
>>
>> `get(1)` operation moved item with K=1 to the back of the queue.
>>
>> If we wait for item with K=1 to expire and then add another item to the queue, it will overflow. Here's where the implementations behave differently, and the outcome is different: old one scans the entire list for expired entries, finds entry with K=1 and drops it; new one gives up after first non-expired entry (which is the first entry), and drops the first entry.
>>
>> So, when we perform:
>> T=9, put(9);
>>
>> Old implementation will get:
>> |K=3, exp=10|K=5, exp=12|K=7, exp=14|K=9, exp=16|
>>
>> New implementation will get:
>> |K=5, exp=12|K=7, exp=14|K=1, exp=8(expired)|K=9, exp=16|
>>
>> Note that:
>> - an attempt to retrieve expired item (i.e. `get(1)`) will immediately remove that item from cache, making room for other items
>> - retrieving a non-expired item will move it to the back of the queue, behind all expired items
>>
>> **Example 3**: insertions at a slow pace, where most items expire before queue overflows. Here the new implementation improves memory consumption. Consider a cache with size=4, timeout=1, and the following sequence of events:
>> T=1, put(1);
>> T=3, put(3);
>> T=5, put(5);
>> T=7, put(7);
>> Every cache item is expired at then point when a new one is added. Old implementation only removes expired entries when cache overflows, so all entries will still be there:
>>
>> |K=1, exp=2(expired)|K=3, exp=4(expired)|K=5, exp=6(expired)|K=7, exp=8|
>>
>> New implementation removes expired entries on every put, so after the last put only one entry is in the cache:
>>
>> |K=7, exp=8|
>>
>> After another put the old implementation will encounter a cache overflow and remove all expired items.
>>
>> Let me know if that helped.
>>
>>> add two more types of benchmark: get the entries and remove the entries
>>
>> Both these operations are constant-time, both before and after my changes. Do you expect to see some oddities here, or do we just want a benchmark that could be used to compare other implementations?
>>
>>> increase the size to some big scales, like 2M and 20M
>>
>> Can do. Do you think it makes sense to also benchmark the scenario where GC kicks in and collects soft references?
>>
>>> it may change the behavior of a few JDK components
>>
>> Of all uses of Cache, only `SSLSessionContextImpl` (TLS session cache), `StatusResponseManager` (OCSP stapling) and `LDAPCertStoreImpl` (I'm not familiar with that one) set expiration timeout; when the timeout is not set, the behavior is exactly the same as before.
>> `StatusResponseManager` is constantly querying the same keys, and is liberally sized, so I don't expect much of an impact.
>> TLS session cache changes may result in fewer session resumptions and more full handshakes; I expect the cache performance improvement to more than offset the CPU cycles lost on full handshakes.
>>
>> How do I file a CSR?
>>
>> Also, what do you think about the changes done in `Do not invalidate objects before GC` 5859a03 commit? They offer a minor performance improvement, but if clearing the values before GC is an important security feature of this cache, I'm prepared to drop that commit.
>
> Added benchmarks for get & remove. Added tests for 5M cache size. Switched time units to nanoseconds. Results:
> Benchmark (size) (timeout) Mode Cnt Score Error Units
> CacheBench.get 20480 86400 avgt 25 62.999 ? 2.017 ns/op
> CacheBench.get 20480 0 avgt 25 41.519 ? 1.113 ns/op
> CacheBench.get 204800 86400 avgt 25 67.995 ? 4.530 ns/op
> CacheBench.get 204800 0 avgt 25 46.439 ? 2.222 ns/op
> CacheBench.get 5120000 86400 avgt 25 72.516 ? 0.759 ns/op
> CacheBench.get 5120000 0 avgt 25 53.471 ? 0.491 ns/op
> CacheBench.put 20480 86400 avgt 25 117.117 ? 3.424 ns/op
> CacheBench.put 20480 0 avgt 25 73.582 ? 1.484 ns/op
> CacheBench.put 204800 86400 avgt 25 116.983 ? 0.743 ns/op
> CacheBench.put 204800 0 avgt 25 73.945 ? 0.515 ns/op
> CacheBench.put 5120000 86400 avgt 25 230.878 ? 7.582 ns/op
> CacheBench.put 5120000 0 avgt 25 192.526 ? 7.048 ns/op
> CacheBench.remove 20480 86400 avgt 25 39.048 ? 2.036 ns/op
> CacheBench.remove 20480 0 avgt 25 36.293 ? 0.281 ns/op
> CacheBench.remove 204800 86400 avgt 25 43.899 ? 0.895 ns/op
> CacheBench.remove 204800 0 avgt 25 43.046 ? 0.759 ns/op
> CacheBench.remove 5120000 86400 avgt 25 51.896 ? 0.640 ns/op
> CacheBench.remove 5120000 0 avgt 25 51.537 ? 0.536 ns/op
Thank you for the comment. The big picture is more clear to me now.
> Example 2:
> Old implementation will get:
> |K=3, exp=10|K=5, exp=12|K=7, exp=14|K=9, exp=16|
>
> New implementation will get:
> |K=5, exp=12|K=7, exp=14|K=1, exp=8(expired)|K=9, exp=16|
K=3 is not expired yet, but get removed, while K=1 is kept. This behavior change may cause more overall performance hurt than improving the cache put/get performance. For example, it need to grab the value remotely. A full handshake or OCSP status grabbing could counteract all the performance gain with the cache update.
> All calls to put() remove expired items from the front of the queue, and never perform a full scan. get() calls shuffle the queue, moving the accessed item to the back. Compare this to original code where put() only removed expired items when the cache overflowed, and scanned the entire cache.
I think the idea that put() remove expired items from the front of the queue is good. I was wondering if it is an option to have the get() method that removed expired items until the 1st un-expired item, without scan the full queue and change the order of the queue. But there is still an issue that the SoftReference may have clear an item, which may be still valid.
In general, I think the get() performance is more important than put() method, as get() is called more frequently. So we should try to keep the cache small if possible.
>> increase the size to some big scales, like 2M and 20M
>
> Can do. Do you think it makes sense to also benchmark the scenario where GC kicks in and collects soft references?
In the update, the SoftReference.clear() get removed. I'm not sure of the impact of the enqueued objects any longer. In theory, it could improve the memory use, which could counteract the performance gain in some situation.
> Also, what do you think about the changes done in Do not invalidate objects before GC 5859a03 commit?
See above, it is a concern to me that the soft reference cannot be cleared with this update.
> How do I file a CSR?
Could you edit the bug: https://bugs.openjdk.java.net/browse/JDK-8259886? In the more drop down menu, there is a "Create CSR" option. You can do it if we have an agreement about the solution and impact.
-------------
PR: https://git.openjdk.java.net/jdk/pull/2255
More information about the security-dev
mailing list