[10] RFR : 8186628 : SSL session cache can cause a scalability bottleneck

Peter Levart peter.levart at gmail.com
Mon Dec 4 15:52:34 UTC 2017


Hi Ivan,

On 11/23/2017 01:16 AM, Ivan Gerasimov wrote:
> Hi Peter!
>
> Thank you very much for looking into this!
>
>
> On 11/22/17 1:45 AM, Peter Levart wrote:
>> Hi Ivan,
>>
>> Here's my attempt to increase multithreaded scalability of Cache:
>>
>> http://cr.openjdk.java.net/~plevart/jdk10-dev/8186628_ssl_session_cache_scalability/webrev.01/ 
>>
>>
>> Haven't tested this yet, but I thought that since you already have 
>> relevant performance tests, you might want to try this, so I decided 
>> to release it as is.
> I plugged your implementation into the benchmark I had [*] and got 
> these numbers for the throughput:
>
> Original MemoryCache: 4,116,537 puts/second
> Your variant:  2,942,561 puts/second
>
> So, in the tested scenario performance degraded by 28%.

Hm. What kind of benchmark did you have? Can you share the source? Was 
this with multi-threaded concurrent puts or single-threaded? I'm 
surprised by the results, since I did some benchmarking myself and got 
entirely different ones. Here's what I did:

http://cr.openjdk.java.net/~plevart/jdk10-dev/8186628_ssl_session_cache_scalability/bench/

You can download the whole directory and just set up the JMH project 
over it. The main class is CacheBench. It uses a typical caching scenario:

     @Threads(4)
     @Benchmark
     public Object original() {
         // random key in [0, keyspan); keyspan controls the hit ratio
         Integer key = ThreadLocalRandom.current().nextInt(keyspan);
         Object val = origCache.get(key);
         if (val == null) {
             // miss: "compute" the value and publish it in the cache
             origCache.put(key, val = Boolean.TRUE);
         }
         return val;
     }

The cache hit-ratio is varied by adjusting the cache size and/or the 
range of distinct key values produced by the RNG.
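As a sketch of that idea (a hypothetical @Setup, not the benchmark's 
actual code): with a full cache of 'size' entries and uniformly random 
keys in [0, keyspan), the steady-state hit ratio is roughly 
size / keyspan, so the key range can be derived from the target ratio:

     @Setup
     public void setup() {
         // hypothetical derivation: size=1000, hitratio=50 -> keyspan=2000
         keyspan = (hitratio == 0)
                 ? Integer.MAX_VALUE                 // ~no key reuse => ~0% hits
                 : (int) ((long) size * 100 / hitratio);
     }

With this setup I get the following results: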

Benchmark            (hitratio)  (size)  (timeout)  Mode  Cnt      Score     Error  Units

CacheBench.original           0    1000      86400  avgt   10  18023,156 ± 434,710  ns/op
CacheBench.original           5    1000      86400  avgt   10  17645,688 ± 767,696  ns/op
CacheBench.original          50    1000      86400  avgt   10  10412,352 ± 216,208  ns/op
CacheBench.original          95    1000      86400  avgt   10   2275,257 ± 449,302  ns/op
CacheBench.original         100    1000      86400  avgt   10    542,321 ±  15,133  ns/op

CacheBench.patched1           0    1000      86400  avgt   10   1685,499 ±  24,257  ns/op
CacheBench.patched1           5    1000      86400  avgt   10   1633,203 ±  15,045  ns/op
CacheBench.patched1          50    1000      86400  avgt   10   1061,596 ±  12,312  ns/op
CacheBench.patched1          95    1000      86400  avgt   10    125,456 ±   3,426  ns/op
CacheBench.patched1         100    1000      86400  avgt   10     47,166 ±   3,140  ns/op

patched1 is mostly what I had in my proposed code, but with a fix in 
the put() method, which could previously install an entry into the 
cacheMap after some other thread had already removed it from the 
DL-list. The race is prevented by first installing a unique reservation 
value into the cacheMap, then adding the entry to the DL-list, and 
finally trying to replace the reservation value with the entry, 
cleaning up if that fails.
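To make the ordering concrete, here is a minimal compilable sketch of 
that reservation dance (CacheEntry, linkLast, unlink and invalidate are 
stand-ins for the webrev's actual helpers, not its exact API):

     import java.util.concurrent.ConcurrentHashMap;

     class ReservationSketch<K, V> {

         static final class CacheEntry<K, V> {
             final K key; final V value;
             CacheEntry<K, V> prev, next;            // DL-list links
             CacheEntry(K key, V value) { this.key = key; this.value = value; }
             void invalidate() { /* mark expired, clear state, ... */ }
         }

         private final ConcurrentHashMap<K, Object> cacheMap = new ConcurrentHashMap<>();

         @SuppressWarnings("unchecked")
         void put(K key, V value) {
             Object reservation = new Object();      // unique token, never a CacheEntry
             Object displaced = cacheMap.put(key, reservation);  // 1. reserve the slot
             if (displaced instanceof CacheEntry) {
                 removeEntry((CacheEntry<K, V>) displaced);      // evict what we displaced
             }
             CacheEntry<K, V> entry = new CacheEntry<>(key, value);
             linkLast(entry);                        // 2. link into the DL-list first
             if (!cacheMap.replace(key, reservation, entry)) {   // 3. publish the entry
                 unlink(entry);                      // lost a race with remove(): back out
                 entry.invalidate();
             }
         }

         private void linkLast(CacheEntry<K, V> e)    { /* synchronized DL-list append */ }
         private void unlink(CacheEntry<K, V> e)      { /* synchronized DL-list removal */ }
         private void removeEntry(CacheEntry<K, V> e) { unlink(e); e.invalidate(); }
     }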

If this is not enough, I also tried an approach with a lock-free 
DL-list: basically a modified ConcurrentLinkedDeque where intermediary 
nodes may be removed in constant time (important for fast removal of 
entries whose SoftReferences have been cleared). This variant is even 
more scalable than the lock-based DL-list:

CacheBench.patched2           0    1000      86400  avgt   10    774,356 ±   4,113  ns/op
CacheBench.patched2           5    1000      86400  avgt   10    754,222 ±   2,956  ns/op
CacheBench.patched2          50    1000      86400  avgt   10    434,969 ±   2,359  ns/op
CacheBench.patched2          95    1000      86400  avgt   10     98,352 ±   0,750  ns/op
CacheBench.patched2         100    1000      86400  avgt   10     47,692 ±   0,995  ns/op

Here's what such MemoryCache looks like in the JDK:

http://cr.openjdk.java.net/~plevart/jdk10-dev/8186628_ssl_session_cache_scalability/webrev.02/

I took the ConcurrentLinkedDeque by Doug Lea and Martin Buchholz and 
removed all public methods, keeping just the delicate internal 
mechanics. Node is an interface and can be implemented by different 
classes; BasicNode and SoftReferenceNode are provided (and are 
subclassed in MemoryCache). The public methods are tailored to the 
Cache use case and deal with Node<E>(s) rather than E(lement)s.
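As a rough picture of that Node abstraction (the interface and the two 
implementations are named above, but the members shown here are my 
guesses, not the webrev's exact API):

     import java.lang.ref.ReferenceQueue;
     import java.lang.ref.SoftReference;

     // Illustrative only - member names are assumptions.
     interface Node<E> {
         E element();                    // null once removed or GC-cleared
     }

     // Holds its element strongly; never cleared by the GC.
     class BasicNode<E> implements Node<E> {
         private final E element;
         BasicNode(E element) { this.element = element; }
         public E element() { return element; }
     }

     // Holds its element through a SoftReference, so the GC may clear it
     // under memory pressure; cleared nodes can then be unlinked in O(1).
     class SoftReferenceNode<E> extends SoftReference<E> implements Node<E> {
         SoftReferenceNode(E element, ReferenceQueue<E> queue) {
             super(element, queue);
         }
         public E element() { return get(); }
     }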


>
> Still, I think it's important to try to improve performance of the 
> MemoryCache and, if possible, remove the points of contention.

So what do you think of the benchmark results?

>
>>
>> Rough sketch of changes:
>> - replaced LinkedHashMap with ConcurrentHashMap
>> - CacheEntry(s) are additionally linked into a double-linked list. 
>> DL-list management is the only synchronization point (similar to the 
>> Cleaner API) and the rule is: first a new entry is linked into the 
>> DL-list, then it is put into the map - published. The same goes for 
>> removal: first an entry is unlinked from the DL-list, then, if 
>> successful, it is removed from the map and invalidated.
>> - changed the way expiry is handled so that the whole list never 
>> needs to be scanned.
>>
>> The code speaks for itself.
>>
> Your implementation is somewhat similar to what I had tried before 
> coming up with the proposal of the option to turn the cache off.
> I used ConcurrentHashMap + ConcurrentLinkedQueue to maintain the FIFO 
> order, but the performance was still a few percent less than that of 
> the original implementation.

Removing an intermediary element from ConcurrentLinkedQueue takes O(n) 
time. This is needed in emptyQueue() for all softly-reachable-and-cleared 
entries, and in remove() when the application removes an entry explicitly 
from the cache (for example, on SSL session termination). We need a 
double-linked list (like ConcurrentLinkedDeque) with constant-time 
removal of intermediary node(s).
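For illustration, removing an interior node from a double-linked list 
only touches its two neighbours, so it is O(1) once you hold the node 
(the webrev.02 variant does the equivalent splice lock-free, with CAS 
instead of plain writes):

     final class DList {
         static final class DNode<E> {
             E item;
             DNode<E> prev, next;
         }

         // Single-threaded illustration: both neighbours are reachable
         // from the node itself, so no traversal is needed.
         static <E> void unlink(DNode<E> node) {
             DNode<E> p = node.prev, n = node.next;
             if (p != null) p.next = n;      // splice predecessor over the node
             if (n != null) n.prev = p;      // splice successor back
             node.item = null;               // drop the element
             node.prev = node.next = null;   // mark removed / help the GC
         }
     }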

>
> That's really why I decided to split the issue:
> - first, provide an option to turn the cache off (that should be 
> easily backported and can provide immediate relief to the customers 
> that experience the scalability bottleneck);
> - second, continue work on the cache implementation improvements.

Maybe with a scalable Cache such a switch will not be needed. I can 
imagine that such a switch provides relief for the kinds of loads that 
occur in very special circumstances (like DoS attacks that don't 
re-use SSL sessions, for example), but in normal operation such a 
switch would degrade performance, wouldn't it?

The problem I see is that if you provide such a switch now, users may 
set it, and later, when they upgrade to a newer JDK with a scalable 
cache, not realize that they don't need it any more, and so keep 
paying for it unnecessarily...

Regards, Peter

>
>> Let me know if you find it useful and/or whether it solves the 
>> scalability bottleneck.
>>
> Yes, I think it's very useful!
>
> However, as I wrote above, I think the issue needs to be split into 
> two parts: an option to turn the caching off (which can be easily 
> backported) and improving the cache implementation (which can even 
> relax the requirements, such as the FIFO order or an absolutely hard 
> upper bound on the cache size).
>
> With kind regards,
> Ivan



