RFR: JDK-8031043 ClassValue's backing map should have a smaller initial size
Peter Levart
peter.levart at gmail.com
Thu Jun 9 14:42:59 UTC 2016
Hi,
The patch for this issue is a result of the following discussion on
mlvm-dev list:
http://mail.openjdk.java.net/pipermail/mlvm-dev/2016-May/006602.html
I propose the following improvement to the implementation of ClassValue API:
http://cr.openjdk.java.net/~plevart/misc/ClassValue.Alternative2/webrev.04.3/
This is a re-implementation of the ClassValue API that leverages a
ConcurrentMap for storage. ConcurrentHashMap was initially tested, but
using it regressed lookup performance by about 10%. So a simple
ConcurrentMap implementation (LinearProbeHashtable) was created which
has a smaller footprint, better CPU cache locality and consequently
faster lookup performance for the task at hand. The goal of this task
was to shrink the memory footprint of the ClassValue API. To see the
end effect, let's first look at the results of a benchmark that
simulates a re-deployment of an application in an app server where
several apps are deployed and each uses its own set of classes and
ClassValue(s):
http://cr.openjdk.java.net/~plevart/misc/ClassValue.Alternative2/ClassValueExpungeBench.java
Benchmark                                 (classValuesPerPart) (classesPerPart) (impl) (partitions) Mode Cnt     Score    Error  Units
ClassValueExpungeBench.redeployPartition                     8             1024   jdk9           16   ss  16    65.682 ±  1.485  ms/op
ClassValueExpungeBench.redeployPartition                     8             4096   jdk9           16   ss  16   247.040 ±  7.684  ms/op
ClassValueExpungeBench.redeployPartition                    64             1024   jdk9           16   ss  16   302.536 ± 27.750  ms/op
ClassValueExpungeBench.redeployPartition                    64             4096   jdk9           16   ss  16  1174.002 ± 77.183  ms/op
ClassValueExpungeBench.redeployPartition                     8             1024 pl04.3           16   ss  16    47.179 ±  1.436  ms/op
ClassValueExpungeBench.redeployPartition                     8             4096 pl04.3           16   ss  16   163.067 ±  8.118  ms/op
ClassValueExpungeBench.redeployPartition                    64             1024 pl04.3           16   ss  16    67.581 ±  1.718  ms/op
ClassValueExpungeBench.redeployPartition                    64             4096 pl04.3           16   ss  16   240.458 ±  6.616  ms/op
A by-product of this simulation is a heap dump histogram taken after the
last warmup iteration when all "applications" are deployed:
top-10 classes heap dump for 64 ClassValue(s) x 4096 Class(es) x 16
partitions
jdk9:
 num   #instances     #bytes  class name (module)
-------------------------------------------------------
   1:     4194309  167772360  java.util.WeakHashMap$Entry (java.base@9-internal)
   2:     4195329  134250528  java.lang.ClassValue$Entry (java.base@9-internal)
   3:       65539   34603248  [Ljava.util.WeakHashMap$Entry; (java.base@9-internal)
   4:       65537   34603032  [Ljava.lang.ClassValue$Entry; (java.base@9-internal)
   5:       67301    7552448  java.lang.Class (java.base@9-internal)
   6:       65536    4194304  java.lang.ClassValue$ClassValueMap (java.base@9-internal)
   7:       65552    2097664  java.lang.ref.ReferenceQueue (java.base@9-internal)
   8:       67736    1676344  [Ljava.lang.Object; (java.base@9-internal)
   9:       66349    1116848  [I (java.base@9-internal)
  10:       65554    1048864  java.lang.ref.ReferenceQueue$Lock (java.base@9-internal)
-------------------------------------------------------
                   388915640 == 370 MiB
pl04.3:
 num   #instances     #bytes  class name (module)
-------------------------------------------------------
   1:      133274   69833848  [Ljava.lang.Object; (java.base@9-internal)
   2:       67300    7552376  java.lang.Class (java.base@9-internal)
   3:       65536    2097152  java.lang.ClassValue$ClassValueMap (java.base@9-internal)
   4:       65536    2097152  java.lang.ClassValue$ClassValueMap$WeakNode (java.base@9-internal)
   5:       66349    1116848  [I (java.base@9-internal)
   6:       65991    1055856  java.lang.Object (java.base@9-internal)
   7:        7434     696584  [B (java.base@9-internal)
   8:        1551     299680  [Ljava.lang.Class; (java.base@9-internal)
   9:        2447     176184  java.lang.reflect.Field (java.base@9-internal)
  10:        7154     171696  java.lang.String (java.base@9-internal)
-------------------------------------------------------
                    85097376 == 81 MiB
The footprint is reduced by more than 4 times. The main improvement of
this implementation is that, to support M Class(es) x N ClassValue(s),
the only part of the data structure with an O(M*N) space footprint is
the backing array(s) of LinearProbeHashtable. No less than 3 and no
more than 6 Object slots are used per Class x ClassValue association
(4 on average). Other supporting parts of the data structure are
either O(M), O(N) or O(1).
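As a back-of-envelope cross-check of these numbers against the pl04.3 histogram above (the 4-byte reference size, i.e. compressed oops, is my assumption; the dump itself does not state it):

```java
// Rough cross-check of the [Ljava.lang.Object; entry in the pl04.3
// histogram above, assuming 4-byte references (compressed oops) --
// an assumption on my part, not stated in the dump.
class FootprintEstimate {
    public static void main(String[] args) {
        long associations = 64L * 4096 * 16; // ClassValues x Classes x partitions
        long slotsPerAssoc = 4;              // average of the 3..6 slots quoted above
        long bytesPerSlot = 4;               // one compressed oop per slot
        long bytes = associations * slotsPerAssoc * bytesPerSlot;
        System.out.println(bytes);           // 67108864, i.e. 64 MiB
    }
}
```

That 64 MiB estimate lands in the same ballpark as the measured 69,833,848 bytes of Object[] in the pl04.3 histogram, the remainder plausibly being array headers and spare capacity.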
The lookup time has improved too with this patch (slightly):
http://cr.openjdk.java.net/~plevart/misc/ClassValue.Alternative2/ClassValueBench.java
http://cr.openjdk.java.net/~plevart/misc/ClassValue.Alternative2/webrev.04.3.bench_results.pdf
As to the LinearProbeHashtable, it has been updated from initial version
to implement the whole ConcurrentMap interface. Not much was needed for
that (~150 lines of heavy commented code) comparing to the
implementation in webrev.04.2
(http://cr.openjdk.java.net/~plevart/misc/ClassValue.Alternative2/webrev.04.2/).
The end effect was that I was able to test the correctness of the
implementation using MOAT tests. Other changes to LinearProbeHashtable
from webrev.04.2 (for those following the discussion on mlvm-dev) are:
- the lookup logic that was repeated in each and every method was
factored-out into a lookup() private method.
- further benchmarking revealed that it was not beneficial to have
a 'skip' field in tombstone objects. Reading this field to optimize
skipping over a consecutive run of tombstones reduces the cache
locality of the algorithm. On average it is faster to just skip one
tombstone at a time and to use a singleton tombstone object, so that
only a reference comparison is needed.
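For illustration, here is a minimal sketch of linear-probe lookup with such a singleton tombstone (the names and table layout are my assumptions for the example, not the actual patch):

```java
// Hypothetical sketch of the probing scheme described above. Keys
// live at even indexes, values at the following odd indexes; a single
// shared TOMBSTONE marks removed keys and is skipped one slot-pair at
// a time, using only a reference comparison. The sketch assumes the
// table always keeps at least one null (never-used) key slot.
class ProbeSketch {
    static final Object TOMBSTONE = new Object();

    // Returns the index of the key slot, or -1 if the key is absent.
    static int lookup(Object[] tab, Object key) {
        int mask = tab.length - 1;            // length is a power of two
        int i = (key.hashCode() << 1) & mask; // even start slot
        while (true) {
            Object k = tab[i];
            if (k == null) return -1;         // empty slot ends the probe
            if (k != TOMBSTONE && key.equals(k)) return i;
            i = (i + 2) & mask;               // next key slot, wrapping
        }
    }

    public static void main(String[] args) {
        Object[] tab = {null, null, "a", 1, TOMBSTONE, null, "b", 2};
        System.out.println(lookup(tab, "a"));   // index of "a"'s key slot
        System.out.println(lookup(tab, "zzz")); // -1 once a null slot is hit
    }
}
```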
- The following spin loop was added to the get() method (and similar
loops to some other methods) to preserve the SC properties of the
methods:

    public V get(Object key) {
        Object[] tab = table;
        int i = lookup(tab, key);
        if (i >= 0) {
            Object value = getVolatile(tab, i + 1);
            if (value != null) {
                return (V) value;
            }
            // a possible race with removal of an entry can cause the
            // value to be observed cleared. If this happens, we know
            // the key has been / will be tombstoned just after that,
            // but we must wait for that to happen in order to preserve
            // the SC properties of this method; without this wait one
            // thread could even observe the following:
            //
            //     table.get(k) == null && table.containsKey(k)
            //
            while (getVolatile(tab, i) != TOMBSTONE) {
                Thread.onSpinWait();
            }
        }
        return null;
    }
...the write this spin-loop is waiting on is in the remove() method(s):

    tab[i + 1] = null;              // clear the value...
    tombstoned++;
    putVolatile(tab, i, TOMBSTONE); // ...publish the tombstone
The lookup performance of LinearProbeHashtable was put to the test by
comparing it with ConcurrentHashMap in the following benchmark:
http://cr.openjdk.java.net/~plevart/misc/LinearProbeHashTable/lpht_src.tgz
Results:
http://cr.openjdk.java.net/~plevart/misc/LinearProbeHashTable/lpht_bench.pdf
In this benchmark, several variations are tested. The one from
webrev.04.2 with stateful tombstones is called LinearProbeHashTable,
and the one with a singleton stateless tombstone, presented in this
RFR as webrev.04.3, is called LinearProbeHashTable1. The benchmark
shows that
even during concurrent removals and insertions, the successful lookup
performance (GetExistent) does not degrade much and is on-par with
ConcurrentHashMap. This benchmark also explores the suitability of using
LinearProbeHashTable for keys derived from MethodType.
A note to reviewers: It is useless to compare the old implementation
with the new one side by side, as they do not have much in common
(just the public API and javadocs). Although the net line count
increases with this patch, the complexity of the ClassValue
implementation is reduced, because it is a layered implementation: the
ClassValue API is implemented in terms of the well-known ConcurrentMap
API, and the correctness of the ConcurrentMap implementation is
verified by MOAT tests.
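For reviewers less familiar with it, the public contract that both the old and the new implementation must honor can be exercised like this (a standalone example, not part of the patch):

```java
// The public ClassValue API, unchanged by this patch: computeValue()
// is invoked lazily, its result is cached per Class for a given
// ClassValue instance, and subsequent get() calls return the cached
// value without recomputation.
class NameLength extends ClassValue<Integer> {
    @Override
    protected Integer computeValue(Class<?> type) {
        return type.getName().length();
    }

    public static void main(String[] args) {
        NameLength nl = new NameLength();
        System.out.println(nl.get(String.class));  // 16 ("java.lang.String")
        // the second call hits the cache and returns the same instance
        System.out.println(nl.get(String.class) == nl.get(String.class)); // true
    }
}
```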
Regards, Peter