RFR: 8247532: Records deserialization is slow
Peter Levart
peter.levart at gmail.com
Tue Jun 23 07:49:50 UTC 2020
Hi Chris, Claes,
Ok then, here's with benchmark included:
http://cr.openjdk.java.net/~plevart/jdk-dev/RecordsDeserialization/webrev.07/
I haven't been able to run the benchmark with "make test" though. I have
tried various ways to pass javac options to the build, like:
make test TEST='micro:org.openjdk.bench.java.io.RecordDeserialization' TEST_OPTS="VM_OPTIONS=--enable-preview --release=16"
...but javac doesn't seem to get them. Is there some secret option to
achieve that?
Otherwise, I simulated what would happen when there are more than 10
ObjectStreamClass layouts for the same class rapidly interchanging, so that
they push each other out of the cache, by temporarily setting the cache's
MAX_SIZE = 0. Note that this is the worst-case scenario:
Benchmark (length) Mode Cnt Score Error Units
RecordDeserializationBench.deserializeClasses 10 avgt 10 9.393 ± 0.287 us/op
RecordDeserializationBench.deserializeClasses 100 avgt 10 35.642 ± 0.977 us/op
RecordDeserializationBench.deserializeClasses 1000 avgt 10 293.769 ± 7.321 us/op
RecordDeserializationBench.deserializeRecords 10 avgt 10 15.335 ± 0.496 us/op
RecordDeserializationBench.deserializeRecords 100 avgt 10 211.427 ± 11.908 us/op
RecordDeserializationBench.deserializeRecords 1000 avgt 10 990.398 ± 26.681 us/op
This is using the JMH option '-gc true' to force a GC after each iteration of
the benchmark. Without it, I get a big (~4s) full-GC pause right in the
middle of a run with length=100:
Iteration 1: 528.577 us/op
Iteration 2: 580.404 us/op
Iteration 3: 4438.228 us/op
Iteration 4: 644.532 us/op
Iteration 5: 698.493 us/op
Iteration 6: 800.738 us/op
Iteration 7: 929.791 us/op
Iteration 8: 870.946 us/op
Iteration 9: 863.416 us/op
Iteration 10: 916.508 us/op
...so the results are a bit off because of that:
Benchmark (length) Mode Cnt Score Error Units
RecordDeserializationBench.deserializeClasses 10 avgt 10 8.263 ± 0.043 us/op
RecordDeserializationBench.deserializeClasses 100 avgt 10 33.406 ± 0.160 us/op
RecordDeserializationBench.deserializeClasses 1000 avgt 10 287.595 ± 0.960 us/op
RecordDeserializationBench.deserializeRecords 10 avgt 10 15.270 ± 0.080 us/op
RecordDeserializationBench.deserializeRecords 100 avgt 10 1127.163 ± 1771.892 us/op
RecordDeserializationBench.deserializeRecords 1000 avgt 10 2003.235 ± 227.159 us/op
Yes, there is quite a bit of GCing going on when the cache is thrashing.
Note that I haven't tuned GC in any way, and I'm running this on a
machine with 64 GiB of RAM, so the heap is allowed to grow quite big and G1
is used by default, I think.
This is still no worse than before the patch:
Benchmark (length) Mode Cnt Score Error Units
RecordDeserialization.deserializeClasses 10 avgt 10 8.382 ± 0.013 us/op
RecordDeserialization.deserializeClasses 100 avgt 10 33.736 ± 0.171 us/op
RecordDeserialization.deserializeClasses 1000 avgt 10 271.224 ± 0.953 us/op
RecordDeserialization.deserializeRecords 10 avgt 10 58.606 ± 0.446 us/op
RecordDeserialization.deserializeRecords 100 avgt 10 530.044 ± 1.752 us/op
RecordDeserialization.deserializeRecords 1000 avgt 10 5335.624 ± 44.942 us/op
...since caching of the adapted method handle for multiple objects within a
single stream is still effective.
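
For clarity, here is a rough sketch (in no way the actual webrev code;
names like ADAPTED_CTR_CACHE and layoutKey() are invented for illustration)
of what that two-level caching amounts to: a shared cache keyed by the
stream field layout, plus a per-serialization-session ObjectStreamClass
field, so only the first record of a given shape in a stream pays for the
ConcurrentHashMap lookup:

    import java.lang.invoke.MethodHandle;
    import java.util.concurrent.ConcurrentHashMap;

    final class StreamClassSketch {

        // shared cache: stream field layout -> adapted constructor MH
        // (the size bounding / MAX_SIZE logic is omitted from this sketch)
        private static final ConcurrentHashMap<Object, MethodHandle> ADAPTED_CTR_CACHE =
            new ConcurrentHashMap<>();

        // per-serialization-session cache; this instance is private to one
        // stream, so a plain field is enough
        private MethodHandle adaptedCtr;

        MethodHandle recordConstructor() {
            MethodHandle mh = adaptedCtr;
            if (mh == null) {
                // first record of this shape in the stream: do the
                // (comparatively expensive) shared-cache lookup and remember
                // the result locally for the rest of the stream
                mh = ADAPTED_CTR_CACHE.computeIfAbsent(layoutKey(), k -> adaptConstructor());
                adaptedCtr = mh;
            }
            return mh;
        }

        private Object layoutKey() {
            // hypothetical: a key derived from the record class and the
            // stream field layout
            return this;
        }

        private MethodHandle adaptConstructor() {
            // hypothetical: adapt the canonical constructor to the stream
            // field layout
            throw new UnsupportedOperationException("illustration only");
        }
    }
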
I doubt there will ever be more than 10 variants/versions of the same
record class deserialized by the same VM in rapid succession, so I don't
think this will be an issue in practice. We could add a system property to
control the MAX_SIZE of the cache if you think it is needed.
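
If we did, it could be something as simple as the following (the property
name below is just a placeholder I made up, not a proposal):

    // hypothetical sketch: make the cache bound configurable, defaulting to
    // the current hard-coded value of 10 (real JDK code would likely read
    // the property in a privileged block during initialization)
    private static final int MAX_SIZE =
        Integer.getInteger("jdk.serialization.recordCacheSize", 10);
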
Regards, Peter
On 6/22/20 1:04 AM, Claes Redestad wrote:
> Hi Peter,
>
> patch and results look great!
>
> My only real comment on this is that I think the microbenchmark would be
> a valuable contribution, too.
>
> It'd also be interesting to explore how poor performance would become if
> we'd hit the (artificial) 11 layouts limit, e.g., by cycling through
> 10, 11, or 12 different shapes.
>
> /Claes
>
> On 2020-06-21 19:16, Peter Levart wrote:
>> Hi,
>>
>>
>> When re-running the benchmark [1] with different lengths of
>> serialized arrays of records, I found that, compared to classical
>> classes, lookup into the cache of adapted method handles starts to
>> show when the length of array is larger (# of instances of same
>> record type deserialized in single stream). Each record deserialized
>> must lookup the method handle in a ConcurrentHashMap:
>>
>>
>> Benchmark (length) Mode Cnt Score Error Units
>> RecordSerializationBench.deserializeClasses 10 avgt 10 8.088 ± 0.013 us/op
>> RecordSerializationBench.deserializeClasses 100 avgt 10 32.171 ± 0.324 us/op
>> RecordSerializationBench.deserializeClasses 1000 avgt 10 279.762 ± 3.072 us/op
>> RecordSerializationBench.deserializeRecords 10 avgt 10 9.011 ± 0.027 us/op
>> RecordSerializationBench.deserializeRecords 100 avgt 10 33.206 ± 0.514 us/op
>> RecordSerializationBench.deserializeRecords 1000 avgt 10 325.137 ± 0.969 us/op
>>
>>
>> ...so keeping the correctly shaped adapted method handle in the
>> per-serialization-session ObjectStreamClass instance [2] starts to
>> make sense:
>>
>>
>> Benchmark (length) Mode Cnt Score Error Units
>> RecordSerializationBench.deserializeClasses 10 avgt 10 8.681 ± 0.155 us/op
>> RecordSerializationBench.deserializeClasses 100 avgt 10 32.496 ± 0.087 us/op
>> RecordSerializationBench.deserializeClasses 1000 avgt 10 279.014 ± 1.189 us/op
>> RecordSerializationBench.deserializeRecords 10 avgt 10 8.537 ± 0.032 us/op
>> RecordSerializationBench.deserializeRecords 100 avgt 10 31.451 ± 0.083 us/op
>> RecordSerializationBench.deserializeRecords 1000 avgt 10 250.854 ± 2.772 us/op
>>
>>
>> With that, more objects means advantage over classical classes
>> instead of disadvantage.
>>
>>
>> [1]
>> http://cr.openjdk.java.net/~plevart/jdk-dev/RecordsDeserialization/RecordSerializationBench.java
>>
>>
>> [2]
>> http://cr.openjdk.java.net/~plevart/jdk-dev/RecordsDeserialization/webrev.06/
>>
>>
>>
>> Regards, Peter
>>
>>