RFR: 8247532: Records deserialization is slow

Tue Jun 23 08:06:30 UTC 2020

Hi,

On 2020-06-23 09:49, Peter Levart wrote:
> Hi Chris, Claes,
> 
> 
> Ok then, here's with benchmark included:
> 
> 
> http://cr.openjdk.java.net/~plevart/jdk-dev/RecordsDeserialization/webrev.07/
> 
> 
> I haven't been able to run the benchmark with "make test" though. I have 
> tried various ways to pass javac options to build like:
> 
> 
> make test TEST='micro:org.openjdk.bench.java.io.RecordDeserialization' TEST_OPTS="VM_OPTIONS=--enable-preview --release=16"
>
> 
> ...but javac doesn't seem to get them. Is there some secret option to 
> achieve that?

Hmm, we might as well have the microbenchmarks build with
--enable-preview on by default. Try this:

diff -r 52741f85bf23 make/test/BuildMicrobenchmark.gmk

--- a/make/test/BuildMicrobenchmark.gmk	Tue Jun 23 09:54:42 2020 +0200
+++ b/make/test/BuildMicrobenchmark.gmk	Tue Jun 23 09:59:29 2020 +0200
@@ -93,7 +93,7 @@
      DISABLED_WARNINGS := processing rawtypes cast serial, \
      SRC := $(MICROBENCHMARK_SRC), \
      BIN := $(MICROBENCHMARK_CLASSES), \
-    JAVA_FLAGS := --add-modules jdk.unsupported --limit-modules 
java.management, \
+    JAVA_FLAGS := --enable-preview --add-modules jdk.unsupported 
--limit-modules java.management, \
  ))

  $(BUILD_JDK_MICROBENCHMARK): $(JMH_COMPILE_JARS)

> 
> 
> Otherwise, I simulated what would happen when there are more then 10 
> ObjectStreamClass layouts for same class rapidly interchanging, so that 
> they push each other out of the cache, by temporarily setting cache's 
> MAX_SIZE = 0. Note that this is worst case scenario:
> 
> 
> Benchmark                                      (length)  Mode  Cnt    Score    Error  Units
> RecordDeserializationBench.deserializeClasses        10  avgt   10    9.393 ±  0.287  us/op
> RecordDeserializationBench.deserializeClasses       100  avgt   10   35.642 ±  0.977  us/op
> RecordDeserializationBench.deserializeClasses      1000  avgt   10  293.769 ±  7.321  us/op
> RecordDeserializationBench.deserializeRecords        10  avgt   10   15.335 ±  0.496  us/op
> RecordDeserializationBench.deserializeRecords       100  avgt   10  211.427 ± 11.908  us/op
> RecordDeserializationBench.deserializeRecords      1000  avgt   10  990.398 ± 26.681  us/op
> 
> 
> This is using JMH option '-gc true' to force GC after each iteration of 
> benchmark. Without it, I get a big (~4s) full-GC pause just in the 
> middle of a run with length=100:
> 
> 
> Iteration   1: 528.577 us/op
> Iteration   2: 580.404 us/op
> Iteration   3: 4438.228 us/op
> Iteration   4: 644.532 us/op
> Iteration   5: 698.493 us/op
> Iteration   6: 800.738 us/op
> Iteration   7: 929.791 us/op
> Iteration   8: 870.946 us/op
> Iteration   9: 863.416 us/op
> Iteration  10: 916.508 us/op
> 
> 
> ...so results are a bit off because of that:
> 
> 
> Benchmark                                      (length)  Mode  Cnt     Score      Error  Units
> RecordDeserializationBench.deserializeClasses        10  avgt   10     8.263 ±    0.043  us/op
> RecordDeserializationBench.deserializeClasses       100  avgt   10    33.406 ±    0.160  us/op
> RecordDeserializationBench.deserializeClasses      1000  avgt   10   287.595 ±    0.960  us/op
> RecordDeserializationBench.deserializeRecords        10  avgt   10    15.270 ±    0.080  us/op
> RecordDeserializationBench.deserializeRecords       100  avgt   10  1127.163 ± 1771.892  us/op
> RecordDeserializationBench.deserializeRecords      1000  avgt   10  2003.235 ±  227.159  us/op
> 
> 
> Yes, there is quite a bit of GCing going on when cache is thrashing. 
> Note that I haven't tuned GC in any way and I'm running this on a 
> machine with 64GiB of RAM so heap is allowed to grow quite big and G1 is 
> used by default I think.
> 
> 
> This is still no worse than before the patch:
> 
> 
> Benchmark                                 (length)  Mode  Cnt     Score    Error  Units
> RecordDeserialization.deserializeClasses        10  avgt   10     8.382 :  0.013  us/op
> RecordDeserialization.deserializeClasses       100  avgt   10    33.736 :  0.171  us/op
> RecordDeserialization.deserializeClasses      1000  avgt   10   271.224 :  0.953  us/op
> RecordDeserialization.deserializeRecords        10  avgt   10    58.606 :  0.446  us/op
> RecordDeserialization.deserializeRecords       100  avgt   10   530.044 :  1.752  us/op
> RecordDeserialization.deserializeRecords      1000  avgt   10  5335.624 : 44.942  us/op
> 
> 
> ... since caching of adapted method handle for multiple objects withing 
> single stream is still effective.
> 
> I doubt there will ever be more than 10 variants/versions of the same 
> record class deserialized by the same VM and in rapid succession, so I 
> think this should not be an issue. We could add a system property to 
> control the MAX_SIZE of cache if you think it is needed.

Thanks for running the numbers on this, and I agree - it seems
outlandishly improbable (most will only see one, and if you have to
maintain 10+ different serialized shapes of some record you likely have 
bigger problems).

I'd say let's keep it constant unless someone actually asks for it.

/Claes

> 
> 
> Regards, Peter
> 
> 
> On 6/22/20 1:04 AM, Claes Redestad wrote:
>> Hi Peter,
>>
>> patch and results look great!
>>
>> My only real comment on this is that I think the microbenchmark would be
>> a valuable contribution, too.
>>
>> It'd also be interesting to explore how poor performance would become if
>> we'd hit the (artificial) 11 layouts limit, e.g, by cycling through
>> 10, 11, or 12 different shapes.
>>
>> /Claes
>>
>> On 2020-06-21 19:16, Peter Levart wrote:
>>> Hi,
>>>
>>>
>>> When re-running the benchmark [1] with different lengths of 
>>> serialized arrays of records, I found that, compared to classical 
>>> classes, lookup into the cache of adapted method handles starts to 
>>> show when the length of array is larger (# of instances of same 
>>> record type deserialized in single stream). Each record deserialized 
>>> must lookup the method handle in a ConcurrentHashMap:
>>>
>>>
>>> Benchmark                                    (length)  Mode Cnt    
>>> Score   Error  Units
>>> RecordSerializationBench.deserializeClasses        10  avgt 10    
>>> 8.088 ± 0.013  us/op
>>> RecordSerializationBench.deserializeClasses       100  avgt 10   
>>> 32.171 ± 0.324  us/op
>>> RecordSerializationBench.deserializeClasses      1000  avgt 10  
>>> 279.762 ± 3.072  us/op
>>> RecordSerializationBench.deserializeRecords        10  avgt 10    
>>> 9.011 ± 0.027  us/op
>>> RecordSerializationBench.deserializeRecords       100  avgt 10   
>>> 33.206 ± 0.514  us/op
>>> RecordSerializationBench.deserializeRecords      1000  avgt 10  
>>> 325.137 ± 0.969  us/op
>>>
>>>
>>> ...so keeping the correctly shaped adapted method handle in the 
>>> per-serialization-session ObjectStreamClass instance [2] starts to 
>>> make sense:
>>>
>>>
>>> Benchmark                                    (length)  Mode Cnt    
>>> Score   Error  Units
>>> RecordSerializationBench.deserializeClasses        10  avgt 10    
>>> 8.681 ± 0.155  us/op
>>> RecordSerializationBench.deserializeClasses       100  avgt 10   
>>> 32.496 ± 0.087  us/op
>>> RecordSerializationBench.deserializeClasses      1000  avgt 10  
>>> 279.014 ± 1.189  us/op
>>> RecordSerializationBench.deserializeRecords        10  avgt 10    
>>> 8.537 ± 0.032  us/op
>>> RecordSerializationBench.deserializeRecords       100  avgt 10   
>>> 31.451 ± 0.083  us/op
>>> RecordSerializationBench.deserializeRecords      1000  avgt 10  
>>> 250.854 ± 2.772  us/op
>>>
>>>
>>> With that, more objects means advantage over classical classes 
>>> instead of disadvantage.
>>>
>>>
>>> [1] 
>>> http://cr.openjdk.java.net/~plevart/jdk-dev/RecordsDeserialization/RecordSerializationBench.java 
>>>
>>>
>>> [2] 
>>> http://cr.openjdk.java.net/~plevart/jdk-dev/RecordsDeserialization/webrev.06/ 
>>>
>>>
>>>
>>> Regards, Peter
>>>
>>>