RFR: 8247532: Records deserialization is slow
Peter Levart
peter.levart at gmail.com
Tue Jun 23 07:49:50 UTC 2020
Hi Chris, Claes,
Ok then, here's with benchmark included:
http://cr.openjdk.java.net/~plevart/jdk-dev/RecordsDeserialization/webrev.07/
I haven't been able to run the benchmark with "make test" though. I have
tried various ways to pass javac options to the build, like:
make test TEST='micro:org.openjdk.bench.java.io.RecordDeserialization' TEST_OPTS="VM_OPTIONS=--enable-preview --release=16"
...but javac doesn't seem to get them. Is there some secret option to
achieve that?
Otherwise, I simulated what would happen when there are more than 10
ObjectStreamClass layouts for the same class rapidly interchanging, so that
they push each other out of the cache, by temporarily setting the cache's
MAX_SIZE = 0. Note that this is the worst-case scenario:
Benchmark (length) Mode Cnt Score Error Units
RecordDeserializationBench.deserializeClasses 10 avgt 10 9.393 ± 0.287 us/op
RecordDeserializationBench.deserializeClasses 100 avgt 10 35.642 ± 0.977 us/op
RecordDeserializationBench.deserializeClasses 1000 avgt 10 293.769 ± 7.321 us/op
RecordDeserializationBench.deserializeRecords 10 avgt 10 15.335 ± 0.496 us/op
RecordDeserializationBench.deserializeRecords 100 avgt 10 211.427 ± 11.908 us/op
RecordDeserializationBench.deserializeRecords 1000 avgt 10 990.398 ± 26.681 us/op
This is using the JMH option '-gc true' to force a GC after each iteration of
the benchmark. Without it, I get a big (~4s) full-GC pause right in the
middle of a run with length=100:
Iteration 1: 528.577 us/op
Iteration 2: 580.404 us/op
Iteration 3: 4438.228 us/op
Iteration 4: 644.532 us/op
Iteration 5: 698.493 us/op
Iteration 6: 800.738 us/op
Iteration 7: 929.791 us/op
Iteration 8: 870.946 us/op
Iteration 9: 863.416 us/op
Iteration 10: 916.508 us/op
...so the results are a bit off because of that:
Benchmark (length) Mode Cnt Score Error Units
RecordDeserializationBench.deserializeClasses 10 avgt 10 8.263 ± 0.043 us/op
RecordDeserializationBench.deserializeClasses 100 avgt 10 33.406 ± 0.160 us/op
RecordDeserializationBench.deserializeClasses 1000 avgt 10 287.595 ± 0.960 us/op
RecordDeserializationBench.deserializeRecords 10 avgt 10 15.270 ± 0.080 us/op
RecordDeserializationBench.deserializeRecords 100 avgt 10 1127.163 ± 1771.892 us/op
RecordDeserializationBench.deserializeRecords 1000 avgt 10 2003.235 ± 227.159 us/op
Yes, there is quite a bit of GCing going on when the cache is thrashing.
Note that I haven't tuned GC in any way, and I'm running this on a
machine with 64 GiB of RAM, so the heap is allowed to grow quite big and G1
is used by default, I think.
This is still no worse than before the patch:
Benchmark (length) Mode Cnt Score Error Units
RecordDeserialization.deserializeClasses 10 avgt 10 8.382 ± 0.013 us/op
RecordDeserialization.deserializeClasses 100 avgt 10 33.736 ± 0.171 us/op
RecordDeserialization.deserializeClasses 1000 avgt 10 271.224 ± 0.953 us/op
RecordDeserialization.deserializeRecords 10 avgt 10 58.606 ± 0.446 us/op
RecordDeserialization.deserializeRecords 100 avgt 10 530.044 ± 1.752 us/op
RecordDeserialization.deserializeRecords 1000 avgt 10 5335.624 ± 44.942 us/op
...since caching of the adapted method handle for multiple objects within a
single stream is still effective.
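
For clarity, here is a rough sketch (in no way the actual webrev code;
names like ADAPTED_CTR_CACHE and layoutKey() are invented for illustration)
of what that two-level caching amounts to: a shared cache keyed by the
stream field layout, plus a per-serialization-session ObjectStreamClass
field, so only the first record of a given shape in a stream pays for the
ConcurrentHashMap lookup:

    import java.lang.invoke.MethodHandle;
    import java.util.concurrent.ConcurrentHashMap;

    final class StreamClassSketch {

        // shared cache: stream field layout -> adapted constructor MH
        // (the size bounding / MAX_SIZE logic is omitted from this sketch)
        private static final ConcurrentHashMap<Object, MethodHandle> ADAPTED_CTR_CACHE =
            new ConcurrentHashMap<>();

        // per-serialization-session cache; this instance is private to one
        // stream, so a plain field is enough
        private MethodHandle adaptedCtr;

        MethodHandle recordConstructor() {
            MethodHandle mh = adaptedCtr;
            if (mh == null) {
                // first record of this shape in the stream: do the
                // (comparatively expensive) shared-cache lookup and remember
                // the result locally for the rest of the stream
                mh = ADAPTED_CTR_CACHE.computeIfAbsent(layoutKey(), k -> adaptConstructor());
                adaptedCtr = mh;
            }
            return mh;
        }

        private Object layoutKey() {
            // hypothetical: a key derived from the record class and the
            // stream field layout
            return this;
        }

        private MethodHandle adaptConstructor() {
            // hypothetical: adapt the canonical constructor to the stream
            // field layout
            throw new UnsupportedOperationException("illustration only");
        }
    }
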
I doubt there will ever be more than 10 variants/versions of the same
record class deserialized by the same VM in rapid succession, so I don't
think this will be an issue in practice. We could add a system property to
control the MAX_SIZE of the cache if you think it is needed.
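
If we did, it could be something as simple as the following (the property
name below is just a placeholder I made up, not a proposal):

    // hypothetical sketch: make the cache bound configurable, defaulting to
    // the current hard-coded value of 10 (real JDK code would likely read
    // the property in a privileged block during initialization)
    private static final int MAX_SIZE =
        Integer.getInteger("jdk.serialization.recordCacheSize", 10);
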
Regards, Peter
On 6/22/20 1:04 AM, Claes Redestad wrote:
> Hi Peter,
>
> patch and results look great!
>
> My only real comment on this is that I think the microbenchmark would be
> a valuable contribution, too.
>
> It'd also be interesting to explore how poor performance would become if
> we'd hit the (artificial) 11 layouts limit, e.g., by cycling through
> 10, 11, or 12 different shapes.
>
> /Claes
>
> On 2020-06-21 19:16, Peter Levart wrote:
>> Hi,
>>
>>
>> When re-running the benchmark [1] with different lengths of
>> serialized arrays of records, I found that, compared to classical
>> classes, lookup into the cache of adapted method handles starts to
>> show when the length of array is larger (# of instances of same
>> record type deserialized in single stream). Each record deserialized
>> must lookup the method handle in a ConcurrentHashMap:
>>
>>
>> Benchmark (length) Mode Cnt Score Error Units
>> RecordSerializationBench.deserializeClasses 10 avgt 10 8.088 ± 0.013 us/op
>> RecordSerializationBench.deserializeClasses 100 avgt 10 32.171 ± 0.324 us/op
>> RecordSerializationBench.deserializeClasses 1000 avgt 10 279.762 ± 3.072 us/op
>> RecordSerializationBench.deserializeRecords 10 avgt 10 9.011 ± 0.027 us/op
>> RecordSerializationBench.deserializeRecords 100 avgt 10 33.206 ± 0.514 us/op
>> RecordSerializationBench.deserializeRecords 1000 avgt 10 325.137 ± 0.969 us/op
>>
>>
>> ...so keeping the correctly shaped adapted method handle in the
>> per-serialization-session ObjectStreamClass instance [2] starts to
>> make sense:
>>
>>
>> Benchmark (length) Mode Cnt Score Error Units
>> RecordSerializationBench.deserializeClasses 10 avgt 10 8.681 ± 0.155 us/op
>> RecordSerializationBench.deserializeClasses 100 avgt 10 32.496 ± 0.087 us/op
>> RecordSerializationBench.deserializeClasses 1000 avgt 10 279.014 ± 1.189 us/op
>> RecordSerializationBench.deserializeRecords 10 avgt 10 8.537 ± 0.032 us/op
>> RecordSerializationBench.deserializeRecords 100 avgt 10 31.451 ± 0.083 us/op
>> RecordSerializationBench.deserializeRecords 1000 avgt 10 250.854 ± 2.772 us/op
>>
>>
>> With that, more objects means advantage over classical classes
>> instead of disadvantage.
>>
>>
>> [1]
>> http://cr.openjdk.java.net/~plevart/jdk-dev/RecordsDeserialization/RecordSerializationBench.java
>>
>>
>> [2]
>> http://cr.openjdk.java.net/~plevart/jdk-dev/RecordsDeserialization/webrev.06/
>>
>>
>>
>> Regards, Peter
>>
>>