RFR: 8247532, 8248135: Records deserialization is slow + Build microbenchmarks with --enable-preview
Peter Levart
peter.levart at gmail.com
Tue Jun 23 09:28:54 UTC 2020
Thanks, Magnus.
Peter
On 6/23/20 11:27 AM, Magnus Ihse Bursie wrote:
>
>
> On 2020-06-23 11:17, Peter Levart wrote:
>> Including build-dev since this patch is adding new issue 8248135:
>>
>>
>> https://bugs.openjdk.java.net/browse/JDK-8248135
>>
>>
>> So here's new webrev with a patch for building benchmarks with
>> --enable-preview included:
>>
>>
>> http://cr.openjdk.java.net/~plevart/jdk-dev/RecordsDeserialization/webrev.08/
>>
> Build changes look good.
>
> /Magnus
>>
>>
>> Regards, Peter
>>
>>
>> On 6/23/20 10:23 AM, Claes Redestad wrote:
>>>
>>>
>>> On 2020-06-23 10:06, Claes Redestad wrote:
>>>> Hi,
>>>>
>>>> On 2020-06-23 09:49, Peter Levart wrote:
>>>>> Hi Chris, Claes,
>>>>>
>>>>>
>>>>> Ok then, here's with benchmark included:
>>>>>
>>>>>
>>>>> http://cr.openjdk.java.net/~plevart/jdk-dev/RecordsDeserialization/webrev.07/
>>>>>
>>>>>
>>>>>
>>>>> I haven't been able to run the benchmark with "make test" though.
>>>>> I have tried various ways to pass javac options to build like:
>>>>>
>>>>>
>>>>> make test
>>>>> TEST='micro:org.openjdk.bench.java.io.RecordDeserialization'
>>>>> TEST_OPTS="VM_OPTIONS=--enable-preview --release=16"
>>>>>
>>>>>
>>>>> ...but javac doesn't seem to get them. Is there some secret option
>>>>> to achieve that?
>>>>
>>>> Hmm, we might as well have the microbenchmarks build with
>>>> --enable-preview on by default. Try this:
>>>
>>> Fixed:
>>>
>>> diff -r f2e1cd498381 make/test/BuildMicrobenchmark.gmk
>>> --- a/make/test/BuildMicrobenchmark.gmk Tue Jun 23 10:08:35 2020 +0200
>>> +++ b/make/test/BuildMicrobenchmark.gmk Tue Jun 23 10:33:17 2020 +0200
>>> @@ -90,10 +90,11 @@
>>> TARGET_RELEASE := $(TARGET_RELEASE_NEWJDK_UPGRADED), \
>>> SMALL_JAVA := false, \
>>> CLASSPATH := $(MICROBENCHMARK_CLASSPATH), \
>>> - DISABLED_WARNINGS := processing rawtypes cast serial, \
>>> + DISABLED_WARNINGS := processing rawtypes cast serial preview, \
>>> SRC := $(MICROBENCHMARK_SRC), \
>>> BIN := $(MICROBENCHMARK_CLASSES), \
>>> JAVA_FLAGS := --add-modules jdk.unsupported --limit-modules
>>> java.management, \
>>> + JAVAC_FLAGS := --enable-preview, \
>>> ))
>>>
>>> $(BUILD_JDK_MICROBENCHMARK): $(JMH_COMPILE_JARS)
>>>
>>> I verified this works with your micro, and doesn't seem to cause
>>> any issues elsewhere:
>>>
>>> make test TEST=micro:RecordDeserialization
>>>
>>> I can shepherd this as a separate fix for documentation purposes, but
>>> feel free to include it in your patch and ping build-dev at ..
>>>
>>> /Claes
>>>
>>>>
>>>> diff -r 52741f85bf23 make/test/BuildMicrobenchmark.gmk
>>>> --- a/make/test/BuildMicrobenchmark.gmk Tue Jun 23 09:54:42 2020
>>>> +0200
>>>> +++ b/make/test/BuildMicrobenchmark.gmk Tue Jun 23 09:59:29 2020
>>>> +0200
>>>> @@ -93,7 +93,7 @@
>>>> DISABLED_WARNINGS := processing rawtypes cast serial, \
>>>> SRC := $(MICROBENCHMARK_SRC), \
>>>> BIN := $(MICROBENCHMARK_CLASSES), \
>>>> - JAVA_FLAGS := --add-modules jdk.unsupported --limit-modules
>>>> java.management, \
>>>> + JAVA_FLAGS := --enable-preview --add-modules jdk.unsupported
>>>> --limit-modules java.management, \
>>>> ))
>>>>
>>>> $(BUILD_JDK_MICROBENCHMARK): $(JMH_COMPILE_JARS)
>>>>
>>>>>
>>>>>
>>>>> Otherwise, I simulated what would happen when there are more then
>>>>> 10 ObjectStreamClass layouts for same class rapidly interchanging,
>>>>> so that they push each other out of the cache, by temporarily
>>>>> setting cache's MAX_SIZE = 0. Note that this is worst case scenario:
>>>>>
>>>>>
>>>>> Benchmark (length) Mode
>>>>> Cnt Score Error Units
>>>>> RecordDeserializationBench.deserializeClasses 10 avgt
>>>>> 10 9.393 ± 0.287 us/op
>>>>> RecordDeserializationBench.deserializeClasses 100 avgt
>>>>> 10 35.642 ± 0.977 us/op
>>>>> RecordDeserializationBench.deserializeClasses 1000 avgt 10
>>>>> 293.769 ± 7.321 us/op
>>>>> RecordDeserializationBench.deserializeRecords 10 avgt
>>>>> 10 15.335 ± 0.496 us/op
>>>>> RecordDeserializationBench.deserializeRecords 100 avgt 10
>>>>> 211.427 ± 11.908 us/op
>>>>> RecordDeserializationBench.deserializeRecords 1000 avgt 10
>>>>> 990.398 ± 26.681 us/op
>>>>>
>>>>>
>>>>> This is using JMH option '-gc true' to force GC after each
>>>>> iteration of benchmark. Without it, I get a big (~4s) full-GC
>>>>> pause just in the middle of a run with length=100:
>>>>>
>>>>>
>>>>> Iteration 1: 528.577 us/op
>>>>> Iteration 2: 580.404 us/op
>>>>> Iteration 3: 4438.228 us/op
>>>>> Iteration 4: 644.532 us/op
>>>>> Iteration 5: 698.493 us/op
>>>>> Iteration 6: 800.738 us/op
>>>>> Iteration 7: 929.791 us/op
>>>>> Iteration 8: 870.946 us/op
>>>>> Iteration 9: 863.416 us/op
>>>>> Iteration 10: 916.508 us/op
>>>>>
>>>>>
>>>>> ...so results are a bit off because of that:
>>>>>
>>>>>
>>>>> Benchmark (length) Mode
>>>>> Cnt Score Error Units
>>>>> RecordDeserializationBench.deserializeClasses 10 avgt
>>>>> 10 8.263 ± 0.043 us/op
>>>>> RecordDeserializationBench.deserializeClasses 100 avgt
>>>>> 10 33.406 ± 0.160 us/op
>>>>> RecordDeserializationBench.deserializeClasses 1000 avgt
>>>>> 10 287.595 ± 0.960 us/op
>>>>> RecordDeserializationBench.deserializeRecords 10 avgt
>>>>> 10 15.270 ± 0.080 us/op
>>>>> RecordDeserializationBench.deserializeRecords 100 avgt 10
>>>>> 1127.163 ± 1771.892 us/op
>>>>> RecordDeserializationBench.deserializeRecords 1000 avgt 10
>>>>> 2003.235 ± 227.159 us/op
>>>>>
>>>>>
>>>>> Yes, there is quite a bit of GCing going on when cache is
>>>>> thrashing. Note that I haven't tuned GC in any way and I'm running
>>>>> this on a machine with 64GiB of RAM so heap is allowed to grow
>>>>> quite big and G1 is used by default I think.
>>>>>
>>>>>
>>>>> This is still no worse than before the patch:
>>>>>
>>>>>
>>>>> Benchmark (length) Mode Cnt
>>>>> Score Error Units
>>>>> RecordDeserialization.deserializeClasses 10 avgt 10
>>>>> 8.382 : 0.013 us/op
>>>>> RecordDeserialization.deserializeClasses 100 avgt 10
>>>>> 33.736 : 0.171 us/op
>>>>> RecordDeserialization.deserializeClasses 1000 avgt 10
>>>>> 271.224 : 0.953 us/op
>>>>> RecordDeserialization.deserializeRecords 10 avgt 10
>>>>> 58.606 : 0.446 us/op
>>>>> RecordDeserialization.deserializeRecords 100 avgt 10
>>>>> 530.044 : 1.752 us/op
>>>>> RecordDeserialization.deserializeRecords 1000 avgt 10
>>>>> 5335.624 : 44.942 us/op
>>>>>
>>>>>
>>>>> ... since caching of adapted method handle for multiple objects
>>>>> withing single stream is still effective.
>>>>>
>>>>> I doubt there will ever be more than 10 variants/versions of the
>>>>> same record class deserialized by the same VM and in rapid
>>>>> succession, so I think this should not be an issue. We could add a
>>>>> system property to control the MAX_SIZE of cache if you think it
>>>>> is needed.
>>>>
>>>> Thanks for running the numbers on this, and I agree - it seems
>>>> outlandishly improbable (most will only see one, and if you have to
>>>> maintain 10+ different serialized shapes of some record you likely
>>>> have bigger problems).
>>>>
>>>> I'd say let's keep it constant unless someone actually asks for it.
>>>>
>>>> /Claes
>>>>
>>>>>
>>>>>
>>>>> Regards, Peter
>>>>>
>>>>>
>>>>> On 6/22/20 1:04 AM, Claes Redestad wrote:
>>>>>> Hi Peter,
>>>>>>
>>>>>> patch and results look great!
>>>>>>
>>>>>> My only real comment on this is that I think the microbenchmark
>>>>>> would be
>>>>>> a valuable contribution, too.
>>>>>>
>>>>>> It'd also be interesting to explore how poor performance would
>>>>>> become if
>>>>>> we'd hit the (artificial) 11 layouts limit, e.g, by cycling through
>>>>>> 10, 11, or 12 different shapes.
>>>>>>
>>>>>> /Claes
>>>>>>
>>>>>> On 2020-06-21 19:16, Peter Levart wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>>
>>>>>>> When re-running the benchmark [1] with different lengths of
>>>>>>> serialized arrays of records, I found that, compared to
>>>>>>> classical classes, lookup into the cache of adapted method
>>>>>>> handles starts to show when the length of array is larger (# of
>>>>>>> instances of same record type deserialized in single stream).
>>>>>>> Each record deserialized must lookup the method handle in a
>>>>>>> ConcurrentHashMap:
>>>>>>>
>>>>>>>
>>>>>>> Benchmark (length) Mode Cnt
>>>>>>> Score Error Units
>>>>>>> RecordSerializationBench.deserializeClasses 10 avgt 10
>>>>>>> 8.088 ± 0.013 us/op
>>>>>>> RecordSerializationBench.deserializeClasses 100 avgt 10
>>>>>>> 32.171 ± 0.324 us/op
>>>>>>> RecordSerializationBench.deserializeClasses 1000 avgt 10
>>>>>>> 279.762 ± 3.072 us/op
>>>>>>> RecordSerializationBench.deserializeRecords 10 avgt 10
>>>>>>> 9.011 ± 0.027 us/op
>>>>>>> RecordSerializationBench.deserializeRecords 100 avgt 10
>>>>>>> 33.206 ± 0.514 us/op
>>>>>>> RecordSerializationBench.deserializeRecords 1000 avgt 10
>>>>>>> 325.137 ± 0.969 us/op
>>>>>>>
>>>>>>>
>>>>>>> ...so keeping the correctly shaped adapted method handle in the
>>>>>>> per-serialization-session ObjectStreamClass instance [2] starts
>>>>>>> to make sense:
>>>>>>>
>>>>>>>
>>>>>>> Benchmark (length) Mode Cnt
>>>>>>> Score Error Units
>>>>>>> RecordSerializationBench.deserializeClasses 10 avgt 10
>>>>>>> 8.681 ± 0.155 us/op
>>>>>>> RecordSerializationBench.deserializeClasses 100 avgt 10
>>>>>>> 32.496 ± 0.087 us/op
>>>>>>> RecordSerializationBench.deserializeClasses 1000 avgt 10
>>>>>>> 279.014 ± 1.189 us/op
>>>>>>> RecordSerializationBench.deserializeRecords 10 avgt 10
>>>>>>> 8.537 ± 0.032 us/op
>>>>>>> RecordSerializationBench.deserializeRecords 100 avgt 10
>>>>>>> 31.451 ± 0.083 us/op
>>>>>>> RecordSerializationBench.deserializeRecords 1000 avgt 10
>>>>>>> 250.854 ± 2.772 us/op
>>>>>>>
>>>>>>>
>>>>>>> With that, more objects means advantage over classical classes
>>>>>>> instead of disadvantage.
>>>>>>>
>>>>>>>
>>>>>>> [1]
>>>>>>> http://cr.openjdk.java.net/~plevart/jdk-dev/RecordsDeserialization/RecordSerializationBench.java
>>>>>>>
>>>>>>>
>>>>>>> [2]
>>>>>>> http://cr.openjdk.java.net/~plevart/jdk-dev/RecordsDeserialization/webrev.06/
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Regards, Peter
>>>>>>>
>>>>>>>
>
More information about the build-dev
mailing list