RFR: JDK-8184947: ZipCoder performance improvements
Xueming Shen
xueming.shen at oracle.com
Fri Dec 8 23:09:31 UTC 2017
Hi,
Please help review the changes for j.u.z.ZipCoder/JDK-8184947, which also
include cleanup/improvement work in java.lang.StringCoding.java to speed up
general String coding performance, especially for UTF8.
issue: https://bugs.openjdk.java.net/browse/JDK-8184947
webrev: http://cr.openjdk.java.net/~sherman/8184947/webrev
jmh benchmark:
http://cr.openjdk.java.net/~sherman/8184947/ZipCodingBM.java
http://cr.openjdk.java.net/~sherman/8184947/StringCodingBM.java
Notes:
(1) StringCoding.de/encode() for new String()/String.getBytes() with the
default charset.
For historical reasons, the existing SC.decode(byte[], off, len)/
encode(coder, val) implementation has code to handle any "possible"
UnsupportedEncodingException and fall back to the slow "charset name"
version of de/encode() for the real work. Given that
Charset.defaultCharset() now returns UTF8 as the fallback default charset
if anything goes wrong while obtaining the default charset (we did that in
jdk7 or 8?), there is actually no need to handle the UEE. This also
provides the opportunity to use the fast path for stateless
UTF8/8859-1/ASCII de/encode(). The benchmark data for
newString_xxx/getBytes_xxx (which use the default encoding, UTF8 in this
case) suggests a big speedup for ascii-only Strings.
StringCodingBM    (size)  Mode  Cnt  NEW Score     Error  OLD Score     Error  Units
getBytes_ASCII        16  avgt    5     21.155 ±   5.586     63.777 ±  54.262  ns/op
getBytes_ASCII        64  avgt    5     20.854 ±   6.237     98.988 ±  62.932  ns/op
getBytes_ASCII       256  avgt    5     38.291 ±   8.494    272.306 ±  77.951  ns/op
getBytes_Latin        16  avgt    5     80.968 ±  15.814     76.769 ±  38.512  ns/op
getBytes_Latin        64  avgt    5    163.078 ±  51.993    219.085 ±  42.665  ns/op
getBytes_Latin       256  avgt    5    759.548 ±  99.386    824.594 ± 763.735  ns/op
getBytes_Unicode      16  avgt    5     94.311 ±  22.189    124.185 ±  32.751  ns/op
getBytes_Unicode      64  avgt    5    289.603 ± 152.056    321.541 ± 103.703  ns/op
getBytes_Unicode     256  avgt    5   1253.098 ± 216.243   1201.667 ± 512.532  ns/op
newString_ASCII       16  avgt    5     33.273 ±  13.780     50.402 ±  17.574  ns/op
newString_ASCII       64  avgt    5     30.420 ±   6.207     84.989 ±  43.355  ns/op
newString_ASCII      256  avgt    5     54.391 ±  10.451    208.096 ± 102.716  ns/op
newString_Latin       16  avgt    5    115.606 ±   7.181    114.186 ±  36.310  ns/op
newString_Latin       64  avgt    5    393.710 ±  73.478    414.286 ± 176.837  ns/op
newString_Latin      256  avgt    5   1618.967 ± 289.044   1551.499 ± 487.904  ns/op
newString_Unicode     16  avgt    5    104.848 ±  32.694    127.558 ±  12.029  ns/op
newString_Unicode     64  avgt    5    377.894 ± 147.731    374.779 ±  53.028  ns/op
newString_Unicode    256  avgt    5   1557.977 ± 318.652   1457.236 ± 284.424  ns/op
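To illustrate the point in (1) from the caller's side, here is a minimal sketch using only public API. It is not code from the webrev; the class and method names (DefaultCharsetSketch, encodeDefault, encodeByName) are made up for illustration. It contrasts the default-charset path, which has no checked exception to handle, with the "charset name" path, which has to look the name up and can throw UnsupportedEncodingException.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

// Hypothetical illustration class, not part of the patch.
class DefaultCharsetSketch {

    // Default-charset path: Charset.defaultCharset() always returns a
    // usable Charset, so there is no UEE to handle here.
    static byte[] encodeDefault(String s) {
        return s.getBytes(Charset.defaultCharset());
    }

    // "Charset name" path: the name has to be looked up first, and an
    // unknown name surfaces as a checked UnsupportedEncodingException.
    static byte[] encodeByName(String s, String charsetName)
            throws UnsupportedEncodingException {
        return s.getBytes(charsetName);
    }

    public static void main(String[] args) {
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println(encodeDefault("hello").length + " bytes");
        try {
            encodeByName("hello", "no-such-charset");
        } catch (UnsupportedEncodingException e) {
            System.out.println("UEE for unknown name: " + e.getMessage());
        }
    }
}
```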
(2) Updated to use the "fast path" for UTF8/8859-1/ASCII in all de/encoding
operations, which are all implemented in static/stateless methods. (A
benchmark for MS932 [4] is provided to make sure there is no regression for
"other" charsets.)
(3) Added a "fast path" for "ascii-only" bytes for utf8
encoding/getBytes(). The benchmark [1] suggests a big speedup for
ascii-only getBytes() with limited cost to non-ascii-only cases. (This
helps a lot for (4), the ZipCoder situation, which mainly uses ascii only.)
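The idea behind (3) can be sketched roughly as follows. This is an illustration only, not the code in the webrev: it assumes the input is a Latin-1 byte[] (one byte per char, as in a compact String), and the names AsciiFastPathSketch, hasNegatives and latin1ToUtf8 are chosen here for the example. If no byte has its high bit set, the UTF-8 form is byte-for-byte identical and a plain copy is enough; otherwise each byte >= 0x80 expands to two UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration class, not part of the patch.
class AsciiFastPathSketch {

    // True if any byte in the array is outside the 7-bit ASCII range.
    static boolean hasNegatives(byte[] ba) {
        for (byte b : ba) {
            if (b < 0) {
                return true;
            }
        }
        return false;
    }

    // Encodes Latin-1 bytes (one byte per char) as UTF-8.
    static byte[] latin1ToUtf8(byte[] val) {
        // Fast path: ascii-only input is already valid UTF-8.
        if (!hasNegatives(val)) {
            return Arrays.copyOf(val, val.length);
        }
        // Slow path: each non-ascii Latin-1 char needs two UTF-8 bytes.
        byte[] dst = new byte[val.length << 1];
        int dp = 0;
        for (byte b : val) {
            if (b < 0) {
                dst[dp++] = (byte) (0xc0 | ((b & 0xff) >> 6));
                dst[dp++] = (byte) (0x80 | (b & 0x3f));
            } else {
                dst[dp++] = b;
            }
        }
        return Arrays.copyOf(dst, dp);
    }

    public static void main(String[] args) {
        byte[] name = "META-INF/MANIFEST.MF".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = latin1ToUtf8(name);
        System.out.println(new String(utf8, StandardCharsets.UTF_8));
    }
}
```

The trade-off is one extra scan over the input in the non-ascii case, which matches the "limited cost to non-ascii-only cases" noted above.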
(4) java.util.zip.ZipCoder
This is where this patch actually started from. As the rfe suggested, we
are now using byte[] as the internal storage for the String class, so the
optimization we put in ZipCoder for UTF8 (which uses the byte[]/char[]
interface of our UTF8 implementation to help avoid the relatively heavy
ByteBuffer/CharBuffer coding interface) now appears to be not that
"optimized". The to/from char[] copying has become a waste.
The ZipCoder implementation can't use new String()/String.getBytes()
directly because of the different malformed/unmappable character handling
requirement. The proposed change here is to add a pair of special
new String()/String.getBytes() methods in the StringCoding class that throw
IAE instead of doing silent replacement, via (yet another) SharedSecrets
interface. This brings us much faster de/encoding (30%-50% speedup) and
much less memory usage (no more unnecessary byte[]/char[] allocation, and
in default mode there is only ONE utf8 ZipCoder) on all "Jar/ZipEntry"
related access operations.
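The difference in malformed-input handling described in (4) can be shown with public API only. The sketch below is not the SharedSecrets-based code from the webrev; it goes through CharsetDecoder (the "heavy" interface the patch tries to avoid) purely to demonstrate the required behavior, and the names StrictUtf8Sketch, decodeLenient and decodeStrict are made up for the example. new String(bytes, UTF_8) silently substitutes U+FFFD for malformed input, while the strict variant reports it as an IllegalArgumentException.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class, not part of the patch.
class StrictUtf8Sketch {

    // Lenient: malformed bytes are silently replaced with U+FFFD.
    static String decodeLenient(byte[] ba) {
        return new String(ba, StandardCharsets.UTF_8);
    }

    // Strict: malformed or unmappable input is reported as an IAE,
    // which is the behavior zip entry names/comments need.
    static String decodeStrict(byte[] ba) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(ba))
                    .toString();
        } catch (CharacterCodingException x) {
            throw new IllegalArgumentException("malformed input", x);
        }
    }

    public static void main(String[] args) {
        byte[] bad = { 'a', (byte) 0xff };   // 0xff never appears in UTF-8
        System.out.println(decodeLenient(bad).length() + " chars (lenient)");
        try {
            decodeStrict(bad);
        } catch (IllegalArgumentException e) {
            System.out.println("strict decode rejected the input");
        }
    }
}
```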
ZipCodeBenchMark [latest]
* "New Score" is with the patch
* getEntry() is mainly String.getBytes(); entries()/stream() is
mainly new String(bytes).
Benchmark     Mode  Cnt  New Score   Error  Old Score   Error  Units
jf_entries    avgt   20      0.582 ± 0.036      0.953 ± 0.108  ms/op
jf_getEntry   avgt   20      1.506 ± 0.158      2.052 ± 0.171  ms/op
jf_stream     avgt   20      0.698 ± 0.060      0.940 ± 0.067  ms/op
zf_entries    avgt   20      0.691 ± 0.057      0.917 ± 0.080  ms/op
zf_getEntry   avgt   20      1.459 ± 0.180      2.081 ± 0.161  ms/op
zf_stream     avgt   20      0.626 ± 0.074      0.909 ± 0.075  ms/op
Thanks,
Sherman
[1] http://cr.openjdk.java.net/~sherman/8184947/StringCoding.utf8
[2] http://cr.openjdk.java.net/~sherman/8184947/StringCoding.8859_1
[3] http://cr.openjdk.java.net/~sherman/8184947/StringCoding.ascii
[4] http://cr.openjdk.java.net/~sherman/8184947/StringCoding.ms932
[5] http://cr.openjdk.java.net/~sherman/8184947/ZipCoding.bm