RFR: JDK-8184947: ZipCoder performance improvements

Fri Dec 8 23:09:31 UTC 2017

Hi,

Please help review the changes for j.u.z.ZipCoder/JDK-8184947, which also
include cleanup/improvement work in java.lang.StringCoding to speed up
general String coding performance, especially for UTF8.

issue: https://bugs.openjdk.java.net/browse/JDK-8184947
webrev: http://cr.openjdk.java.net/~sherman/8184947/webrev

jmh benchmark:
http://cr.openjdk.java.net/~sherman/8184947/ZipCodingBM.java
http://cr.openjdk.java.net/~sherman/8184947/StringCodingBM.java

Notes:

(1) StringCoding.de/encode() for new String()/String.getBytes() with the
default charset.

For historical reasons the existing SC.decode(byte[], off, len)/
encode(coder, val) implementation has code to handle any "possible"
UnsupportedEncodingException and to fall back to the slow "charset name"
version of de/encode() for the real work. Given that
Charset.defaultCharset() now returns UTF8 as the fallback if anything goes
wrong while obtaining the default charset (we did that in jdk7 or 8?),
there is actually no need to handle the UEE. This also provides the
opportunity to use the fast path for stateless UTF8/8859-1/ASCII
de/encode(). The benchmark data for newString_xxx/getBytes_xxx (which use
the default encoding, UTF8 in this case) suggests a big speedup for
ascii-only Strings.

StringCodingBM        (size)  Mode  Cnt   NEW Score     Error   OLD Score      Error  Units

getBytes_ASCII            16  avgt    5    21.155 ±   5.586      63.777 ±   54.262  ns/op
getBytes_ASCII            64  avgt    5    20.854 ±   6.237      98.988 ±   62.932  ns/op
getBytes_ASCII           256  avgt    5    38.291 ±   8.494     272.306 ±   77.951  ns/op
getBytes_Latin            16  avgt    5    80.968 ±  15.814      76.769 ±   38.512  ns/op
getBytes_Latin            64  avgt    5   163.078 ±  51.993     219.085 ±   42.665  ns/op
getBytes_Latin           256  avgt    5   759.548 ±  99.386     824.594 ±  763.735  ns/op
getBytes_Unicode          16  avgt    5    94.311 ±  22.189     124.185 ±   32.751  ns/op
getBytes_Unicode          64  avgt    5   289.603 ± 152.056     321.541 ±  103.703  ns/op
getBytes_Unicode         256  avgt    5  1253.098 ± 216.243    1201.667 ±  512.532  ns/op

newString_ASCII           16  avgt    5    33.273 ±  13.780      50.402 ±   17.574  ns/op
newString_ASCII           64  avgt    5    30.420 ±   6.207      84.989 ±   43.355  ns/op
newString_ASCII          256  avgt    5    54.391 ±  10.451     208.096 ±  102.716  ns/op
newString_Latin           16  avgt    5   115.606 ±   7.181     114.186 ±   36.310  ns/op
newString_Latin           64  avgt    5   393.710 ±  73.478     414.286 ±  176.837  ns/op
newString_Latin          256  avgt    5  1618.967 ± 289.044    1551.499 ±  487.904  ns/op
newString_Unicode         16  avgt    5   104.848 ±  32.694     127.558 ±   12.029  ns/op
newString_Unicode         64  avgt    5   377.894 ± 147.731     374.779 ±   53.028  ns/op
newString_Unicode        256  avgt    5  1557.977 ± 318.652    1457.236 ±  284.424  ns/op

(2) Updated all de/encoding operations to use the "fast path" for
UTF8/8859-1/ASCII, all of which are implemented in static/stateless
methods. (The benchmark for MS932 [4] is provided to make sure there is no
regression for "other" charsets.)
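
To illustrate the kind of dispatch (2) describes, here is a minimal sketch
using only public APIs. The class and method names are hypothetical and
this is not the code in the webrev; it only shows a static, stateless
8859-1 decode taken ahead of the generic charset path.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; the real logic lives in java.lang.StringCoding.
final class FastPathDispatchSketch {

    // Pick a stateless fast path for a well-known charset, else go generic.
    public static String decode(Charset cs, byte[] b, int off, int len) {
        if (StandardCharsets.ISO_8859_1.equals(cs)) {
            return decodeLatin1(b, off, len);
        }
        // Generic path: lets the charset's own decoder do the work.
        return new String(b, off, len, cs);
    }

    // 8859-1 maps each byte 0x00..0xFF directly to the same code point,
    // so no decoder object or intermediate buffer is needed.
    public static String decodeLatin1(byte[] b, int off, int len) {
        char[] chars = new char[len];
        for (int i = 0; i < len; i++) {
            chars[i] = (char) (b[off + i] & 0xff);
        }
        return new String(chars);
    }
}
```

The point of the static form is that no per-call coder state has to be
allocated or reset for these charsets.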

(3) Added a "fast path" for "ascii-only" bytes in utf8
encoding/getBytes(). The benchmark [1] suggests a big speedup for
ascii-only getBytes() with limited cost to the non-ascii-only cases. (This
helps a lot for (4), the ZipCoder situation, which is mainly ascii-only.)
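
The idea in (3) can be sketched roughly as follows. This is an
illustration under assumptions, not the patch itself: the helper name is
hypothetical, and it works on a String through public APIs rather than on
the internal byte[] storage.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of an ascii-only fast path for UTF-8 encoding.
final class AsciiFastPathSketch {

    public static byte[] encodeUtf8(String s) {
        int len = s.length();
        byte[] out = new byte[len];
        for (int i = 0; i < len; i++) {
            char c = s.charAt(i);
            if (c >= 0x80) {
                // First non-ascii char: give up and use the general encoder.
                // The work done so far is the "limited cost" for such input.
                return s.getBytes(StandardCharsets.UTF_8);
            }
            // An ascii char encodes to exactly one identical byte in UTF-8.
            out[i] = (byte) c;
        }
        return out;
    }
}
```

For ascii-only input the result is produced in a single pass with one
exactly sized allocation.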

(4) java.util.zip.ZipCoder

This is where this patch actually started from. As the rfe suggests, now
that we are using byte[] as the internal storage for the String class, the
optimization we put in ZipCoder for UTF8 (which uses the byte[]/char[]
interface of our UTF8 implementation to help avoid the relatively heavy
ByteBuffer/CharBuffer coding interface) now appears to be not that
"optimized". The copying to/from char[] has become a waste.

The ZipCoder implementation can't use new String()/String.getBytes()
directly because of the different malformed/unmappable character handling
requirement. The proposed change here is to add a pair of special
new String()/String.getBytes() variants in the StringCoding class that
throw IAE instead of silently replacing, via (yet another) SharedSecrets
interface. This brings us much faster de/encoding (30%-50% speedup) and
much less memory usage (no more unnecessary byte[]/char[] allocation, and
in default mode there is only ONE utf8 ZipCoder) on all "Jar/ZipEntry"
related access operations.
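
For readers who want to see the behavioral difference, here is a sketch of
the "throw instead of replace" semantics using the public
CharsetDecoder/CharsetEncoder API. The class and method names are
hypothetical. Note that this goes through the ByteBuffer/CharBuffer
interface, which is exactly the overhead the patch avoids by doing the
work inside StringCoding, so it only demonstrates the semantics, not the
performance.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of strict (no silent replacement) UTF-8 coding.
final class StrictUtf8Sketch {

    // Decode, throwing IAE on malformed input instead of emitting U+FFFD.
    public static String decodeOrThrow(byte[] bytes, int off, int len) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes, off, len))
                    .toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("malformed input", e);
        }
    }

    // Encode, throwing IAE on unencodable chars (e.g. a lone surrogate)
    // instead of substituting '?'.
    public static byte[] encodeOrThrow(String s) {
        try {
            ByteBuffer bb = StandardCharsets.UTF_8.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(s));
            byte[] out = new byte[bb.remaining()];
            bb.get(out);
            return out;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("unencodable input", e);
        }
    }
}
```

By contrast, plain new String(bytes, UTF_8) and String.getBytes(UTF_8)
replace bad input silently, which is not what a zip entry name lookup
wants.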

ZipCodeBenchMark [latest]
     * "New Score" is with the patch
     * getEntry() is mainly String.getBytes(); entries()/stream() is
       mainly new String(bytes).

                Mode  Cnt     New Score   Error      Old Score   Error   Units
jf_entries     avgt   20     0.582 ±    0.036      0.953 ±   0.108   ms/op
jf_getEntry    avgt   20     1.506 ±    0.158      2.052 ±   0.171   ms/op
jf_stream      avgt   20     0.698 ±    0.060      0.940 ±   0.067   ms/op
zf_entries     avgt   20     0.691 ±    0.057      0.917 ±   0.080   ms/op
zf_getEntry    avgt   20     1.459 ±    0.180      2.081 ±   0.161   ms/op
zf_stream      avgt   20     0.626 ±    0.074      0.909 ±   0.075   ms/op

Thanks,
Sherman

[1] http://cr.openjdk.java.net/~sherman/8184947/StringCoding.utf8
[2] http://cr.openjdk.java.net/~sherman/8184947/StringCoding.8859_1
[3] http://cr.openjdk.java.net/~sherman/8184947/StringCoding.ascii
[4] http://cr.openjdk.java.net/~sherman/8184947/StringCoding.ms932
[5] http://cr.openjdk.java.net/~sherman/8184947/ZipCoding.bm