Codereview request for 6653797: Reimplement JDK charset repository charsets.jar

Ulf Zibis Ulf.Zibis at gmx.de
Mon Jul 16 16:30:04 UTC 2012


Hi Sherman,

As I just said for 7183053, I can't look into the details at the moment, because I do not have a 
suitable environment installed.

Just one comment: I think it should be possible to share part of the mapping data across charsets, 
so that charsets.jar would shrink even further.

-Ulf


On 16.07.2012 00:12, Xueming Shen wrote:
> Hi
>
> This changeset migrates our JIS0201/0208/0212 based single/double-byte
> charsets to the new mapping based implementation. It is the leftover of the
> effort we put into JDK7 to re-implement most of the charsets in charsets.jar
> to (1) get better performance, (2) get a smaller storage footprint, and
> (3) ease the maintenance burden.
>
> http://cr.openjdk.java.net/~sherman/6653797/webrev/
>
> Notes on the implementation:
>
> (1) jis0201/0208/0212 and their variants are now generated from the mapping tables
> at build time. (See the new *.map, *.nr and *.c2b tables.)
>
> (2) EUC_JP/LINUX_OPEN, SJIS, PCK, ISO2022_JP and its variants now use these
> new jis0201/0208/0212 charsets.
>
> (3) The files in red (in the webrev) are the old implementation. Since no charset
> uses them anymore, I removed them from the repository.
>
> (4) There are two approaches for PCK and SJIS. PCK.java_v1 and SJIS.java_v1
> follow the old implementation, which decodes/encodes based on the
> jis0201/0208 (and variants) mappings via Ken's algorithm. This is known to be
> slow and buggy (the algorithm does not take care of illegal sjis code points,
> see #6653797 and http://cr.openjdk.java.net/~sherman/6653797/Sjis2Jis.java).
> So in this changeset I generated direct mapping tables for sjis and pck and use
> the "general" DoubleByte base class for these two charsets. This results in much
> faster decoding/encoding and correct mapping for all code points. The downside
> of this approach is that it adds about 50K (uncompressed) to charsets.jar.
> But given that this change removes about 300K from rt.jar, we still get
> a net reduction of about 250K, so I decided to go with this approach for
> better performance.
>
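To make the "direct mapping table" idea in note (4) concrete, here is a minimal sketch of a two-level lookup keyed by lead byte and trail byte. It is not the code from the webrev or the JDK's DoubleByte class: the class name `DirectTableSketch`, the constants, and the table layout are assumptions for illustration, and only two well-known Shift_JIS mappings are filled in. Because every trail-byte slot is pre-filled with an "unmappable" marker, illegal byte pairs fall out of the lookup without extra arithmetic.

```java
import java.util.Arrays;

// Illustrative only: a tiny two-level lookup table in the spirit of a
// "direct mapping" double-byte decoder. All names here are hypothetical.
final class DirectTableSketch {
    static final char UNMAPPABLE = '\uFFFD';

    // Trail-byte range covered by each row (assumed: 0x40..0xFC for Shift_JIS).
    static final int B2_MIN = 0x40;
    static final int B2_MAX = 0xFC;

    // One row per lead byte; a row stays null when the lead byte has no mappings.
    static final char[][] B2C = new char[256][];

    static {
        put(0x81, 0x40, '\u3000'); // IDEOGRAPHIC SPACE
        put(0x82, 0xA0, '\u3042'); // HIRAGANA LETTER A
    }

    private static void put(int b1, int b2, char c) {
        if (B2C[b1] == null) {
            B2C[b1] = new char[B2_MAX - B2_MIN + 1];
            Arrays.fill(B2C[b1], UNMAPPABLE); // unassigned slots decode as unmappable
        }
        B2C[b1][b2 - B2_MIN] = c;
    }

    // Decodes one double-byte sequence with two array lookups and no arithmetic
    // conversion; illegal or unassigned pairs yield UNMAPPABLE.
    static char decode(int b1, int b2) {
        if (b1 < 0 || b1 > 0xFF || B2C[b1] == null || b2 < B2_MIN || b2 > B2_MAX) {
            return UNMAPPABLE;
        }
        return B2C[b1][b2 - B2_MIN];
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(decode(0x82, 0xA0))); // mapped pair
        System.out.println(Integer.toHexString(decode(0x82, 0x7F))); // illegal trail byte
    }
}
```

The trade-off matches the one described above: the table costs storage for every covered slot, but each decode is a bounds check plus two array reads.
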
> The webrev appears to contain lots of files (80+), but that number includes the
> removed old implementation and the tests I put in to guarantee identical
> de/encoding results from the old and new implementations (those OLD... test
> cases). The change is actually not that big :-) So please help review, so that
> I can put this multi-year effort to rest.
>
> -Sherman
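
The old-versus-new comparison tests mentioned above could look roughly like the following sketch. It is not taken from the webrev: the class `DecodeCompareSketch` and the method `countDecodeMismatches` are hypothetical names, and since the old implementations are not available here, the method simply takes any two `Charset` instances and decodes every one-byte and two-byte input with both.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only: exhaustively compares how two charsets decode all
// one-byte and two-byte inputs. Names are hypothetical, not from the webrev.
final class DecodeCompareSketch {

    // Returns the number of inputs on which the two charsets decode differently.
    // new String(byte[], Charset) substitutes a replacement character for
    // malformed or unmappable input, so differences in error handling also count.
    static int countDecodeMismatches(Charset a, Charset b) {
        int mismatches = 0;
        for (int b1 = 0; b1 < 256; b1++) {
            byte[] one = { (byte) b1 };
            if (!new String(one, a).equals(new String(one, b))) {
                mismatches++;
            }
            for (int b2 = 0; b2 < 256; b2++) {
                byte[] two = { (byte) b1, (byte) b2 };
                if (!new String(two, a).equals(new String(two, b))) {
                    mismatches++;
                }
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        // A charset compared with itself must never disagree.
        System.out.println(countDecodeMismatches(
                StandardCharsets.UTF_8, StandardCharsets.UTF_8));
        // Two genuinely different charsets should report mismatches.
        System.out.println(countDecodeMismatches(
                StandardCharsets.ISO_8859_1, StandardCharsets.US_ASCII));
    }
}
```

In a real regression test, the two arguments would be the old and the new implementation of the same charset, and any non-zero count would be a failure.
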