Codereview request for 6653797: Reimplement JDK charset repository charsets.jar

Mon Jul 16 16:59:13 UTC 2012

On 7/16/2012 9:30 AM, Ulf Zibis wrote:
> Hi Sherman,
>
> as I just said for 7183053, I can't look into the details at the moment, 
> as I do not have a suitable environment installed.
>
> Just one comment: I think it should be possible to share the mapping 
> data partly across charsets, so that charsets.jar would shrink even 
> further?

Yes, it might be desirable to share some of the mappings, especially among
those variants. But as I suggested at the very beginning of the project, the
priority for now is to move all the charsets to the new mapping-based,
build-time generated implementation. That opens the door for new
optimizations: any improvement to those base classes and the "generator"
tools (to share the mappings, for example) will then be shared by all the
sub-classes/classes. While it might be ideal to achieve all the goals in one
shot, our resource constraints really do not allow me to spend most of my
time on it (regenerating the mappings really takes time, and I have to test
from various angles to make sure it does not break anything or miss any
corner case). This is more like a side project for now (sure, I do have a JEP
for it, but...), and I just found two "spare" weeks to push these two RFEs
out. I might have more time for charsets later, around the end of the JDK8
development stage.

-Sherman
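
To make the "share the mappings among variants" idea concrete, here is a
minimal sketch, not anything from the webrev or the generator tools. It
assumes a variant can be described as one shared base table plus a small
per-variant override map. All class and method names are hypothetical. The
example data relies only on the well-known fact that JIS X 0201 Roman differs
from ASCII at 0x5C (yen sign) and 0x7E (overline).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in the JDK.
final class SharedMappingSketch {
    static final char UNMAPPABLE = '\uFFFD';

    /** A variant decoder: look in the small override map, then the shared base table. */
    static final class VariantDecoder {
        private final char[] base;                       // shared between variants, never copied
        private final Map<Integer, Character> overrides; // only the entries that differ

        VariantDecoder(char[] base, Map<Integer, Character> overrides) {
            this.base = base;
            this.overrides = overrides;
        }

        char decode(int index) {
            Character c = overrides.get(index);
            if (c != null) {
                return c;
            }
            return (index >= 0 && index < base.length) ? base[index] : UNMAPPABLE;
        }
    }

    /** Base table: 7-bit ASCII, identity mapping. */
    static char[] asciiBase() {
        char[] t = new char[128];
        for (int i = 0; i < t.length; i++) {
            t[i] = (char) i;
        }
        return t;
    }

    /** Plain ASCII: the base table with no overrides. */
    static VariantDecoder ascii(char[] base) {
        return new VariantDecoder(base, new HashMap<>());
    }

    /** JIS X 0201 Roman: same table, two overridden cells. */
    static VariantDecoder jisRoman(char[] base) {
        Map<Integer, Character> delta = new HashMap<>();
        delta.put(0x5C, '\u00A5'); // YEN SIGN instead of backslash
        delta.put(0x7E, '\u203E'); // OVERLINE instead of tilde
        return new VariantDecoder(base, delta);
    }

    public static void main(String[] args) {
        char[] base = asciiBase();
        System.out.printf("ascii 0x5C -> U+%04X%n", (int) ascii(base).decode(0x5C));
        System.out.printf("jis   0x5C -> U+%04X%n", (int) jisRoman(base).decode(0x5C));
    }
}
```

The trade-off in such a scheme is an extra lookup per character in exchange
for storing each shared cell only once. Whether that is worth it for the real
charsets is exactly the kind of question left for later optimization.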

>
> -Ulf
>
>
> Am 16.07.2012 00:12, schrieb Xueming Shen:
>> Hi
>>
>> This changeset includes the migration of our JIS0201/0208/0212 based
>> single/double-byte charsets to the new mapping-based implementation. This
>> is the left-over of the effort we put into JDK7 to re-implement most of our
>> charsets in charsets.jar to (1) have better performance, (2) have a smaller
>> storage footprint and (3) ease the maintenance burden.
>>
>> http://cr.openjdk.java.net/~sherman/6653797/webrev/
>>
>> Notes of the implementation:
>>
>> (1) jis0201/0208/0212 and their variants are now generated from the
>> mapping tables at build time. (See the new *.map, *.nr and *.c2b tables.)
>>
>> (2) EUC_JP/LINUX_OPEN, SJIS, PCK, ISO2022_JP and its variants are now
>> using these new jis0201/0208/0212 charsets.
>>
>> (3) Those in red (in the webrev) are the old implementation. Since no
>> charset uses them anymore, I removed them from the repository.
>>
>> (4) There are two approaches for PCK and SJIS. PCK.java_v1 and
>> SJIS.java_v1 are the ones that follow the old implementation, which
>> decodes/encodes based on the jis0201/0208 (and the variants) mapping via
>> Ken's algorithm. This is known to be slow and buggy (the algorithm does not
>> take care of illegal sjis code points, see #6653797 and
>> http://cr.openjdk.java.net/~sherman/6653797/Sjis2Jis.java).
>> So in this changeset, I generated the direct mapping tables for sjis and
>> pck and use the "general" DoubleByte base class for these two charsets.
>> This results in much faster decoding/encoding and correct mapping for all
>> code points. The downside of this approach is that it adds about 50K
>> (uncompressed size) to charsets.jar. But given that this change actually
>> removes about 300K from rt.jar, we still get a net reduction of about 250K,
>> so I decided to go with this approach for better performance.
>>
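
The problem described in note (4) can be illustrated with a small sketch. It
assumes that "Ken's algorithm" refers to the commonly published arithmetic
Shift_JIS to JIS X 0208 conversion. The code below is that textbook form, not
the JDK source or the linked Sjis2Jis.java, and the class and method names
are hypothetical. The unchecked version happily converts the illegal trail
byte 0x7F and produces the same JIS code as the legal trail byte 0x80. A
direct mapping table avoids this, because illegal byte pairs simply have no
entry.

```java
// Hypothetical illustration; class and method names are not from the JDK sources.
final class SjisToJisSketch {

    /** Arithmetic Shift_JIS -> JIS X 0208 conversion with no validation at all. */
    static int toJisUnchecked(int s1, int s2) {
        int adjust = (s2 < 0x9F) ? 1 : 0;           // first or second row of the pair
        int rowOffset = (s1 < 0xA0) ? 0x70 : 0xB0;  // lead byte range 0x81-0x9F vs 0xE0-0xEF
        int cellOffset = (adjust == 1) ? ((s2 > 0x7F) ? 0x20 : 0x1F) : 0x7E;
        int j1 = ((s1 - rowOffset) << 1) - adjust;
        int j2 = s2 - cellOffset;
        return (j1 << 8) | j2;
    }

    /** Lead/trail byte ranges for the JIS X 0208 part of Shift_JIS. */
    static boolean isLegal(int s1, int s2) {
        boolean leadOk = (s1 >= 0x81 && s1 <= 0x9F) || (s1 >= 0xE0 && s1 <= 0xEF);
        boolean trailOk = (s2 >= 0x40 && s2 <= 0x7E) || (s2 >= 0x80 && s2 <= 0xFC);
        return leadOk && trailOk;
    }

    /** Same conversion, but returns -1 for byte pairs outside the legal ranges. */
    static int toJisChecked(int s1, int s2) {
        return isLegal(s1, s2) ? toJisUnchecked(s1, s2) : -1;
    }

    public static void main(String[] args) {
        System.out.printf("8140 -> %04X%n", toJisUnchecked(0x81, 0x40));
        System.out.printf("8180 -> %04X%n", toJisUnchecked(0x81, 0x80));
        System.out.printf("817F -> %04X (illegal input, unchecked)%n",
                toJisUnchecked(0x81, 0x7F));
        System.out.println("817F checked -> " + toJisChecked(0x81, 0x7F));
    }
}
```

Passing the range check only means the byte pair is well-formed. Whether the
resulting JIS code is actually assigned still has to be decided by the
mapping table.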
>> It appears to be a lot of files (80+) in the webrev, but that number
>> includes the removed old implementation and the tests I put in to guarantee
>> identical de/encoding results from the old and new implementations (those
>> OLD... test cases). The change is actually not that big :-) So please help
>> review. I can then put this multi-year effort to rest.
>>
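
A parity check of the kind mentioned above might look roughly like the sketch
below. It is not one of the OLD... test cases from the webrev. The class and
method names are hypothetical, and it only uses the public
java.nio.charset API. It decodes every one- and two-byte input strictly with
both charsets and counts the inputs where the results differ, treating a
coding error as a result in its own right.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Objects;

// Hypothetical illustration; not part of the JDK test suite.
final class CharsetParityCheck {

    /** Decodes strictly; returns null when the input is malformed or unmappable. */
    static String decodeOrNull(Charset cs, byte[] bytes) {
        CharsetDecoder dec = cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return dec.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    /** True when both charsets decode the input to the same text, or both reject it. */
    static boolean sameResult(Charset a, Charset b, byte[] bytes) {
        return Objects.equals(decodeOrNull(a, bytes), decodeOrNull(b, bytes));
    }

    /** Counts the one- and two-byte inputs on which the two charsets disagree. */
    static int countDecodeMismatches(Charset oldCs, Charset newCs) {
        int mismatches = 0;
        for (int b1 = 0; b1 < 256; b1++) {
            if (!sameResult(oldCs, newCs, new byte[] {(byte) b1})) {
                mismatches++;
            }
            for (int b2 = 0; b2 < 256; b2++) {
                if (!sameResult(oldCs, newCs, new byte[] {(byte) b1, (byte) b2})) {
                    mismatches++;
                }
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        // Stand-in charsets; a real check would pass the old and new implementations.
        int n = countDecodeMismatches(StandardCharsets.US_ASCII, StandardCharsets.ISO_8859_1);
        System.out.println("mismatching inputs: " + n);
    }
}
```

An encoding check would follow the same pattern with CharsetEncoder, iterating
over char values instead of byte pairs.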
>> -Sherman