Fast String...

Wed Mar 25 14:12:36 UTC 2009

On 25.03.2009 04:41, Xueming Shen wrote:
> Ulf Zibis wrote:
>> On 25.03.2009 02:13, Xueming Shen wrote:
>>> Reducing size is a good thing. That was my primary goal: to get
>>> charsets.jar under 2M, which is doable if we can put the data
>>> outside the class file, and that is what I have done... The concern
>>> is the startup time. One alternative is to use this approach only for
>>> those charsets that don't care about startup, such as the IBM
>>> charsets and the ones on Solaris :-)
>>>
>>> Comparing data stored in the class file with data stored outside of
>>> it: either way you can still eliminate the c2b data (generated from
>>> b2c). The difference is that the String constants stored in UTF-8
>>> probably take 3 bytes, but 2 bytes in a ".dat" file... about 15%.
>>
>> Your generated charset classes average 2 KB; my data files average
>> 250 bytes (including aliases + historicalName, so you should
>> subtract 50..200 bytes for comparison).
>> See: 
>> https://java-nio-charset-enhanced.dev.java.net/source/browse/java-nio-charset-enhanced/trunk/releases/nio_charset_M4.jar?rev=682&view=log 
>>
>>
> That's unfair :-) You put me totally on the defensive :-) Martin can
> testify that I started selling this idea of extracting all
> mapping data into a dat file, with only one single base class to
> load the dat and construct the charset class
> on the fly, 2 years ago :-) So I know how small it can be.
>
> My 15% figure is not for singlebyte; I'm talking about doublebyte,

Ah, ok. This makes it clearer.

I totally agree with you: saving bytes only in singlebyte charsets isn't 
worth much. But it was a good exercise for me to find out the relevant 
techniques.
You may wonder how I can serve a coder for 256 2-byte chars with 
a 69-byte data file (e.g. koi8-u.dat), which also includes its numerous 
names.
The trick is that I share map data between charsets if they are 
similar enough. This is done by my sun.nio.cs.CharsetStream class.
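
To illustrate the sharing idea in general terms: a charset that is close to 
another one can be stored as a short list of differences against a shared base 
table. The following is only a minimal sketch under that assumption. The class 
name SharedMapSketch, the (index, char) delta format, and the toy data are 
hypothetical and are not taken from sun.nio.cs.CharsetStream or from any real 
.dat file.

```java
import java.util.Arrays;

// Hypothetical sketch: names and delta format are assumptions for illustration.
final class SharedMapSketch {

    // Record every position where 'derived' differs from 'base' as an
    // (index, replacement char) pair, flattened into one char array.
    // Both tables are assumed to have the same length.
    static char[] diff(char[] base, char[] derived) {
        StringBuilder delta = new StringBuilder();
        for (int i = 0; i < base.length; i++) {
            if (base[i] != derived[i]) {
                delta.append((char) i).append(derived[i]);
            }
        }
        return delta.toString().toCharArray();
    }

    // Rebuild the derived table from the shared base plus its delta.
    static char[] apply(char[] base, char[] delta) {
        char[] table = base.clone();
        for (int i = 0; i < delta.length; i += 2) {
            table[delta[i]] = delta[i + 1];
        }
        return table;
    }

    public static void main(String[] args) {
        // Toy data, not a real charset: identity map with two changed slots.
        char[] base = new char[256];
        for (int i = 0; i < base.length; i++) {
            base[i] = (char) i;
        }
        char[] derived = base.clone();
        derived[0xA4] = '\u0454';
        derived[0xB4] = '\u0404';

        char[] delta = diff(base, derived);
        System.out.println("delta chars: " + delta.length);
        System.out.println("round trip ok: "
                + Arrays.equals(apply(base, delta), derived));
    }
}
```

With such a scheme the stored size of a derived charset grows with the number 
of differing entries rather than with the table size, which is one plausible 
way a data file can stay far below 512 bytes.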

I would be surprised if there weren't heavy concordance between doublebyte 
maps that could be shared. I have designed the CharsetStream class to be 
extensible for doublebyte requirements. Additionally, I think it should 
be possible to partly share mapping tables in memory, as the doublebyte 
b2c maps in general seem to be sliced.
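
As a rough sketch of what sharing sliced tables in memory could look like: if a 
doublebyte b2c map is held as one char[] slice per lead byte, identical slices 
from different charsets can be pooled so that only one copy stays in memory. 
The class SlicePoolSketch, its methods, and the per-lead-byte layout are 
assumptions made for this illustration; they do not describe the existing JDK 
decoders or CharsetStream.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: pools identical b2c slices across charsets.
final class SlicePoolSketch {

    // Slice content (as a String key) mapped to the one shared instance.
    private final Map<String, char[]> pool = new HashMap<>();

    // Return the pooled instance with the same content, or register this one.
    char[] intern(char[] slice) {
        char[] existing = pool.putIfAbsent(new String(slice), slice);
        return existing != null ? existing : slice;
    }

    // Replace every non-null slice of a table by its pooled instance.
    char[][] internAll(char[][] table) {
        char[][] shared = new char[table.length][];
        for (int lead = 0; lead < table.length; lead++) {
            shared[lead] = table[lead] == null ? null : intern(table[lead]);
        }
        return shared;
    }

    // Look up a (lead, trail) byte pair; a missing slice means unmappable.
    static char decode(char[][] b2c, int lead, int trail) {
        char[] slice = b2c[lead];
        return slice == null ? '\uFFFD' : slice[trail];
    }
}
```

Pooled slices have to be treated as read-only, since a write through one 
charset's table would be visible in every charset that shares the slice.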

The big problem is the loss in startup time, which to me seems to be 
caused by the dilly-dallying resource stream.

-Ulf

> Let me explain why I don't really care about the singlebyte size.
> We have probably 100 singlebyte charsets in our repository; assuming
> each takes 2k, that's a total of 200k of the 6M+ (in stored mode)
> size of charsets.jar. Even if you could reduce the size to 0, it's 5% of
> the total size. Yes, each bit counts, but sometimes you have to
> balance the advantages and disadvantages, so if we have to trade
> startup for a 5% reduction of the 6M charsets.jar, I would
> give it a second thought. But it might be a totally different story for
> doublebyte: if you can cut the 6M in half (that was my goal)
> with a relatively small startup regression, it might be something worth
> doing.
>
> Sherman
>
>