Unicode script support in Regex and Character class

Tue Apr 27 04:25:28 UTC 2010

Ulf Zibis wrote:
> On 24.04.2010 01:09, Xueming Shen wrote:
>>
>> I changed the data file "format" a bit, so now the overall uniName.dat
>> is less than 88k (the last version was 122+k), but
>> I can no longer use cpLen as the capacity for the hashmap. I'm now
>> using a hardcoded 20000 for 5.2.
>
> Again, is 88k the compressed or the uncompressed size?

Yes, it's the size of the compressed data. Your smart "save one more byte"
suggestion would save 400+ bytes, a tiny 0.5%, unfortunately :-)

>
>>> -- Is it faster to first copy the whole data into a byte[] and then
>>> use ByteBuffer.getInt etc., compared with using DataInputStream
>>> methods directly?
The current impl uses neither ByteBuffer nor DataInputStream, so there is
nothing to compare here.
Yes, using DataInputStream would definitely make the code look better (no
more of those "ugly" shifts), but it would also slow things down a little,
since it adds one more layer. But speed may not really be a concern here.
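
For illustration only, here is a minimal sketch of the two decoding styles
being compared, assuming the data holds big-endian 32-bit ints. The class and
method names are hypothetical; this is not the actual loader code.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper class, not the real uniName.dat reader.
class IntDecodeSketch {

    // Decode a big-endian 32-bit int at offset 'off' using explicit shifts.
    static int readIntByShifts(byte[] buf, int off) {
        return ((buf[off]     & 0xff) << 24)
             | ((buf[off + 1] & 0xff) << 16)
             | ((buf[off + 2] & 0xff) << 8)
             |  (buf[off + 3] & 0xff);
    }

    // Decode the same value through DataInputStream, which also reads
    // big-endian but goes through one more stream layer.
    static int readIntByStream(byte[] buf, int off) {
        try (DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(buf, off, buf.length - off))) {
            return in.readInt();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = {0x00, 0x00, 0x4e, 0x20};
        System.out.println(readIntByShifts(data, 0));
        System.out.println(readIntByStream(data, 0));
    }
}
```

Both methods return the same value for the same four bytes; the stream
version is easier to read, the shift version avoids the extra layer.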

>>> -- You could create a very long String with the whole data and then 
>>> use subString for the individual strings which could share the same 
>>> backing char[].
>
The disadvantage of using one big buffer String to hold everything and then
having the individual names substring-ed from it is that it might simply
break the softreference logic here. The big char[] will never be gc-ed as
long as a single name object (substring-ed from it) is still walking around
in the system somewhere.
I don't think the vm/gc is that smart, is it?

But the substring approach would definitely be faster, given the burden of
creating a String from bytes (we put in the optimization
earlier, so this operation should be faster now compared to 6u).
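
To make the softreference concern concrete, here is a rough sketch of a
softly referenced name table in which every name is created from its own
bytes. The class, method, and sample entries are hypothetical stand-ins, not
the real implementation. The point is that a caller holding one name keeps
only that name alive, whereas on a JDK where substring shares the backing
char[], a single surviving name would pin the whole buffer.

```java
import java.lang.ref.SoftReference;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; names and sample data are illustrative only.
class NameCacheSketch {

    // The table is only softly reachable, so the GC may reclaim it
    // under memory pressure and it gets rebuilt on the next lookup.
    private static SoftReference<Map<Integer, String>> ref;

    // Stand-in for decoding the data file: each name is an independent
    // String built from bytes, not a view into one shared buffer.
    private static Map<Integer, String> load() {
        Map<Integer, String> m = new HashMap<>();
        m.put(0x41, fromBytes("LATIN CAPITAL LETTER A"));
        m.put(0x42, fromBytes("LATIN CAPITAL LETTER B"));
        return m;
    }

    // Simulates the "create a String from bytes" cost mentioned above.
    private static String fromBytes(String s) {
        byte[] raw = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Returns the name for a code point, or null if the table has none.
    static synchronized String getName(int cp) {
        Map<Integer, String> m = (ref == null) ? null : ref.get();
        if (m == null) {
            m = load();
            ref = new SoftReference<>(m);
        }
        return m.get(cp);
    }
}
```

The trade-off is the one described above: independent Strings cost more to
create, but they do not tie the lifetime of the big buffer to any one name.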

-Sherman