Unicode script support in Regex and Character class

Ulf Zibis Ulf.Zibis at gmx.de
Thu Apr 29 18:07:05 UTC 2010


On 24.04.2010 01:09, Xueming Shen wrote:
>
> Yes, the final table takes about 500k. We might consider using a 
> weakref or something if memory is really
> a concern. But the table will get initialized only if you invoke 
> Character.getName(),

Sherman, how did you compute that value? My count:
- A Map.Entry object takes 24 bytes (40 on a 64-bit machine)
- An Integer object for the key takes 12 bytes (20 on a 64-bit machine)
- A String object takes 36 + 2*length bytes, so for an average character 
name length of 24:
       84 bytes (98 on a 64-bit machine)
--> one character name in a HashMap would take, including bucket overhead, 
~135 bytes (~170 on a 64-bit machine)
--> 20,000 character names would take ~2.7 MByte (~3.4 on a 64-bit machine)
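
The arithmetic above can be replayed with a small sketch. The class and method names are hypothetical, and the per-object sizes are simply the 32-bit figures quoted in this message, not values measured on any particular VM:

```java
// Hypothetical helper that replays the per-entry estimate from the message.
final class NameMapEstimate {

    // Raw bytes for one HashMap entry: Map.Entry + Integer key + String value.
    // The String is modelled as a fixed base size plus 2 bytes per char.
    static long perEntryBytes(int entryBytes, int integerBytes,
                              int stringBaseBytes, int avgNameLength) {
        return entryBytes + integerBytes + stringBaseBytes + 2L * avgNameLength;
    }

    // Total footprint for a given number of names at a given per-name cost.
    static long totalBytes(long perNameBytes, int names) {
        return perNameBytes * names;
    }

    public static void main(String[] args) {
        // 24 + 12 + (36 + 2*24) bytes, before any bucket overhead is added.
        System.out.println("raw bytes per entry: " + perEntryBytes(24, 12, 36, 24));
        // The message rounds up to ~135 bytes per name to cover bucket overhead.
        System.out.println("20,000 names at 135 bytes: " + totalBytes(135, 20_000));
    }
}
```

The gap between the raw per-entry figure and the ~135 bytes used above is the bucket overhead, which this sketch does not try to model.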


See my new version in the attachment.

I estimate:
- for byte[] names: 480,000 bytes
- for int[][] indexes:
-- base array size with 4353 elements: 17,420 bytes
-- one int[] index per block with an average length of 32: 140 bytes
-- sum: 626,700 bytes
overall sum: 1,106,700 bytes (still quite a lot)
If the block offset were smaller than 256, I guess it would be even 
less (at the cost of slightly decreased performance).

- Initializing the indexes array should be *much* faster than filling 
the hash map.
- Retrieving an index should be slightly faster or equivalent, but the 
instantiation of one new String object must be added.
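
To illustrate the kind of structure meant here, below is a minimal sketch of a byte[] name pool with a two-level int[][] index. It is not the attached CharacterName2.java; the class name, the packing of offset and length into one int, and the fixed block size of 256 code points are all assumptions made for the example:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: all names live in one byte[]; a two-level int[][]
// maps a code point to (offset, length) inside that array.
final class CompactNameTable {
    private static final int BLOCK_SHIFT = 8;    // assumed: 256 code points per block
    private static final int BLOCK_MASK = 0xFF;

    private final byte[] names;     // concatenated ASCII names
    private final int[][] indexes;  // indexes[block][i] = (offset << 8) | length, 0 = no name

    private CompactNameTable(byte[] names, int[][] indexes) {
        this.names = names;
        this.indexes = indexes;
    }

    // Builds the table from parallel arrays of code points and names.
    static CompactNameTable build(int[] codePoints, String[] charNames) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int[][] idx = new int[(0x10FFFF >>> BLOCK_SHIFT) + 1][];
        for (int k = 0; k < codePoints.length; k++) {
            byte[] raw = charNames[k].getBytes(StandardCharsets.US_ASCII);
            int block = codePoints[k] >>> BLOCK_SHIFT;
            if (idx[block] == null) {
                idx[block] = new int[BLOCK_MASK + 1];  // allocated only for used blocks
            }
            idx[block][codePoints[k] & BLOCK_MASK] = (out.size() << 8) | raw.length;
            out.write(raw, 0, raw.length);
        }
        return new CompactNameTable(out.toByteArray(), idx);
    }

    // Returns the name, or null if none is stored. Each hit creates a new String.
    String getName(int codePoint) {
        int block = codePoint >>> BLOCK_SHIFT;
        if (block >= indexes.length || indexes[block] == null) {
            return null;
        }
        int packed = indexes[block][codePoint & BLOCK_MASK];
        if (packed == 0) {
            return null;
        }
        return new String(names, packed >>> 8, packed & 0xFF, StandardCharsets.US_ASCII);
    }
}
```

Building the index is a single pass of array writes with no per-name wrapper objects, which is where the faster initialization would come from; the price is the String created on every lookup.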

We could go further:
- separate caches (and data files) for the 17 Unicode planes
- calculate short 1/2-byte keys for textual words and frequent phrases. 
I estimate there are 1000..4000 different words and 100..300 redundant 
phrases in the data.
Are you interested in that?
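
A rough sketch of the word-key idea, under simplifying assumptions: every key is a 2-byte short (the 1-byte variant and the phrase keys are left out), words are split on single spaces, and the class name WordDictionary is hypothetical:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: each distinct word gets a short key, and a
// character name is stored as the sequence of its word keys.
final class WordDictionary {
    private final Map<String, Integer> ids = new HashMap<>();
    private final List<String> words = new ArrayList<>();

    // Encodes a name as word keys, registering unseen words on the fly.
    short[] encode(String name) {
        String[] parts = name.split(" ");
        short[] keys = new short[parts.length];
        for (int i = 0; i < parts.length; i++) {
            Integer id = ids.get(parts[i]);
            if (id == null) {
                id = words.size();
                if (id > Short.MAX_VALUE) {
                    throw new IllegalStateException("too many distinct words for 2-byte keys");
                }
                ids.put(parts[i], id);
                words.add(parts[i]);
            }
            keys[i] = id.shortValue();
        }
        return keys;
    }

    // Rebuilds the original name from its word keys.
    String decode(short[] keys) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < keys.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(words.get(keys[i]));
        }
        return sb.toString();
    }

    int distinctWords() {
        return words.size();
    }
}
```

With 2-byte keys, a four-word name such as "LATIN CAPITAL LETTER A" shrinks from 22 bytes to 8, plus the one-time cost of the word list.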

We could add (Attention: CCC change) a cacheCharacterNames(boolean 
yesNo) method to serve users who need this functionality extensively.
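
One way such a switch could behave, sketched with a hypothetical holder class: by default the table is only softly reachable, so the VM may reclaim it under memory pressure, and cacheCharacterNames(true) pins it with a strong reference. This is an illustration only, not an existing JDK API; the generic holder and its loader parameter are assumptions:

```java
import java.lang.ref.SoftReference;
import java.util.function.Supplier;

// Hypothetical holder: softly cached by default, strongly cached on request.
final class NameTableHolder<T> {
    private final Supplier<T> loader;  // assumed: builds the name table on demand
    private SoftReference<T> soft;
    private T strong;
    private boolean pinned;

    NameTableHolder(Supplier<T> loader) {
        this.loader = loader;
    }

    // true: keep the table strongly reachable once loaded; false: let the GC decide.
    synchronized void cacheCharacterNames(boolean yesNo) {
        pinned = yesNo;
        if (!yesNo) {
            strong = null;
        }
    }

    // Returns the table, reloading it if it has been reclaimed.
    synchronized T get() {
        T table = strong;
        if (table == null && soft != null) {
            table = soft.get();
        }
        if (table == null) {
            table = loader.get();
            soft = new SoftReference<>(table);
        }
        if (pinned) {
            strong = table;
        }
        return table;
    }
}
```

Users who call the name lookup heavily would pin the table once; everyone else pays for it only while memory is plentiful.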

-Ulf

-------------- next part --------------
A non-text attachment was scrubbed...
Name: CharacterName2.java
Type: java/*
Size: 4401 bytes
Desc: not available
URL: <http://mail.openjdk.java.net/pipermail/core-libs-dev/attachments/20100429/946204eb/CharacterName2.java>

