Unicode script support in Regex and Character class
Ulf Zibis
Ulf.Zibis at gmx.de
Thu Apr 29 18:07:05 UTC 2010
On 24.04.2010 01:09, Xueming Shen wrote:
>
> Yes, the final table takes about 500k, we might consider to use a
> weakref or something, if memory really
> a concern. But the table will get initialized only if you invoke
> Character.getName(),
Sherman, how did you compute that value? By my count:
- A Map.Entry object takes 24 bytes (40 on a 64-bit machine)
- An Integer object for the key takes 12 bytes (20 on a 64-bit machine)
- A String object takes 36 + 2*length bytes, so for an average character name length of 24: 84 bytes (98 on a 64-bit machine)
--> one character name in a HashMap would take, including bucket overhead, ~135 bytes (~170 on a 64-bit machine)
--> 20,000 character names would take ~2.7 MByte (~3.4 on a 64-bit machine)
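The arithmetic behind this estimate can be replayed with a minimal sketch. The per-object sizes are the 32-bit figures quoted above, not measured values; the class name and the bucket overhead of 15 bytes per entry (the difference between ~135 and the 120-byte sum of the three objects) are assumptions made for illustration.

```java
final class HashMapNameEstimate {
    // Per-object sizes as quoted in the message (32-bit VM); not measured here.
    static final int ENTRY_BYTES = 24;
    static final int INTEGER_KEY_BYTES = 12;
    static final int STRING_BASE_BYTES = 36;

    // A String costs a fixed base plus 2 bytes per char.
    static int stringBytes(int length) {
        return STRING_BASE_BYTES + 2 * length;
    }

    // bucketOverhead is an assumed per-entry share of the bucket array.
    static int perNameBytes(int avgNameLength, int bucketOverhead) {
        return ENTRY_BYTES + INTEGER_KEY_BYTES + stringBytes(avgNameLength) + bucketOverhead;
    }

    static long totalBytes(int nameCount, int avgNameLength, int bucketOverhead) {
        return (long) nameCount * perNameBytes(avgNameLength, bucketOverhead);
    }

    public static void main(String[] args) {
        System.out.println(stringBytes(24));            // 84
        System.out.println(perNameBytes(24, 15));       // 135
        System.out.println(totalBytes(20_000, 24, 15)); // 2700000
    }
}
```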
See my new version in the attachment.
I estimate:
- for byte[] names: 480,000 bytes
- for int[][] indexes:
-- base array with 4353 elements: 17,420 bytes
-- one int[] index per block, with an average length of 32: 140 bytes
-- sum: 626,700 bytes
overall sum: 1,106,700 bytes (pretty enough)
If the block offset were smaller than 256, I guess it would be even less (at the cost of slightly decreased performance).
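For comparison, the array-based figures can be added up the same way. This is only a sketch of the arithmetic: the constants are the estimates from the message and the class name is hypothetical. Multiplying out exactly gives 626,840 bytes for the indexes, which the message rounds to about 626,700.

```java
final class ArrayNameEstimate {
    // Figures taken from the estimate in the message; not measured here.
    static final long NAMES_BYTES = 480_000;   // byte[] names
    static final long BASE_ARRAY_BYTES = 17_420; // int[][] base array, 4353 elements
    static final int BLOCKS = 4353;
    static final long BLOCK_INDEX_BYTES = 140;  // one int[] of average length 32

    static long indexesBytes() {
        return BASE_ARRAY_BYTES + BLOCKS * BLOCK_INDEX_BYTES;
    }

    static long totalBytes() {
        return NAMES_BYTES + indexesBytes();
    }

    public static void main(String[] args) {
        System.out.println(indexesBytes()); // 626840
        System.out.println(totalBytes());   // 1106840
    }
}
```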
- Initializing the indexes array should be *much* faster than filling the hash map.
- Retrieving an index should be slightly faster or equivalent, but the instantiation of one new String object must be added.
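To make the trade-off concrete, here is a minimal sketch of a block-indexed lookup over a byte[] of names and an int[][] of per-block indexes. It is not the code from the attachment: the class name, the fixed block size of 256 code points, and the packing of offset and length into one int are all assumptions made for illustration. It shows why a lookup is two array reads plus one new String.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

final class BlockIndexedNames {
    private static final int BLOCK_SHIFT = 8;   // assumed: 256 code points per block
    private static final int BLOCK_MASK = 0xFF;

    private final byte[] names;     // all names concatenated as ASCII bytes
    private final int[][] indexes;  // per block: (offset << 8) | length, 0 = no name

    // Assumes every name is non-empty ASCII and shorter than 256 bytes.
    BlockIndexedNames(Map<Integer, String> source) {
        indexes = new int[(Character.MAX_CODE_POINT >>> BLOCK_SHIFT) + 1][];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Map.Entry<Integer, String> e : new TreeMap<>(source).entrySet()) {
            int cp = e.getKey();
            byte[] bytes = e.getValue().getBytes(StandardCharsets.US_ASCII);
            int block = cp >>> BLOCK_SHIFT;
            if (indexes[block] == null) {
                indexes[block] = new int[BLOCK_MASK + 1]; // allocated only for used blocks
            }
            indexes[block][cp & BLOCK_MASK] = (out.size() << 8) | bytes.length;
            out.write(bytes, 0, bytes.length);
        }
        names = out.toByteArray();
    }

    // Returns the name, or null if the code point has none in this table.
    String getName(int codePoint) {
        if (codePoint < 0 || codePoint > Character.MAX_CODE_POINT) return null;
        int[] block = indexes[codePoint >>> BLOCK_SHIFT];
        if (block == null) return null;
        int packed = block[codePoint & BLOCK_MASK];
        int length = packed & 0xFF;
        if (length == 0) return null;
        // This is the one String instantiation per lookup mentioned above.
        return new String(names, packed >>> 8, length, StandardCharsets.US_ASCII);
    }
}
```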
We could go further:
- separate caches (and data files) for the 17 Unicode planes
- calculate short 1- or 2-byte keys for words and frequent phrases. I estimate there are 1000..4000 different words and 100..300 redundant phrases in the data.
Are you interested in that?
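The word-key idea could look roughly like the following sketch. Everything here is hypothetical (class name, key layout, frequency ordering): words are ranked by frequency, the 128 most frequent get a 1-byte key, and the rest get a 2-byte key whose first byte has the high bit set. Phrase keys are left out to keep it short.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class WordKeyCodec {
    private final List<String> words = new ArrayList<>();      // key -> word
    private final Map<String, Integer> keys = new HashMap<>(); // word -> key

    // Assigns keys in descending frequency so common words get 1-byte keys.
    WordKeyCodec(List<String> names) {
        Map<String, Integer> counts = new HashMap<>();
        for (String name : names) {
            for (String w : name.split(" ")) counts.merge(w, 1, Integer::sum);
        }
        List<String> sorted = new ArrayList<>(counts.keySet());
        sorted.sort((a, b) -> {
            int c = counts.get(b) - counts.get(a);
            return c != 0 ? c : a.compareTo(b); // ties broken alphabetically
        });
        if (sorted.size() > 0x8000) throw new IllegalArgumentException("too many distinct words");
        for (String w : sorted) {
            keys.put(w, words.size());
            words.add(w);
        }
    }

    // Encodes a name (words separated by single spaces) as a sequence of keys.
    byte[] encode(String name) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String w : name.split(" ")) {
            Integer key = keys.get(w);
            if (key == null) throw new IllegalArgumentException("unknown word: " + w);
            if (key < 0x80) {
                out.write(key);                // 1-byte key
            } else {
                out.write(0x80 | (key >>> 8)); // 2-byte key, high bit marks it
                out.write(key & 0xFF);
            }
        }
        return out.toByteArray();
    }

    String decode(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < data.length; i++) {
            int b = data[i] & 0xFF;
            int key = b < 0x80 ? b : ((b & 0x7F) << 8) | (data[++i] & 0xFF);
            if (sb.length() > 0) sb.append(' ');
            sb.append(words.get(key));
        }
        return sb.toString();
    }
}
```

With such a scheme a four-word name made of frequent words would shrink from roughly 22 bytes to 4, at the cost of a dictionary lookup per word when the name is rebuilt.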
We could add (attention: CCC change) a cacheCharacterNames(boolean yesNo) method to serve users who need this functionality heavily.
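One way such a switch could behave, as a rough sketch only: no such method exists in java.lang.Character, and the holder class, the SoftReference default, and the stand-in load() below are all assumptions. By default the table is only softly reachable, so the VM may reclaim it under memory pressure; cacheCharacterNames(true) pins it with a strong reference.

```java
import java.lang.ref.SoftReference;

final class CharacterNameCache {
    private static SoftReference<byte[]> softTable = new SoftReference<>(null);
    private static byte[] pinnedTable; // strong reference, set only while caching is on
    static int loads;                  // counts how often the table was (re)built

    // Stand-in for reading the real name data; hypothetical.
    private static byte[] load() {
        loads++;
        return new byte[16];
    }

    // true: keep the table strongly reachable; false: allow the VM to reclaim it.
    static synchronized void cacheCharacterNames(boolean yesNo) {
        pinnedTable = yesNo ? table() : null;
    }

    static synchronized byte[] table() {
        if (pinnedTable != null) return pinnedTable;
        byte[] t = softTable.get();
        if (t == null) {
            t = load(); // reclaimed or never loaded: build it again
            softTable = new SoftReference<>(t);
        }
        return t;
    }
}
```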
-Ulf
-------------- next part --------------
A non-text attachment was scrubbed...
Name: CharacterName2.java
Type: java/*
Size: 4401 bytes
Desc: not available
URL: <http://mail.openjdk.java.net/pipermail/core-libs-dev/attachments/20100429/946204eb/CharacterName2.java>