Unicode script support in Regex and Character class

Ulf Zibis Ulf.Zibis at gmx.de
Fri Apr 30 01:31:46 UTC 2010

I have corrected the statistics:

Current code from Sherman:
- A Map.Entry object takes 24 bytes (40 on a 64-bit machine)
- An Integer object for the key takes 12 bytes (20 on a 64-bit machine)
- A String object takes 36 + 2*length bytes, so for an average character
  name length of 26: 88 bytes (102 on a 64-bit machine)
--> one character name in the HashMap, including bucket overhead, would
    take ~140 bytes (~170 on a 64-bit machine)
--> 20,000 character names would take ~2.8 MByte (~3.4 on a 64-bit machine)
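
The arithmetic behind this estimate can be replayed as a small sketch. The
per-object sizes are the ones quoted above; the class and method names are
hypothetical, and the ~140/~170 bytes per name (objects plus bucket overhead)
are taken as given rather than derived:

```java
final class HashMapNameFootprint {

    // String size on a 32-bit VM as quoted above: 36 bytes + 2 bytes per char.
    static int stringBytes32(int nameLength) {
        return 36 + 2 * nameLength;
    }

    // Map.Entry (24) + Integer key (12) + String, without bucket overhead.
    static int perNameBytes32(int nameLength) {
        return 24 + 12 + stringBytes32(nameLength);
    }

    // Same sum with the 64-bit figures quoted above (40 + 20 + 102).
    static int perNameBytes64() {
        return 40 + 20 + 102;
    }

    static long totalBytes(int nameCount, int bytesPerName) {
        return (long) nameCount * bytesPerName;
    }

    public static void main(String[] args) {
        System.out.println("objects per name, 32-bit: " + perNameBytes32(26)); // 124
        System.out.println("objects per name, 64-bit: " + perNameBytes64());   // 162
        // Rounded up to ~140 / ~170 bytes per name for bucket overhead.
        System.out.println("20,000 names, 32-bit: " + totalBytes(20_000, 140) + " bytes");
        System.out.println("20,000 names, 64-bit: " + totalBytes(20_000, 170) + " bytes");
    }
}
```

The gap between 124 and ~140 bytes (162 and ~170 on 64-bit) is the bucket
overhead mentioned above.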

Measurements from my latest version:
- for byte[] names: 509,554 bytes
- for int[][] indexes:
-- base array with 4353 elements: 17,420 bytes
-- one int[] index for a block with an average length of 220: 892 bytes
-- sum: 17,420 + 97 * 892 bytes = 103,944 bytes
overall sum: 613,498 bytes (good enough)
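
To illustrate the idea of a byte[] for the names plus an int[][] index, here
is a minimal sketch. It is not the actual patch: the block size of 256 code
points, the length-prefix encoding, the -1 marker for unnamed code points and
all identifiers are assumptions made only for this example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

final class CompactNameTable {
    private static final int BLOCK_SHIFT = 8;              // assumption: 256 code points per block
    private static final int BLOCK_SIZE = 1 << BLOCK_SHIFT;

    private final byte[] names;   // all names back to back, each prefixed by one length byte
    private final int[][] index;  // index[block][offset] = start in names, or -1; unused blocks stay null

    CompactNameTable(Map<Integer, String> source) {
        index = new int[(Character.MAX_CODE_POINT >>> BLOCK_SHIFT) + 1][];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Map.Entry<Integer, String> e : new TreeMap<>(source).entrySet()) {
            int cp = e.getKey();
            int block = cp >>> BLOCK_SHIFT;
            if (index[block] == null) {
                index[block] = new int[BLOCK_SIZE];
                Arrays.fill(index[block], -1);
            }
            byte[] ascii = e.getValue().getBytes(StandardCharsets.US_ASCII);
            index[block][cp & (BLOCK_SIZE - 1)] = out.size();
            out.write(ascii.length);           // assumes names shorter than 256 bytes
            out.write(ascii, 0, ascii.length);
        }
        names = out.toByteArray();
    }

    // Returns the stored name, or null if the code point has none.
    String getName(int codePoint) {
        if (codePoint < 0 || codePoint > Character.MAX_CODE_POINT) return null;
        int[] block = index[codePoint >>> BLOCK_SHIFT];
        if (block == null) return null;
        int start = block[codePoint & (BLOCK_SIZE - 1)];
        if (start < 0) return null;
        int length = names[start] & 0xFF;
        return new String(names, start + 1, length, StandardCharsets.US_ASCII);
    }

    // Footprint quoted above: names + base array + 97 block arrays of 892 bytes.
    static int quotedFootprintBytes() {
        return 509_554 + 17_420 + 97 * 892;
    }
}
```

Only blocks that contain at least one named character get an int[] of their
own, which is why the index stays small compared to one object graph per name.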

I also gathered some statistics:

total name bytes: 509554
total names count: 19336
average name length: 26.35
used blocks: 97
average block length: 220.89
total words count: 7352
total words chars: 37427
     word  LETTER               occurs 5777 times
     word  WITH                 occurs 2271 times
     word  SMALL                occurs 2088 times
     word  SYLLABLE             occurs 1991 times
     word  SIGN                 occurs 1853 times
     word  CAPITAL              occurs 1578 times
     word  LATIN                occurs 1261 times
     word  YI                   occurs 1236 times
     word  CJK                  occurs 1160 times
     word  IDEOGRAPH            occurs 1094 times
     word  ARABIC               occurs 1047 times
     word  COMPATIBILITY        occurs 1009 times
     word  MATHEMATICAL         occurs 1007 times
     word  CUNEIFORM            occurs 982 times
     word  SYMBOL               occurs 962 times
     word  FORM                 occurs 798 times
     word  SYLLABICS            occurs 630 times
     word  CANADIAN             occurs 630 times
     word  BOLD                 occurs 567 times
     word  GREEK                occurs 517 times
     word  AND                  occurs 508 times
     word  LIGATURE             occurs 508 times
     word  DIGIT                occurs 506 times
     word  MUSICAL              occurs 498 times
     word  TIMES                occurs 492 times
     word  ETHIOPIC             occurs 461 times
     word  HANGUL               occurs 446 times
     word  ITALIC               occurs 423 times
     word  CYRILLIC             occurs 403 times
     word  RADICAL              occurs 385 times
     word  ABOVE                occurs 379 times
     word  SANS                 occurs 368 times
     word  -SERIF               occurs 368 times
     word  VOWEL                occurs 357 times
     word  ARROW                occurs 338 times
     word  DOTS                 occurs 328 times
     word  RIGHT                occurs 326 times
     word  FOR                  occurs 321 times
     word  LEFT                 occurs 316 times
     word  CIRCLED              occurs 312 times
     word  DOUBLE               occurs 308 times
     word  SQUARE               occurs 308 times
     word  VAI                  occurs 300 times
     word  FINAL                occurs 295 times
     word  COMBINING            occurs 293 times
     word  A                    occurs 284 times
     word  B                    occurs 277 times
     word  U                    occurs 269 times
     word  VARIATION            occurs 260 times
     word  SELECTOR             occurs 259 times
     word  PATTERN              occurs 257 times
     word  BRAILLE              occurs 256 times
     word  BYZANTINE            occurs 246 times
     word  O                    occurs 236 times
     word  ISOLATED             occurs 236 times
     word  VERTICAL             occurs 228 times
     word  BELOW                occurs 227 times
     word  DOT                  occurs 227 times
     word  KATAKANA             occurs 222 times
     word  MARK                 occurs 218 times
     word  E                    occurs 216 times
     word  KANGXI               occurs 214 times
     word  LINEAR               occurs 211 times
     word  MODIFIER             occurs 207 times
     word  TIBETAN              occurs 201 times
     word  TWO                  occurs 200 times
     word  I                    occurs 199 times
     word  STROKE               occurs 196 times
     word  MEEM                 occurs 192 times
     word  INITIAL              occurs 177 times
     word  WHITE                occurs 177 times
     word  CARRIER              occurs 175 times
     word  YEH                  occurs 174 times
     word  TO                   occurs 173 times
     word  BLACK                occurs 173 times
     word  ONE                  occurs 165 times
     word  NUMBER               occurs 160 times
     word  MONGOLIAN            occurs 156 times
     word  MYANMAR              occurs 156 times
     word  THREE                occurs 154 times
     word  HOOK                 occurs 152 times
     word  COPTIC               occurs 150 times
     word  KHMER                occurs 146 times
     word  TILE                 occurs 145 times
     word  BOX                  occurs 143 times
     word  PLUS                 occurs 142 times
     word  HORIZONTAL           occurs 137 times
     word  BRACKET              occurs 135 times
     word  HEBREW               occurs 133 times
     word  RIGHTWARDS           occurs 131 times
     word  OF                   occurs 128 times
     word  UP                   occurs 128 times
     word  DRAWINGS             occurs 128 times
     word  KA                   occurs 127 times
     word  ALEF                 occurs 126 times
     word  DOWN                 occurs 125 times
     word  OLD                  occurs 124 times
     word  HALFWIDTH            occurs 122 times
     word  FOUR                 occurs 121 times
     word  GEORGIAN             occurs 121 times
     word  BAR                  occurs 121 times
     word  BALINESE             occurs 121 times
     word  -THAN                occurs 120 times
     word  -CREE                occurs 119 times
     word  L                    occurs 117 times
     word  R                    occurs 117 times
     word  IDEOGRAM             occurs 117 times
     word  HEAVY                occurs 117 times
     word  EQUAL                occurs 115 times
     word  TAI                  occurs 115 times
     word  IDEOGRAPHIC          occurs 115 times
     word  WEST                 occurs 113 times
     word  PARENTHESIZED        occurs 113 times
     word  N                    occurs 112 times
     word  DEVANAGARI           occurs 112 times
     word  FIVE                 occurs 110 times
     word  SCRIPT               occurs 109 times
     word  TAG                  occurs 105 times
     word  HAH                  occurs 104 times
     word  FULLWIDTH            occurs 103 times
     word  TILDE                occurs 101 times
     word  OVER                 occurs 101 times
     word  LIGHT                occurs 100 times
     word  CHARACTER            occurs 100 times
     word  DOMINO               occurs 100 times
     word  NUMERIC              occurs 99 times
     word  LEFTWARDS            occurs 99 times
     word  FRAKTUR              occurs 99 times
     word  HALF                 occurs 98 times
     word  S                    occurs 97 times
     word  MALAYALAM            occurs 95 times
     word  GLAGOLITIC           occurs 94 times
     word  C                    occurs 93 times
     word  JEEM                 occurs 93 times
     word  TELUGU               occurs 93 times
     word  MEDIAL               occurs 91 times
     word  CHOSEONG             occurs 91 times
     word  ACUTE                occurs 91 times
     word  ARMENIAN             occurs 91 times
     word  BENGALI              occurs 91 times
     word  TONE                 occurs 90 times
     word  OR                   occurs 89 times
     word  HIRAGANA             occurs 89 times
     word  HA                   occurs 87 times
     word  THAI                 occurs 87 times
     word  Z                    occurs 86 times
     word  CIRCLE               occurs 86 times
     word  KANNADA              occurs 86 times
     word  Y                    occurs 85 times
     word  CHEROKEE             occurs 85 times
     word  EIGHT                occurs 84 times
     word  ORIYA                occurs 84 times
     word  GUJARATI             occurs 83 times
     word  CHAM                 occurs 83 times
     word  SIX                  occurs 83 times
     word  DASIA                occurs 83 times
     word  JONGSEONG            occurs 82 times
     word  M                    occurs 81 times
     word  H                    occurs 81 times
     word  T                    occurs 81 times
     word  SAURASHTRA           occurs 81 times
     word  TETRAGRAM            occurs 81 times
     word  RUNIC                occurs 81 times
     word  NEW                  occurs 81 times
     word  DESERET              occurs 80 times
     word  SINHALA              occurs 80 times
     word  LUE                  occurs 80 times
     word  D                    occurs 79 times
     word  G                    occurs 79 times
     word  V                    occurs 79 times
     word  NOTATION             occurs 79 times
     word  SYRIAC               occurs 79 times
     word  CIRCUMFLEX           occurs 79 times
     word  PSILI                occurs 79 times
     word  GURMUKHI             occurs 79 times
     word  SEVEN                occurs 78 times
     word  NINE                 occurs 77 times
     word  VOCALIC              occurs 77 times
     word  LONG                 occurs 74 times
     word  LINE                 occurs 74 times
     word  LEPCHA               occurs 74 times
     word  K                    occurs 73 times
     word  DIAERESIS            occurs 73 times
     word  -STRUCK              occurs 72 times
     word  HAMZA                occurs 72 times
     word  TAMIL                occurs 72 times
     word  APL                  occurs 70 times
     word  FUNCTIONAL           occurs 70 times
     word  TELEGRAPH            occurs 69 times
     word  MAKSURA              occurs 69 times
     word  MACRON               occurs 68 times
     word  ALPHA                occurs 68 times
     word  GRAVE                occurs 68 times
     word  P                    occurs 67 times
     word  OMEGA                occurs 67 times
     word  ACCENT               occurs 67 times
     word  JUNGSEONG            occurs 67 times
     word  LIMBU                occurs 66 times
     word  BARB                 occurs 66 times
     word  TRIANGLE             occurs 66 times
     word  LOW                  occurs 66 times
     word  KHAROSHTHI           occurs 65 times
     word  BOPOMOFO             occurs 65 times
     word  LAO                  occurs 65 times
     word  NOT                  occurs 65 times
     word  RA                   occurs 64 times
     word  YA                   occurs 64 times
     word  HEXAGRAM             occurs 64 times
     word  HARPOON              occurs 64 times
     word  TA                   occurs 63 times
     word  REVERSED             occurs 63 times
     word  X                    occurs 62 times
     word  ANGLE                occurs 62 times
     word  MA                   occurs 62 times
     word  HIGH                 occurs 62 times
     word  MONOSPACE            occurs 62 times
     word  OXIA                 occurs 62 times
     word  VARIA                occurs 62 times
     word  GREATER              occurs 62 times
     word  J                    occurs 61 times
     word  PA                   occurs 61 times
     word  LI                   occurs 61 times
     word  KHAH                 occurs 61 times
     word  LAGAB                occurs 61 times
     word  LESS                 occurs 61 times
     word  W                    occurs 59 times
     word  LA                   occurs 59 times
     word  LOWER                occurs 59 times
     word  NKO                  occurs 59 times
     word  NUMERAL              occurs 58 times
     word  LAM                  occurs 58 times
     word  TURNED               occurs 58 times
     word  F                    occurs 57 times
     word  DA                   occurs 57 times
     word  AEGEAN               occurs 57 times
     word  SHORT                occurs 57 times
     word  GA2                  occurs 56 times
     word  PHAGS                occurs 56 times
     word  OPEN                 occurs 56 times
     word  NA                   occurs 56 times
     word  ETA                  occurs 56 times
     word  -PA                  occurs 56 times
     word  STOP                 occurs 56 times
     word  SUNDANESE            occurs 55 times
     word  CYPRIOT              occurs 55 times
     word  BREVE                occurs 55 times
     word  TIFINAGH             occurs 55 times
     word  IOTA                 occurs 54 times
     word  ACROPHONIC           occurs 53 times
     word  SA                   occurs 53 times
     word  PERSIAN              occurs 53 times
     word  ZERO                 occurs 53 times
     word  UPPER                occurs 53 times
     word  ROMAN                occurs 52 times
     word  SUBJOINED            occurs 52 times
     word  NOON                 occurs 52 times
BUILD SUCCESSFUL (total time: 14 seconds)
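
For reference, word statistics of this kind could be produced along the
following lines. This is only a sketch, not the code that generated the list
above: the tokenizing rule (split at spaces, keep a hyphen attached to the
word that follows it, as in "-SERIF") is inferred from the output, and the
class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class NameWordStats {

    // Counts how often each word occurs across all names.
    static Map<String, Integer> countWords(Iterable<String> names) {
        Map<String, Integer> counts = new HashMap<>();
        for (String name : names) {
            // Split at a space (dropped) or just before a hyphen (kept).
            for (String word : name.split(" |(?=-)")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Most frequent words first; ties are ordered alphabetically.
    static List<Map.Entry<String, Integer>> byFrequency(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
        sorted.sort(Map.Entry.<String, Integer>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Integer>comparingByKey()));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "LATIN CAPITAL LETTER A",
                "LATIN SMALL LETTER A",
                "MATHEMATICAL SANS-SERIF BOLD CAPITAL A");
        Map<String, Integer> counts = countWords(sample);
        System.out.println("total words count: " + counts.size());
        for (Map.Entry<String, Integer> e : byFrequency(counts)) {
            System.out.printf("     word  %-20s occurs %d times%n", e.getKey(), e.getValue());
        }
    }
}
```

Frequent words such as LETTER, WITH or SMALL are the obvious candidates for a
dictionary-based encoding of the names.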

