Unicode script support in Regex and Character class

Ulf Zibis Ulf.Zibis at gmx.de
Sat Apr 24 02:36:31 UTC 2010


On 24.04.2010 01:09, Xueming Shen wrote:
> Ulf Zibis wrote:
>>
>> - I like the idea of saving the data in a compressed binary file 
>> instead of class-file static data.
>> - Wouldn't PreHashMaps initialize faster than normal HashMaps in 
>> j.l.Character.UnicodeScript and j.l.CharacterName?
> I don't think so. The key for these 2 cases is the whole unicode range.

At least the aliases map has string keys.

> But you can always try. Current
> binary-search for script definitely is not a perfect solution.

In most cases you won't get an exact match from the CharacterName 
HashMap, so you have to do the binary search anyway.

>
>> - As an alternative to a hash table lookup, I guess retrieving the 
>> pointers from a memory-saving sorted array via binary search would be 
>> fast enough.
>> - j.l.CharacterName:
>> -- You could instantiate the HashMap with capacity=cpLeng
> I changed the data file "format" a bit, so now the overall uniName.dat 
> is less than 88k (last version is 122+k), but

Is this the compressed or the uncompressed size?

> then I can no longer use cpLen as the capacity for the hashmap. I'm now 
> using a hardcoded 20000 for 5.2.

You could pre-calculate the actual value with the help of 
generatecharacter/CharacterName.java
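
To illustrate the capacity point, here is a minimal sketch of sizing a HashMap from a pre-computed entry count so that it does not rehash while being filled. The class and method names are hypothetical, and the entry count would have to come from the generator tool; nothing here is taken from the webrev.

```java
import java.util.HashMap;
import java.util.Map;

final class PresizedMaps {
    // Default HashMap load factor: the table grows once size exceeds capacity * load factor.
    static final float LOAD_FACTOR = 0.75f;

    // Smallest initial capacity that holds expectedEntries without a resize.
    static int capacityFor(int expectedEntries) {
        return (int) (expectedEntries / LOAD_FACTOR) + 1;
    }

    // Hypothetical helper: a map pre-sized for a known number of entries.
    static <K, V> Map<K, V> newMapFor(int expectedEntries) {
        return new HashMap<>(capacityFor(expectedEntries), LOAD_FACTOR);
    }
}
```

HashMap rounds the requested capacity up to a power of two internally, so the exact count mainly helps to avoid both under-sizing and gross over-sizing.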

>
> I believe you would NOT see any meaningful performance boost from 
> using DirectByteBuffer, given the
> size of the data file, 88k. It probably will slow it down a little.

If you read the whole file, yes, but what about retrieving a single 
datum from a distinct position?
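
A minimal sketch of what retrieving a single datum from a distinct position could look like with a ByteBuffer and its absolute getters. The fixed 4-byte record layout and all names are assumptions for illustration only; the direct buffer merely stands in for a mapped data file, and the sketch says nothing about which approach is actually faster.

```java
import java.nio.ByteBuffer;

final class RecordBuffer {
    // Assumed layout: fixed-size records of one big-endian int each.
    static final int RECORD_SIZE = 4;

    // Packs the values into a direct buffer (stand-in for a mapped data file).
    static ByteBuffer build(int[] values) {
        ByteBuffer buf = ByteBuffer.allocateDirect(values.length * RECORD_SIZE);
        for (int v : values) {
            buf.putInt(v);
        }
        buf.flip();
        return buf;
    }

    // Absolute read: touches only the requested record and leaves the
    // buffer's position unchanged.
    static int valueAt(ByteBuffer buf, int recordIndex) {
        return buf.getInt(recordIndex * RECORD_SIZE);
    }
}
```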


>
> If you take a look at the last version
> http://cr.openjdk.java.net/~sherman/script/webrev/src/share/classes/java/lang/CharacterName.java.html 
>
> You probably will not consider using the DataInputStream class. I no 
> longer store the code point value for
> most entries, only the length of the name, for which 1 byte is 
> definitely big enough.

You could save one more byte:

            do {
                int len = ba[off++] & 0xff;
                if (len < 0x11) {
                    // always big-endian
                    cp = (len << 16) |
                         ((ba[off++] & 0xff) << 8) |
                         ((ba[off++] & 0xff));
                    len = ba[off++] & 0xff;
                } else {
                    len -= 0x11;
                    cp++;
                }
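
A self-contained sketch of the record layout the fragment above appears to imply, written as an encoder/decoder pair so that it can be checked by a round trip. The assumptions: a first byte below 0x11 is the high byte of an explicit 3-byte code point followed by a length byte, and any other first byte is the length plus 0x11 for the code point following the previous one. The class name, the methods and the (code point, length) pair representation are hypothetical, and where the name bytes themselves live is left out.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

final class NameIndexCodec {
    // Code point high bytes are 0x00..0x10, so first bytes below 0x11 mark an explicit code point.
    static final int PLANE_LIMIT = 0x11;

    // Encodes (codePoint, nameLength) pairs; consecutive code points cost one byte each.
    static byte[] encode(int[] cps, int[] lens) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = -2; // guarantees the first record is written explicitly
        for (int i = 0; i < cps.length; i++) {
            int cp = cps[i];
            int len = lens[i];
            if (!Character.isValidCodePoint(cp) || len < 0 || len > 0xff - PLANE_LIMIT) {
                throw new IllegalArgumentException("bad record at index " + i);
            }
            if (cp == prev + 1) {
                out.write(len + PLANE_LIMIT);     // implicit code point: previous + 1
            } else {
                out.write(cp >>> 16);             // always big-endian
                out.write((cp >>> 8) & 0xff);
                out.write(cp & 0xff);
                out.write(len);
            }
            prev = cp;
        }
        return out.toByteArray();
    }

    // Decodes the byte stream back into {codePoint, nameLength} pairs.
    static List<int[]> decode(byte[] ba) {
        List<int[]> result = new ArrayList<>();
        int off = 0;
        int cp = 0;
        while (off < ba.length) {
            int len = ba[off++] & 0xff;
            if (len < PLANE_LIMIT) {
                cp = (len << 16) | ((ba[off++] & 0xff) << 8) | (ba[off++] & 0xff);
                len = ba[off++] & 0xff;
            } else {
                len -= PLANE_LIMIT;
                cp++;
            }
            result.add(new int[] {cp, len});
        }
        return result;
    }
}
```

Under these assumptions an explicit record takes 4 bytes and an implicit one takes 1 byte, at the price of limiting name lengths to 0xff - 0x11 = 238.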

>
> Yes, the final table takes about 500k; we might consider using a 
> weakref or something, if memory is really
> a concern. But the table will get initialized only if you invoke 
> Character.getName(),

Yes, retrieving one single Character.getName() would cause the whole map 
to initialize. Is that economical?
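
One way the "weakref or something" idea could look: a minimal sketch of a lazily built table held through a SoftReference, so that the garbage collector may reclaim it under memory pressure and it is rebuilt on the next request. The class name, the tiny placeholder table and the load counter are hypothetical and only stand in for decoding the real data file.

```java
import java.lang.ref.SoftReference;

final class LazyNameTable {
    private static SoftReference<String[]> ref;
    static int loads; // counts how often the table was (re)built, for illustration

    // Returns the table, building it on first use or after it was reclaimed.
    static synchronized String[] table() {
        String[] t = (ref == null) ? null : ref.get();
        if (t == null) {
            t = load();
            ref = new SoftReference<>(t);
        }
        return t;
    }

    // Placeholder for reading and decoding the data file.
    private static String[] load() {
        loads++;
        return new String[] {"LATIN CAPITAL LETTER A", "LATIN CAPITAL LETTER B"};
    }
}
```

This addresses only the memory retained after the first call; a single lookup still pays for building the whole table.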

> I would expect most
> applications would never get down there.
>
>>
>>>
>>> (1) to use enum for the j.l.Character.UnicodeScript (compared to the 
>>> traditional j.l.c.Subset)
>>
>> - enum j.l.Character.UnicodeScript:
>> -- IIRC, enums internally are handled as int constants, so retrieving 
>> an element via name would need a name->int lookup
>> -- So UnicodeScript.forName would have to lookup 2 times
>> --- alias->fullName (name of enum element)
>> --- fullName->internal int constant
>> -- I suggest adding the full names to the aliases map and looking 
>> up only once.
> Not really. It's not alias->fullName, it's alias->UnicodeScript 
> constant. So if the name passed in is an alias, then
> we don't do the second lookup.
That is what I meant to say; sorry for not being more detailed.

> That said, it's always a trade-off between memory use and speed. Putting all
> full names in the aliases map would definitely avoid the second lookup if 
> the name passed in is a canonical name, at
> the price of having name entries in both the alias map and the enum's internal 
> hashmap.

~100 * (4 + 4) bytes against the 500,000 bytes above; does that matter?
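
A minimal sketch of the single-lookup idea: one map that holds both the aliases and the full names, so that forName needs exactly one hash lookup either way. The three-element enum and the class name are hypothetical stand-ins, not the proposed j.l.Character.UnicodeScript; the aliases used are the ISO 15924 codes for these scripts.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class ScriptNames {
    // Hypothetical stand-in for the real script enum.
    enum Script { LATIN, GREEK, CYRILLIC }

    private static final Map<String, Script> BY_NAME = new HashMap<>();
    static {
        // Aliases map directly to the enum constant ...
        BY_NAME.put("LATN", Script.LATIN);
        BY_NAME.put("GREK", Script.GREEK);
        BY_NAME.put("CYRL", Script.CYRILLIC);
        // ... and so do the full names, at the cost of one extra entry per script.
        for (Script s : Script.values()) {
            BY_NAME.put(s.name(), s);
        }
    }

    // One lookup, whether the argument is an alias or a full name.
    static Script forName(String name) {
        Script s = BY_NAME.get(name.toUpperCase(Locale.ENGLISH));
        if (s == null) {
            throw new IllegalArgumentException("unknown script: " + name);
        }
        return s;
    }
}
```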

> I really don't know which
> one is a better choice. I did it this way with the assumption the 
> lookup for script name is not critical. I
> might be wrong.
>
>
>> -- Why don't you use Arrays.binarySearch in UnicodeScript.of(int 
>> codePoint) ?
>>
>>
>
> why? I don't know:-) Maybe the copy/paste from UnicodeBlock lookup is 
> more convenient than using
> the Arrays.binarySearch. Not a big deal.

So both could use Arrays.binarySearch ;-)
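
A minimal sketch of a range lookup built on Arrays.binarySearch, assuming the data is stored as a sorted array of range start code points with a parallel array of values. The class name and the tiny table are hypothetical and deliberately simplified; they are not the real Unicode script assignments.

```java
import java.util.Arrays;

final class ScriptRangeLookup {
    // Toy data: each range runs from its start up to the next start - 1.
    private static final int[] STARTS = {
        0x0000, 0x0041, 0x005B, 0x0061, 0x007B, 0x0370, 0x0400
    };
    private static final String[] SCRIPTS = {
        "COMMON", "LATIN", "COMMON", "LATIN", "COMMON", "GREEK", "CYRILLIC"
    };

    static String scriptOf(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("invalid code point: " + codePoint);
        }
        int idx = Arrays.binarySearch(STARTS, codePoint);
        if (idx < 0) {
            // Not a range start: binarySearch returned -(insertionPoint) - 1,
            // and the containing range begins at insertionPoint - 1.
            idx = -idx - 2;
        }
        return SCRIPTS[idx];
    }
}
```

Because the first range starts at 0, the computed index is never negative for a valid code point.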

-Ulf





