Unicode script support in Regex and Character class

Fri Apr 23 23:09:23 UTC 2010

Ulf Zibis wrote:
>
> - I like the idea of saving the data in a compressed binary file
> instead of as static data in the class file.
> - Wouldn't PreHashMaps be initialized faster than normal HashMaps in
> j.l.Character.UnicodeScript and j.l.CharacterName?
I don't think so. The key for these 2 cases is the whole Unicode range.
But you can always try. The current binary search for script is
definitely not a perfect solution.

> - As alternative to lookup in a hash table, I guess retrieving the 
> pointers from a memory saving sorted array via binary search would be 
> fast enough.
> - j.l.CharacterName:
> -- You could instantiate the HashMap with capacity=cpLeng
I changed the data file "format" a bit, so the overall uniName.dat is
now less than 88k (the previous version was 122+k), but I can no longer
use cpLen as the capacity for the hashmap. I'm now using a hardcoded
20000 for 5.2.

> -- Is it faster to first copy the whole data into a byte[] and then
> use ByteBuffer.getInt etc., compared with directly using
> DataInputStream methods?
> -- You could create a very long String with the whole data and then 
> use subString for the individual strings which could share the same 
> backing char[].
> -- I don't think it's a good idea to hold the whole data in memory,
> especially as String objects; additionally, the backing char[]'s
> occupy twice the space of a byte[].
> -- the big new byte[total] and later the huge amount of String objects 
> could result in OOM error on small VM heap.
> -- as compromise, you could put the cp->nameOff pointers in a separate 
> not-compressed data file, only hold this in memory, or access it via 
> DirectByteBuffer, and read the string data from separate file only on 
> request from Character.getName(int codePoint). As option, a PreHashMap 
> could cache individual loaded strings.
> -- Anyway, having DirectByteBuffer access on deflated data would be a
> performance/footprint gain.
>
Sorry, I don't think I fully understand your points here.

I believe you would NOT see any meaningful performance boost from using
DirectByteBuffer, given the size of the data file (88k). It would
probably slow things down a little.

If you take a look at the latest version
http://cr.openjdk.java.net/~sherman/script/webrev/src/share/classes/java/lang/CharacterName.java.html
you probably will not consider using the DataInputStream class. I no
longer store the code point value for most entries, only the length of
the name, for which 1 byte is definitely big enough.
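
To illustrate the "only the length of the name" idea, here is a minimal
sketch of decoding names stored as a 1-byte length followed by the name
bytes. The layout, the class name LengthPrefixedNamesSketch and both
methods are assumptions made for illustration; this is not the actual
uniName.dat format, which is only described in outline above.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical layout: [1-byte length][ASCII name bytes], repeated.
// The code point is implied by position within a run, so it is not stored.
final class LengthPrefixedNamesSketch {

    // Encodes names as length-prefixed records (helper for the demo only).
    static byte[] encodeRun(String... names) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String name : names) {
            byte[] raw = name.getBytes(StandardCharsets.US_ASCII);
            if (raw.length > 255) {
                throw new IllegalArgumentException("name too long for 1-byte length");
            }
            out.write(raw.length);          // writes the low 8 bits
            out.write(raw, 0, raw.length);
        }
        return out.toByteArray();
    }

    // Decodes all records from an in-memory byte[] via ByteBuffer.
    static List<String> decodeRun(byte[] data) {
        ByteBuffer bb = ByteBuffer.wrap(data);
        List<String> names = new ArrayList<>();
        while (bb.hasRemaining()) {
            int len = bb.get() & 0xFF;      // read the length as unsigned
            byte[] raw = new byte[len];
            bb.get(raw);
            names.add(new String(raw, StandardCharsets.US_ASCII));
        }
        return names;
    }

    public static void main(String[] args) {
        byte[] data = encodeRun("SPACE", "EXCLAMATION MARK");
        System.out.println(decodeRun(data));
    }
}
```

With single-byte lengths and the whole file already in a byte[], plain
ByteBuffer.get calls are enough, which is the reason DataInputStream
adds little here.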

Yes, the final table takes about 500k. We might consider using a
weakref or something if memory is really a concern. But the table is
initialized only if you invoke Character.getName(), and I would expect
that most applications would never get down there.
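
A minimal sketch of the "weakref or something" idea combined with
initialization on first use. It uses SoftReference as one possible
choice; that choice, the class name LazyNameTableSketch, the load()
helper and the three-entry table are all assumptions for illustration,
not what CharacterName does.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

final class LazyNameTableSketch {
    // Softly reachable, so the GC may reclaim the table under memory pressure.
    private static SoftReference<Map<Integer, String>> ref;

    // Counts how often the table was (re)built; for demonstration only.
    static int loads = 0;

    // Hypothetical loader; real code would read and inflate the data file.
    private static Map<Integer, String> load() {
        loads++;
        Map<Integer, String> m = new HashMap<>();
        m.put(0x0020, "SPACE");
        m.put(0x0041, "LATIN CAPITAL LETTER A");
        m.put(0x03B1, "GREEK SMALL LETTER ALPHA");
        return m;
    }

    // Builds the table on first use and rebuilds it if it was collected.
    static synchronized String getName(int codePoint) {
        Map<Integer, String> table = (ref == null) ? null : ref.get();
        if (table == null) {
            table = load();
            ref = new SoftReference<>(table);
        }
        return table.get(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(getName(0x41));
        System.out.println(getName(0x3B1));
    }
}
```

Applications that never ask for a name never pay for the table, and
those that do can have it reclaimed and rebuilt later.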

>
>>
>> (1) to use enum for the j.l.Character.UnicodeScript (compared to the 
>> traditional j.l.c.Subset)
>
> - enum j.l.Character.UnicodeScript:
> -- IIRC, enums internally are handled as int constants, so retrieving 
> an element via name would need a name->int lookup
> -- So UnicodeScript.forName would have to lookup 2 times
> --- alias->fullName (name of enum element)
> --- fullName->internal int constant
> -- I suggest adding the full names to the aliases map and only looking
> up once.
Not really. It's not alias->fullName, it's alias->UnicodeScript constant.
So if the name passed in is an alias, we don't do the second lookup.
That said, it's always a trade-off between memory use and speed. Putting
all the full names in the aliases map would definitely avoid the second
lookup when the name passed in is a canonical name, at the price of
having name entries in both the alias map and the enum's internal
hashmap. I really don't know which one is the better choice. I did it
this way on the assumption that the lookup for a script name is not
critical. I might be wrong.
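
The lookup order described above can be sketched as follows. The
Script enum is a hypothetical four-constant stand-in for
Character.UnicodeScript, and the alias table holds only a few ISO 15924
codes; neither is the actual implementation.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class ScriptLookupSketch {
    // Hypothetical stand-in for Character.UnicodeScript.
    enum Script { LATIN, GREEK, CYRILLIC, COMMON }

    // Maps alias -> enum constant directly, so an alias hit is one lookup.
    private static final Map<String, Script> ALIASES = new HashMap<>();
    static {
        ALIASES.put("LATN", Script.LATIN);
        ALIASES.put("GREK", Script.GREEK);
        ALIASES.put("CYRL", Script.CYRILLIC);
        ALIASES.put("ZYYY", Script.COMMON);
    }

    static Script forName(String name) {
        String key = name.toUpperCase(Locale.ENGLISH);
        Script s = ALIASES.get(key);
        if (s != null) {
            return s;                 // alias: no second lookup needed
        }
        // Canonical name: falls through to the enum's own name lookup.
        // Throws IllegalArgumentException for an unknown name.
        return Script.valueOf(key);
    }

    public static void main(String[] args) {
        System.out.println(forName("Latn"));
        System.out.println(forName("Greek"));
    }
}
```

Only a canonical name pays for both the alias miss and the valueOf
call; adding the full names to ALIASES would remove that second step at
the cost of duplicated entries.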

> -- Why don't you use Arrays.binarySearch in UnicodeScript.of(int 
> codePoint) ?
>
>

Why? I don't know :-) Maybe copying and pasting the UnicodeBlock lookup
was more convenient than using Arrays.binarySearch. Not a big deal.
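
For reference, a range lookup built on Arrays.binarySearch could look
roughly like this. The two arrays are a deliberately simplified toy
table, not the real Unicode script ranges, and the class and method
names are hypothetical.

```java
import java.util.Arrays;

final class ScriptRangeSketch {
    // Toy data: first code point of each run, sorted ascending.
    private static final int[] STARTS = {
        0x0000, 0x0041, 0x005B, 0x0061, 0x007B, 0x0370, 0x0400
    };
    // Script of the run starting at STARTS[i] (illustrative values only).
    private static final String[] SCRIPTS = {
        "COMMON", "LATIN", "COMMON", "LATIN", "COMMON", "GREEK", "CYRILLIC"
    };

    static String of(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("invalid code point");
        }
        int i = Arrays.binarySearch(STARTS, codePoint);
        if (i < 0) {
            // Not an exact run start: binarySearch returned
            // -(insertionPoint) - 1, and the containing run is the one
            // just before the insertion point.
            i = -i - 2;
        }
        return SCRIPTS[i];
    }

    public static void main(String[] args) {
        System.out.println(of('A'));
        System.out.println(of(0x03B1));
    }
}
```

Because STARTS[0] is 0 and the code point is validated first, the
computed index is never negative.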

Thanks,
-Sherman