Unicode script support in Regex and Character class

Ulf Zibis Ulf.Zibis at gmx.de
Thu Apr 22 13:50:30 UTC 2010


On 22.04.2010 10:01, Xueming Shen wrote:
> Hi,
>
> Here is the webrev of the proposal to add Unicode script support in 
> regex and j.l.Character.
>
> http://cr.openjdk.java.net/~sherman/script/webrev
>
> and the corresponding blenderrev
>
> http://cr.openjdk.java.net/~sherman/script/blenderrev.html
>
> Please comment on the APIs before I submit the CCC, especially

- I like the idea of saving the data in a compressed binary file instead 
of as static data in the class file.
- Wouldn't PreHashMaps initialize faster than normal HashMaps in 
j.l.Character.UnicodeScript and j.l.CharacterName?
- As an alternative to a hash table lookup, I guess retrieving the 
pointers from a memory-saving sorted array via binary search would be 
fast enough.
- j.l.CharacterName:
-- You could instantiate the HashMap with capacity=cpLeng.
-- Is it faster to first copy the whole data into a byte[] and then use 
ByteBuffer.getInt etc., compared to using the DataInputStream methods 
directly?
-- You could create one very long String from the whole data and then 
use substring for the individual strings, which could share the same 
backing char[].
-- I don't think it's a good idea to hold the whole data in memory, 
especially as String objects. Additionally, the backing char[]s occupy 
twice the space of a byte[].
-- The big new byte[total] and, later, the huge number of String objects 
could result in an OOM error on a small VM heap.
-- As a compromise, you could put the cp->nameOff pointers in a separate 
uncompressed data file, hold only this in memory or access it via 
DirectByteBuffer, and read the string data from a separate file only on 
request from Character.getName(int codePoint). Optionally, a PreHashMap 
could cache the individually loaded strings.
-- Anyway, having DirectByteBuffer access to the deflated data would be a 
performance/footprint gain.
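
To illustrate the sorted-array idea for j.l.CharacterName, here is a 
minimal sketch. It is not the webrev's code: the class LazyNameTable and 
its constructor are hypothetical, and the data is passed in as arrays 
instead of being read from a file so that the example runs on its own. 
The names stay in one byte[], and a String is created only when a name is 
requested.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch: names are kept as ASCII bytes, and a String is
// only created on request. In a real implementation the three arrays
// would be loaded from the data file(s) instead of being built here.
final class LazyNameTable {
    private final int[] codePoints; // sorted ascending
    private final int[] offsets;    // name i spans offsets[i]..offsets[i + 1]
    private final byte[] nameBytes; // all names concatenated

    LazyNameTable(int[] sortedCodePoints, String[] names) {
        codePoints = sortedCodePoints.clone();
        offsets = new int[names.length + 1];
        StringBuilder all = new StringBuilder();
        for (int i = 0; i < names.length; i++) {
            offsets[i] = all.length();
            all.append(names[i]);
        }
        offsets[names.length] = all.length();
        nameBytes = all.toString().getBytes(StandardCharsets.US_ASCII);
    }

    // Returns the name of the code point, or null if it is not in the table.
    String getName(int codePoint) {
        int i = Arrays.binarySearch(codePoints, codePoint);
        if (i < 0) {
            return null;
        }
        int length = offsets[i + 1] - offsets[i];
        return new String(nameBytes, offsets[i], length,
                          StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        LazyNameTable table = new LazyNameTable(
            new int[] {0x41, 0x61, 0x3B1},
            new String[] {"LATIN CAPITAL LETTER A",
                          "LATIN SMALL LETTER A",
                          "GREEK SMALL LETTER ALPHA"});
        System.out.println(table.getName(0x3B1));
    }
}
```

Per entry this costs two ints plus the name bytes, with no per-name 
String or map entry object held in memory. A small cache in front of 
getName could be added if repeated lookups turn out to matter.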


>
> (1) to use enum for the j.l.Character.UnicodeScript (compared to the 
> traditional j.l.c.Subset)

- enum j.l.Character.UnicodeScript:
-- IIRC, enums are handled internally as int constants, so retrieving an 
element by name would need a name->int lookup.
-- So UnicodeScript.forName would have to look up twice:
--- alias->fullName (name of the enum element)
--- fullName->internal int constant
-- I suggest adding the full names to the aliases map and looking up 
only once.
-- Why don't you use Arrays.binarySearch in UnicodeScript.of(int 
codePoint)?
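
Both suggestions can be sketched roughly as follows. This is only an 
illustration under stated assumptions: ScriptSketch is a hypothetical 
stand-in for UnicodeScript, and its range table is a tiny hand-picked 
subset, not real Unicode data (UNKNOWN here simply means "not covered by 
this toy table"). The aliases are the ISO 15924 codes.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for j.l.Character.UnicodeScript.
enum ScriptSketch {
    COMMON, LATIN, GREEK, UNKNOWN;

    // Start code point of each range, sorted ascending. A range ends just
    // before the next start. Toy data for illustration only.
    private static final int[] STARTS = {
        0x0000, 0x0041, 0x005B, 0x0061, 0x007B, 0x0080, 0x03B1, 0x03CA
    };
    private static final ScriptSketch[] SCRIPTS = {
        COMMON, LATIN, COMMON, LATIN, COMMON, UNKNOWN, GREEK, UNKNOWN
    };

    // One map holds the full names and the aliases, so forName needs a
    // single lookup.
    private static final Map<String, ScriptSketch> BY_NAME = new HashMap<>();
    static {
        for (ScriptSketch s : values()) {
            BY_NAME.put(s.name(), s);
        }
        BY_NAME.put("ZYYY", COMMON);
        BY_NAME.put("LATN", LATIN);
        BY_NAME.put("GREK", GREEK);
        BY_NAME.put("ZZZZ", UNKNOWN);
    }

    static ScriptSketch of(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException();
        }
        int i = Arrays.binarySearch(STARTS, codePoint);
        if (i < 0) {
            // Not an exact range start: binarySearch returned
            // -(insertionPoint) - 1, and the containing range is the
            // one just before the insertion point.
            i = -i - 2;
        }
        return SCRIPTS[i];
    }

    static ScriptSketch forName(String name) {
        ScriptSketch s = BY_NAME.get(name.toUpperCase(Locale.ENGLISH));
        if (s == null) {
            throw new IllegalArgumentException(name);
        }
        return s;
    }
}
```

Because STARTS[0] is 0 and the code point is validated first, the 
computed index can never be negative.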


> (2) the piggyback method j.l.c.getName() :-)
> (3) the syntax for script constructs. In addition to the "normal"
>     \p{InScriptName} and \P{InScriptName} for the script support
>     I'm also adding
>    \p{script=ScriptName} \P{script=ScriptName}  for the new script 
> support
>    \p{block=BlockName} \P{block=BlockName}  for the "existing" block 
> support
>    \p{general_category=CategoryName} \P{general_category=CategoryName} 
> for the "existing" gc
>    Perl recently also started to accept this  \p{propName=propValue} 
> Unicode style.
>    It opens the door for future "expanding", for example \p{name=XYZ} :-)

I'm missing \p{InScriptName} in Pattern javadoc.
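
For reference, a small sketch of how the \p{propName=propValue} forms 
from (3) would be used. It relies on the behaviour of java.util.regex in 
released JDKs (Java 7 and later); that the webrev under review behaves 
identically is an assumption. The class and helper names are made up for 
the example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only Pattern.matches is a real JDK call.
final class ScriptSyntaxDemo {
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String greek = "\u03B1\u03B2\u03B3"; // alpha, beta, gamma

        // script=...: all characters belong to the Greek script
        System.out.println(matches("\\p{script=Greek}+", greek));
        // \P{...} negates: no character of "abc" is Greek
        System.out.println(matches("\\P{script=Greek}+", "abc"));
        // block=...: 'A' lies in the Basic Latin block
        System.out.println(matches("\\p{block=BasicLatin}", "A"));
        // general_category=...: 'A' is an uppercase letter (Lu)
        System.out.println(matches("\\p{general_category=Lu}", "A"));
    }
}
```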

-Ulf


> (4) and of course, the wording.
>
> Thanks,
> Sherman
>
>
>



