Unicode script support in Regex and Character class

Xueming Shen xueming.shen at oracle.com
Tue Apr 27 17:03:52 UTC 2010


Ulf Zibis wrote:
> Am 27.04.2010 06:25, schrieb Xueming Shen:
>> Ulf Zibis wrote:
>>> Am 24.04.2010 01:09, schrieb Xueming Shen:
>>>>
>>>> I changed the data file "format" a bit, so now the overall 
>>>> uniName.dat is less than 88k (the last version was 122+k), but
>>>> I can no longer use cpLen as the capacity for the hashmap. I'm 
>>>> now using a hardcoded 20000 for 5.2.
>>>
>>> Again, is 88k the compressed or the uncompressed size ?
>>
>> It's the size of the compressed data.
>
> I'm wondering, since script.txt is only ~120k.

Ulf, you know we are not talking about Unicode scripts here but Unicode
character names, right?
Unicode character name data is stored in UnicodeData.txt; you can find
it at make/tools/UnicodeData.

>
> - and last but not least, use the given APIs for byte code footprint 
> reduction as much as you can. Give good programming examples, as newbies 
> tend to use API sources as the first template for their own code. Seeing 
> API use cases helps them become familiar with the complexity of the 
> Java API. (Same for Arrays.binarySearch())
>
Good advice. I will keep it for the rest of my career:-)

>
>
> Additionally:
> - No need to compare iis != null in finally block, possible NPE would 
> be thrown earlier.
Maybe I'm paranoid, but the check in the finally block is for the scenario
where getResourceAsStream() fails unexpectedly, for example because
uniName.dat is missing; in that case iis might be null. The corresponding
exception has already been caught in my catch block. The current impl
simply prints the exception stack trace. The alternative might be to
throw a fatal error.
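
To make the scenario concrete, here is a minimal sketch of the pattern, not
the actual JDK code. The class name FinallyNullCheckSketch and the helper
load() are hypothetical, and wrapping the resource in an InflaterInputStream
is an assumption made for illustration. When the resource is missing,
getResourceAsStream() returns null, the wrapping constructor throws, and iis
is never assigned, so the finally block still sees null.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical illustration class; not part of the JDK sources.
class FinallyNullCheckSketch {

    // Returns true if the resource was opened and one byte was read,
    // false if anything went wrong (including a missing resource).
    static boolean load(String resourceName) {
        InputStream iis = null;
        try {
            // getResourceAsStream() returns null for a missing resource.
            // The InflaterInputStream constructor then throws a
            // NullPointerException, so iis keeps its initial null value.
            iis = new InflaterInputStream(
                    FinallyNullCheckSketch.class.getResourceAsStream(resourceName));
            iis.read();
            return true;
        } catch (Exception e) {
            // The failure is handled here; the finally block still runs.
            return false;
        } finally {
            try {
                // Without this check, close() on a null reference would
                // throw a second NullPointerException out of finally.
                if (iis != null) {
                    iis.close();
                }
            } catch (IOException ignored) {
                // Nothing useful to do if close() itself fails.
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(load("no-such-resource.dat"));
    }
}
```

With the null check in place, a missing resource is reported once through
the catch block instead of raising a second exception from finally.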

> -
>   Do I understand right, that j.l.Ch.getName('5') would return:
>       "Basic Latin 35"
>   ... but j.l.Ch.getName('0') would return:
>       "DIGIT ZERO..DIGIT NINE"
>   I think both should return:
>       "DIGIT ZERO..DIGIT NINE" (otherwise we don't have to cache that 
> value ;-) )
>   or at least:
>       "Basic Latin U+0035"
j.l.Ch.getName('0')  returns
DIGIT ZERO

j.l.Ch.getName('5')  returns
DIGIT FIVE

The names come from
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt. You will need to
convince the Unicode Consortium if you prefer anything else:-)
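
As a quick check of the two lookups above, here is a small sketch. It assumes
the method is exposed as Character.getName(int codePoint), which is how it
appears in later JDK releases; the class name CharNameSketch and the helper
nameOf() are hypothetical.

```java
// Hypothetical illustration class; assumes Character.getName(int) is available.
class CharNameSketch {

    // Thin wrapper so the lookup can be exercised from other code.
    static String nameOf(int codePoint) {
        return Character.getName(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(nameOf('0')); // DIGIT ZERO
        System.out.println(nameOf('5')); // DIGIT FIVE
    }
}
```

Each digit gets its own name from UnicodeData.txt, so no range-style name
such as "DIGIT ZERO..DIGIT NINE" is returned.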

I will review the other coding style issues later.

-Sherman







