Unicode script support in Regex and Character class
Ulf Zibis
Ulf.Zibis at gmx.de
Tue Apr 27 21:50:02 UTC 2010
On 27.04.2010 19:03, Xueming Shen wrote:
> Ulf Zibis wrote:
>>
>> I'm wondering, as script.txt only has ~120k.
>
> Ulf, you know we are not talking about Unicode script but Unicode
> character name here, right?
> Unicode character name data is stored in UnicodeData.txt, you can find
> it at make/tools/UnicodeData.
Oops, thanks for clearing up my confusion. As UnicodeData.txt isn't part of
your webrev, I mixed up the two.
>
>>
>> - and last but not least, use the existing APIs to reduce the byte code
>> footprint as much as you can. Set a good programming example, as
>> newbies tend to use the API sources as the first template for their own
>> code. Seeing API use cases helps them become familiar with the complexity
>> of the Java API. (Same for Arrays.binarySearch())
>>
> Good advice. I will keep it for the rest of my career:-)
Thanks for your humour. :-D
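To make the Arrays.binarySearch() remark concrete, here is a minimal sketch of a range lookup built on the library call instead of a hand-written search. The class name, the table and the helper blockOf() are hypothetical and not taken from the webrev; only the three block start values are real Unicode data.

```java
import java.util.Arrays;

class BlockLookupSketch {
    // Hypothetical, tiny table: sorted start code points of three Unicode blocks.
    static final int[] STARTS = { 0x0000, 0x0080, 0x0100 };
    static final String[] NAMES = {
        "Basic Latin", "Latin-1 Supplement", "Latin Extended-A"
    };

    // Returns the name of the block whose start is the greatest one <= cp,
    // or null if cp lies below the first start.
    static String blockOf(int cp) {
        int r = Arrays.binarySearch(STARTS, cp);
        // Not found: r == -(insertionPoint) - 1, so the preceding entry is -r - 2.
        int idx = (r >= 0) ? r : -r - 2;
        return (idx >= 0) ? NAMES[idx] : null;
    }

    public static void main(String[] args) {
        System.out.println(blockOf('5'));   // falls into the first range
        System.out.println(blockOf(0x80));  // exact hit on a range start
    }
}
```

The only non-obvious part is decoding the negative return value, which is why seeing it once in API sources is useful.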
>
>>
>>
>> Additionally:
>> - No need to compare iis != null in the finally block; a possible NPE
>> would be thrown earlier.
> Maybe I'm paranoid, but the check in the finally block is for the scenario
> where getResourceAsStream()
> fails unexpectedly, for example because uniName.dat is missing; in that
> case iis might be null. The
> corresponding exception has already been caught in my catch block.
> The current impl simply
> prints out the exception stack trace. The alternative might be to throw a
> fatal error.
Yes, my assumption was that if getResourceAsStream() fails, either an
exception would be raised, or iis would be null and dereferencing it would
raise an NPE anyway, so either way the failure would be caught before the
finally clause comes into play. Maybe I'm wrong about that.
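A minimal sketch of the scenario under discussion, not the webrev code: the class name, the load() helper and the resource name are hypothetical, and it assumes the catch block catches Exception broadly. Class.getResourceAsStream() returns null for a missing resource rather than throwing, so the NPE comes from the first dereference; the finally block still runs afterwards, which is where the null check matters.

```java
import java.io.IOException;
import java.io.InputStream;

class ResourceCloseSketch {
    // Hypothetical helper: tries to read one byte from a classpath resource.
    static String load(String name) {
        InputStream iis = null;
        try {
            // Returns null (no exception) when the resource does not exist.
            iis = ResourceCloseSketch.class.getResourceAsStream(name);
            iis.read();  // throws NullPointerException if iis is null
            return "loaded";
        } catch (Exception e) {  // assumption: broad catch, so the NPE lands here
            return "failed: " + e.getClass().getSimpleName();
        } finally {
            // finally runs even after the catch above; without this check,
            // iis.close() would throw a second NPE out of the method.
            if (iis != null) {
                try {
                    iis.close();
                } catch (IOException ignored) {
                    // nothing sensible to do in this sketch
                }
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(load("/no-such-resource-for-this-sketch.dat"));
    }
}
```

So the check is redundant only if the NPE is allowed to propagate; once it is caught inside the method, the finally block sees iis == null.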
>
>> -
>> Do I understand correctly that j.l.Ch.getName('5') would return:
>> "Basic Latin 35"
>> ... but j.l.Ch.getName('0') would return:
>> "DIGIT ZERO..DIGIT NINE"
>> I think both should return:
>> "DIGIT ZERO..DIGIT NINE" (otherwise we don't have to cache that
>> value ;-) )
>> or at least:
>> "Basic Latin U+0035"
> j.l.Ch.getName('0') returns
> DIGIT ZERO
>
> j.l.Ch.getName('5') returns
> DIGIT FIVE
>
> The name comes from
> http://www.unicode.org/Public/UNIDATA/UnicodeData.txt. You need to
> convince the Unicode consortium if you prefer anything else:-)
The confusion was caused by my mix-up above, thanks.
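For reference, a small sketch of the behaviour described in the quoted reply, assuming a JDK in which Character.getName(int) is available (it shipped in Java 7). The class name is hypothetical; the names themselves come from UnicodeData.txt.

```java
class CharNameSketch {
    public static void main(String[] args) {
        // The char arguments widen to int code points.
        System.out.println(Character.getName('0')); // DIGIT ZERO
        System.out.println(Character.getName('5')); // DIGIT FIVE
    }
}
```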
-Ulf