Unicode script support in Regex and Character class

Ulf Zibis Ulf.Zibis at gmx.de
Tue Apr 27 01:03:04 UTC 2010


On 27.04.2010 00:01, Xueming Shen wrote:
> Ulf Zibis wrote:
>> I would like to see the full names redundantly in the aliases map. 
>> Needs only ~100 * (4 + 4) bytes in HashMap<String, Character>.
> This is an implementation detail; we can defer the difference for now.

I said that with the alternative of UnicodeScript as a _normal class_ in 
mind, in case saving the redundant internal hash map matters.
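To illustrate the idea of keeping full names redundantly next to the aliases, here is a minimal sketch. It assumes the Character.UnicodeScript enum as it later shipped in Java 7; the class ScriptAliases, the map NAMES and the method lookup are hypothetical names for illustration only, not the JDK's internal implementation, and only a handful of the 4-letter codes are registered.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of the JDK.
class ScriptAliases {
    // One table holding both the 4-letter codes and the full names,
    // so a single lookup handles either spelling.
    private static final Map<String, Character.UnicodeScript> NAMES = new HashMap<>();

    static {
        // A few 4-letter aliases (the real list has roughly 100 entries).
        NAMES.put("LATN", Character.UnicodeScript.LATIN);
        NAMES.put("GREK", Character.UnicodeScript.GREEK);
        NAMES.put("CYRL", Character.UnicodeScript.CYRILLIC);
        NAMES.put("ZYYY", Character.UnicodeScript.COMMON);
        NAMES.put("ZZZZ", Character.UnicodeScript.UNKNOWN);

        // Full names registered redundantly in the same map.
        for (Character.UnicodeScript s : Character.UnicodeScript.values()) {
            NAMES.put(s.name(), s);
        }
    }

    // Case-insensitive lookup by alias or full name.
    static Character.UnicodeScript lookup(String name) {
        Character.UnicodeScript s = NAMES.get(name.toUpperCase(Locale.ROOT));
        if (s == null) {
            throw new IllegalArgumentException("unknown script name: " + name);
        }
        return s;
    }
}
```

With this layout, lookup("Latn") and lookup("latin") resolve to the same constant through one map, at the cost of one extra entry per script.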

>
>> UnicodeScript:
>> I think there should be some more words in the javadoc about the 
>> correlation/use case/advantage of UnicodeScript compared to 
>> UnicodeBlock.
>
> Martin raised the same comment. But I still believe 
> j.l.C.UnicodeScript simply defines the syntax of the Unicode script name
> in the Java libraries, it does not try to interpret/implement anything 
> further at semantics level. It just serves as an ID for the
> Unicode script name, so it'd be better to leave the semantics 
> definition/explanation to the TR#24.

Yes, for the semantics definition/explanation of a Unicode script name, 
the user should refer to TR#24.
But they might like to be briefly informed about how UnicodeBlock 
differs in semantics and use case, and about its disadvantages.
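The kind of difference a short javadoc note could mention can be sketched as follows. This assumes the Character.UnicodeScript and Character.UnicodeBlock APIs as they later shipped in Java 7; the class ScriptVsBlock and its helper methods are hypothetical. A block is a contiguous code point range, while a script groups characters by writing system, so letters of one script can be spread over several blocks, and one block can hold characters of several scripts.

```java
// Hypothetical demo class, not part of the JDK.
class ScriptVsBlock {
    // True if both code points belong to the same Unicode script.
    static boolean sameScript(int a, int b) {
        return Character.UnicodeScript.of(a) == Character.UnicodeScript.of(b);
    }

    // True if both code points lie in the same Unicode block.
    static boolean sameBlock(int a, int b) {
        return Character.UnicodeBlock.of(a) == Character.UnicodeBlock.of(b);
    }

    public static void main(String[] args) {
        int a = 'a';          // U+0061
        int eAcute = 0x00E9;  // U+00E9 LATIN SMALL LETTER E WITH ACUTE
        int digit = '1';      // U+0031

        // Same script (LATIN), but different blocks.
        System.out.println(sameScript(a, eAcute) + " " + sameBlock(a, eAcute));

        // Same block (BASIC_LATIN), but different scripts (LATIN vs COMMON).
        System.out.println(sameScript(a, digit) + " " + sameBlock(a, digit));
    }
}
```

So a test like "is this text Latin?" is better answered by script than by block, which is the use case the javadoc could point out.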

>
>
>> I would like to have the 3 special cases INHERITED, COMMON and 
>> UNKNOWN together at the beginning or end of the enum list.
>
> Why? Since the current list is generated by a script from 
> Scripts.txt, it is in the same order as the entries in 
> Scripts.txt. Is there any particular reason they should be 
> listed differently? We do have the
> links at the beginning already. I don't see any advantage in putting 
> them physically together.

Someone might find it useful to write, for example,
     if (script.compareTo(UnicodeScript.LATIN) < 0)
to filter out the special cases easily.
The same might be considered for SURROGATE, PRIVATE_USE and UNASSIGNED.
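A sketch of both ways to filter the special cases, under stated assumptions: the enum Script below is hypothetical and only mirrors the proposed ordering (special values first); it is not the JDK's declaration order. Java enums cannot be compared with the < operator, so the ordering test goes through compareTo, which compares declaration order. The second method works with the real Character.UnicodeScript (as later shipped in Java 7) regardless of how its constants are ordered.

```java
import java.util.EnumSet;

// Hypothetical demo class, not part of the JDK.
class SpecialScripts {
    // Hypothetical enum with the special values declared first.
    enum Script { COMMON, INHERITED, UNKNOWN, LATIN, GREEK, CYRILLIC }

    // Order-based filter: relies on the special values preceding LATIN.
    static boolean isSpecialByOrder(Script s) {
        return s.compareTo(Script.LATIN) < 0;
    }

    // Order-independent filter on the real enum.
    private static final EnumSet<Character.UnicodeScript> SPECIAL = EnumSet.of(
            Character.UnicodeScript.COMMON,
            Character.UnicodeScript.INHERITED,
            Character.UnicodeScript.UNKNOWN);

    static boolean isSpecial(Character.UnicodeScript s) {
        return SPECIAL.contains(s);
    }
}
```

The order-based variant is shorter but breaks silently if the generated list is ever reordered; the EnumSet variant does not depend on the order in Scripts.txt.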

-Ulf




