Unicode script support in Regex and Character class

Tue May 11 14:11:58 UTC 2010

SOME of my comments below ARE meant for 
http://cr.openjdk.java.net/~sherman/6945564_6948903/webrev

I marked the others. ;-)

-Ulf

On 11.05.2010 02:05, Xueming Shen wrote:
> Ulf,
>
> My apologies for distracting you with that "smaller size alternative". As
> I said in my previous email,
> please only "review" the bits at
> http://cr.openjdk.java.net/~sherman/6945564_6948903/webrev
>
> It's fine if you are interested in the stuff I experimented with at
> http://cr.openjdk.java.net/~sherman/script/webrev.00
> but please keep it separated from the code I'm proposing to putback.
>
> -Sherman
>
>
> Ulf Zibis wrote:
>> Some additional thoughts:
>>
>> *EXPERIMENTAL*  - out.writeShort((short)(num & 0xffff)); ---short 
>> form--->  out.writeShort((short)num);
>> - use Arrays.binarySearch() in Character.UnicodeBlock.of().
>> *EXPERIMENTAL*  - "if (notFirst)" could be dropped if you first
>> appended the first word to sb outside the while loop.
>> *EXPERIMENTAL*  - StringBuilder sb could be initialized with the
>> maximum name length (=83) to avoid resizing;
>> *EXPERIMENTAL*  - could we reuse the same StringBuilder for multiple
>> invocations of Character.getName(cp)?
>> *EXPERIMENTAL*  -- make CharacterName.get(cp) an instance method and
>> save the CharacterName object as a ThreadLocal used by Character.getName(cp).
>> *EXPERIMENTAL*  -- synchronize Character.getName(cp).
>> *EXPERIMENTAL*  - Instead of using StringBuilder we could use
>> ByteBuffer, omit the char[] and build the final String by new
>> String(bb.array(), "ASCII").
>> *EXPERIMENTAL*  -- saves the char[] for the pool, which is twice as big.
>> *EXPERIMENTAL*  -- I imagine ByteBuffer would perform better than
>> StringBuilder.
>> - save UnicodeBlocks, BlockStarts and scriptStarts in a file instead
>> of statically in the class file.
>> -- e.g. init of scriptStarts is a big waste of byte code (7/11 bytes 
>> per short/integer entry).
>>
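The Arrays.binarySearch() suggestion above can be sketched roughly as follows. This is only an illustration, not the code in either webrev: the class name BlockLookup, the STARTS/NAMES arrays and the use of plain strings in place of Character.UnicodeBlock objects are assumptions made for the example, and the table holds just the first few blocks. The idea is a "floor" lookup: find the last range start that is less than or equal to the code point.

```java
import java.util.Arrays;

// Hypothetical sketch: look up the range containing a code point by
// binary search over a sorted array of range starts.
final class BlockLookup {
    // Sample data only: starts of the first few Unicode blocks. The last
    // entry marks the end of this small table and maps to null.
    static final int[] STARTS = {0x0000, 0x0080, 0x0100, 0x0180, 0x0250};
    static final String[] NAMES = {
        "BASIC_LATIN", "LATIN_1_SUPPLEMENT", "LATIN_EXTENDED_A",
        "LATIN_EXTENDED_B", null
    };

    static String of(int codePoint) {
        if (codePoint < 0) {
            throw new IllegalArgumentException("negative code point");
        }
        int i = Arrays.binarySearch(STARTS, codePoint);
        if (i < 0) {
            // Not an exact start: binarySearch returns -(insertionPoint) - 1,
            // so the enclosing range is at insertionPoint - 1.
            i = -i - 2;
        }
        return NAMES[i];
    }

    public static void main(String[] args) {
        System.out.println(of('A'));
        System.out.println(of(0x00E9));
    }
}
```

Because STARTS[0] is 0, every non-negative code point has an insertion point of at least 1, so the computed index never goes below 0.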
>> On 08.05.2010 23:49, Xueming Shen wrote:
>>> Hi,
>>>
>>> The API  proposals for Unicode script support below have been approved.
>>>
>>> 6945564: Unicode script support in Character class
>>> 6948903: Make Unicode scripts available for use in regular expressions
>>>
>>> (2) Test results suggest there is not much runtime benefit in
>>> keeping a huge string
>>> data pool + an access hashmap for the getName() implementation. The
>>> latest implementation now
>>> takes Ulf's suggestion to keep a relatively small byte[] pool and
>>> generate the names at runtime.
>>> (There is an "even smaller" implementation, which consumes about 300K of
>>> memory at runtime,
>>> http://cr.openjdk.java.net/~sherman/script/webrev.00/
>>> but it has a "scalability" problem that needs to be addressed when the
>>> string pool grows beyond 64k, and it
>>> is a little slow.)
>>
>> I'm investigating that.
>> For a start, my string pool has a size of only 35243.
>>
>> -Ulf
>>
>>
>>
>
>
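
For reference, three of the smaller suggestions quoted above (the shorter writeShort cast, dropping the "if (notFirst)" flag, and presizing the StringBuilder) can be illustrated with a minimal sketch. The class and method names here are hypothetical and do not come from the webrev; the value 83 is simply the maximum name length mentioned in the thread.

```java
// Hypothetical sketch of three suggestions from the thread.
final class NameSketch {
    // Maximum name length as quoted in the thread; taken as given here.
    static final int MAX_NAME_LENGTH = 83;

    // Joins words with single spaces. The first word is appended before
    // the loop, so no "notFirst" flag is needed inside it.
    static String join(String[] words) {
        if (words.length == 0) {
            return "";
        }
        // Presized so that appends for names up to 83 chars never resize.
        StringBuilder sb = new StringBuilder(MAX_NAME_LENGTH);
        sb.append(words[0]);
        for (int i = 1; i < words.length; i++) {
            sb.append(' ').append(words[i]);
        }
        return sb.toString();
    }

    // The int-to-short narrowing cast already keeps only the low 16 bits,
    // so masking with 0xffff first does not change the result.
    static boolean maskIsRedundant(int num) {
        return (short) (num & 0xffff) == (short) num;
    }

    public static void main(String[] args) {
        System.out.println(join(new String[] {"LATIN", "SMALL", "LETTER", "A"}));
        System.out.println(maskIsRedundant(0x12345));
    }
}
```

Whether the presizing or the flag removal is measurable in practice would need benchmarking; the sketch only shows that the behavior stays the same.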