Unicode script support in Regex and Character class

Xueming Shen xueming.shen at oracle.com
Sat May 8 21:49:22 UTC 2010


Hi,

The API proposals for Unicode script support below have been approved.

6945564: Unicode script support in Character class
6948903: Make Unicode scripts available for use in regular expressions
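
For anyone who has not followed the two RFEs, here is a minimal sketch of
how they look from the caller's side. It assumes the API in the form it
later shipped in JDK 7 (Character.UnicodeScript.of(int) and the
\p{IsGreek} script syntax in java.util.regex); the class ScriptDemo and
its helper methods are made up for illustration only, and the webrev
remains the authority on the exact signatures.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the webrev.
class ScriptDemo {

    // 6945564: look up the script a code point belongs to.
    static Character.UnicodeScript scriptOf(int codePoint) {
        return Character.UnicodeScript.of(codePoint);
    }

    // 6948903: match on script inside a regular expression.
    // \p{IsGreek} selects the Greek script (not the Greek block).
    static boolean allGreek(String s) {
        return Pattern.matches("\\p{IsGreek}+", s);
    }

    public static void main(String[] args) {
        System.out.println(scriptOf('A'));        // script of LATIN CAPITAL LETTER A
        System.out.println(scriptOf(0x03B1));     // script of GREEK SMALL LETTER ALPHA
        System.out.println(allGreek("\u03b1\u03b2\u03b3"));
        System.out.println(allGreek("abc"));
    }
}
```

The same script can also be written as \p{script=Greek} or \p{sc=Greek}
in a pattern.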

Here is the final webrev ready for push.

http://cr.openjdk.java.net/~sherman/6945564_6948903/webrev

(1) It has been suggested that access to the range data of UnicodeScript
and UnicodeBlock might be desirable for certain use scenarios. For
example, our regex engine might benefit from such access by avoiding a
runtime binary search on every matching operation. I'm considering
adding a pair of methods, UnicodeScript.is(codePoint) and
UnicodeBlock.is(codePoint), to address this issue, but I prefer to
handle it in a separate RFE. It seems like a no-brainer for
UnicodeBlock, but tricky for UnicodeScript, given the large number of
ranges that many scripts cover. Any suggestions or alternatives?
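
To make the trade-off in (1) concrete, here is a rough sketch of the two
lookup styles. It is not the code in the webrev: the class
RangeLookupSketch, the tiny STARTS/LABELS table (truncated to the ASCII
range) and the method names are all hypothetical. A block is one
contiguous range, so a membership test is a pair of comparisons; a
script lookup over a sorted table of run starts needs a binary search.

```java
import java.util.Arrays;

// Hypothetical sketch; names and table contents are illustrative only.
class RangeLookupSketch {

    // Start code points of consecutive runs, sorted ascending.
    // Heavily truncated: only covers U+0000..U+007F.
    static final int[] STARTS = {0x0000, 0x0041, 0x005B, 0x0061, 0x007B};
    // Script label of the run that begins at the matching STARTS entry.
    static final String[] LABELS = {"COMMON", "LATIN", "COMMON", "LATIN", "COMMON"};

    // Generic lookup: binary search for the run containing cp.
    // Assumes 0 <= cp <= 0x7F because the table stops there.
    static String lookup(int cp) {
        int i = Arrays.binarySearch(STARTS, cp);
        if (i < 0) {
            // Not an exact run start: binarySearch returned
            // -(insertionPoint) - 1, and the run we want is the one
            // just before the insertion point.
            i = -i - 2;
        }
        return LABELS[i];
    }

    // Block-style membership test: a single contiguous range
    // (Basic Latin, U+0000..U+007F), so no search is needed.
    static boolean isInBasicLatinBlock(int cp) {
        return cp >= 0x0000 && cp <= 0x007F;
    }
}
```

A per-script is(codePoint) would have to check every range of that
script, which is where the UnicodeScript case gets harder than the
UnicodeBlock one.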

(2) Test results suggest there is not much runtime benefit in keeping a
huge string data pool plus an access hashmap for the getName()
implementation. The latest implementation takes Ulf's suggestion to
keep a relatively small byte[] pool and generate the names at runtime.
(There is an "even smaller" implementation, which consumes about 300K
of memory at runtime:
http://cr.openjdk.java.net/~sherman/script/webrev.00/
but it has a "scalability" problem that needs to be addressed when the
string pool grows beyond 64k, and it is a little slow.)

(3) The UnicodeScript implementation is built on the Unicode 5.2
Scripts.txt. The rest of the Character class, however, is still using
the previous Unicode version while waiting for Yuka's Unicode 5.2 RFE
to get back in.

(4) The previous webrev can be found at
http://cr.openjdk.java.net/~sherman/scripte

Please help review.

Thanks,
-Sherman





More information about the core-libs-dev mailing list