Unicode script support in Regex and Character class
Xueming Shen
xueming.shen at oracle.com
Sat May 8 21:49:22 UTC 2010
Hi,
The API proposals for Unicode script support below have been approved.
6945564: Unicode script support in Character class
6948903: Make Unicode scripts available for use in regular expressions
Here is the final webrev ready for push.
http://cr.openjdk.java.net/~sherman/6945564_6948903/webrev
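For readers who want a feel for the two RFEs, here is a minimal sketch of how the script support can be used. It assumes the API as it later shipped in JDK 7 (Character.UnicodeScript.of(int) and the \p{IsGreek} / \p{script=Latin} regex syntax); details in this particular webrev may differ. ScriptApiDemo and its method names are illustrative only.

```java
import java.lang.Character.UnicodeScript;
import java.util.regex.Pattern;

// Illustrative only: class and method names are not part of the webrev.
final class ScriptApiDemo {
    // 6945564: look up the script a code point belongs to.
    static UnicodeScript scriptOf(int codePoint) {
        return UnicodeScript.of(codePoint);
    }

    // 6948903: use a script as a regex property (the "Is" prefix form).
    static boolean isAllGreek(String s) {
        return Pattern.matches("\\p{IsGreek}+", s);
    }

    // Same idea with the script=... keyword form.
    static boolean isAllLatin(String s) {
        return Pattern.matches("\\p{script=Latin}+", s);
    }

    public static void main(String[] args) {
        System.out.println(scriptOf('A'));                    // LATIN
        System.out.println(scriptOf(0x03B1));                 // GREEK
        System.out.println(isAllGreek("\u03B1\u03B2\u03B3")); // true
        System.out.println(isAllLatin("abc"));                // true
    }
}
```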
(1) It has been suggested that access to the range data of UnicodeScript
and UnicodeBlock might be desirable for certain use scenarios. For
example, our regex engine might benefit from such access to avoid a
runtime binary search for each and every matching operation. I'm
considering adding a pair of methods, UnicodeScript.is(codePoint) and
UnicodeBlock.is(codePoint), to address this issue, but would prefer to
handle it in a separate RFE. It seems like a no-brainer for
UnicodeBlock, but tricky for UnicodeScript, given how widely the ranges
of many scripts are spread. Any suggestions or alternatives?
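To make the idea in (1) concrete, here is a rough sketch of what a per-script membership test could look like. This is not the proposed implementation: ScriptRanges is a hypothetical helper, and since the range data is not publicly accessible it derives the ranges by scanning Character.UnicodeScript.of(int) once (assuming the JDK 7 API). The point is only that a matcher bound to a single script could then search that script's own, much shorter, range list.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the webrev: caches the code point
// ranges of one script so membership tests only touch that script's data.
final class ScriptRanges {
    private final int[] starts; // inclusive range starts, ascending
    private final int[] ends;   // inclusive range ends, same order

    ScriptRanges(UnicodeScript script) {
        List<int[]> ranges = new ArrayList<int[]>();
        int start = -1; // -1 means "not currently inside a range"
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            boolean in = UnicodeScript.of(cp) == script;
            if (in && start < 0) {
                start = cp;                              // a range opens
            } else if (!in && start >= 0) {
                ranges.add(new int[] {start, cp - 1});   // a range closes
                start = -1;
            }
        }
        if (start >= 0) {
            ranges.add(new int[] {start, Character.MAX_CODE_POINT});
        }
        starts = new int[ranges.size()];
        ends = new int[ranges.size()];
        for (int i = 0; i < starts.length; i++) {
            starts[i] = ranges.get(i)[0];
            ends[i] = ranges.get(i)[1];
        }
    }

    // Binary search over this script's ranges only.
    boolean is(int codePoint) {
        int lo = 0, hi = starts.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (codePoint < starts[mid]) {
                hi = mid - 1;
            } else if (codePoint > ends[mid]) {
                lo = mid + 1;
            } else {
                return true;
            }
        }
        return false;
    }

    int rangeCount() {
        return starts.length;
    }
}
```

A script with a single range degenerates to one comparison pair, while a script scattered over many ranges still needs a search, which is the "tricky" part mentioned above.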
(2) Test results suggest there is not much runtime benefit in keeping a
huge string data pool plus an access hashmap for the getName()
implementation. The latest implementation now takes Ulf's suggestion to
keep a relatively small byte[] pool and generate the names at runtime.
(There is an "even smaller" implementation, which consumes about 300K
of memory at runtime:
http://cr.openjdk.java.net/~sherman/script/webrev.00/
but it has a "scalability" problem that needs to be addressed when the
string pool grows beyond 64k, and it is a little slow.)
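As a rough illustration of the trade-off in (2), the sketch below packs names into one byte[] and creates a String only when a name is requested, instead of holding every name as a String in a hashmap. It is a simplified stand-in, not the data layout used in the webrev: NamePool, its offset table, and the sample names passed to it are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the webrev's layout: names live in a single
// byte[] and a String is materialized only on lookup.
final class NamePool {
    private final byte[] pool;   // all names, concatenated as ASCII
    private final int[] offsets; // entry i spans offsets[i]..offsets[i+1]

    NamePool(String... names) {
        offsets = new int[names.length + 1];
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < names.length; i++) {
            offsets[i] = sb.length();
            sb.append(names[i]);
        }
        offsets[names.length] = sb.length();
        // Names are assumed to be pure ASCII, so one char maps to one byte
        // and the char offsets above are valid byte offsets.
        pool = sb.toString().getBytes(StandardCharsets.US_ASCII);
    }

    // Generates the name at runtime from the byte pool.
    String nameAt(int index) {
        int start = offsets[index];
        int length = offsets[index + 1] - start;
        return new String(pool, start, length, StandardCharsets.US_ASCII);
    }

    int poolSize() {
        return pool.length;
    }
}
```

For example, new NamePool("LATIN SMALL LETTER A", "GREEK SMALL LETTER ALPHA").nameAt(1) rebuilds the second name from 44 pooled bytes. If the offsets were stored as 16-bit values rather than int, the pool could not grow past 64k, which is one way a limit like the one mentioned above can arise.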
(3) The UnicodeScript implementation is built on the Unicode 5.2
Scripts.txt. The rest of the Character class, however, is still using
the previous Unicode version, waiting for Yuka's Unicode 5.2 RFE to get
back in.
(4) The previous webrev can be found at
http://cr.openjdk.java.net/~sherman/scripte
Please help review.
Thanks,
-Sherman