<i18n dev> Unicode support in class Character
Tom Christiansen
tchrist at perl.com
Thu Jan 20 13:18:22 PST 2011
Sherman wrote:
> So even certain Unicode Properties are not yet supported by
> Java RegEx, it does not means they are not supported by the
> platform, you should be able to access those Unicode properties
> via java.lang.Character class.
Sherman, you're 100% right about that.
One case in point is the bidirectional European number separator
property of characters. That property is available in Java via
Character.getDirectionality(int codePoint)
== DIRECTIONALITY_EUROPEAN_NUMBER_SEPARATOR
That property is not available for use in regexes, meaning
that you can use neither the long form
\p{Bidi_Class=European_Separator}
nor the short form
\p{Bc=ES}
within your patterns. This is not necessarily a show-stopper,
although it does constrain the ways you approach these problems:
you cannot use regular expressions to test for that property.
That is not always a big deal, though.
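For instance, here is a minimal sketch of doing that test by hand with
Character instead of with a pattern. The class name BidiScan and the
method name europeanSeparatorOffsets are made up for illustration; only
the Character and String calls are standard. Which characters come back
depends on the Unicode version your JDK ships with.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library: finds European number
// separators (Bidi_Class=ES) without using a regex.
class BidiScan {

    // Returns the char offsets of every code point whose directionality
    // is DIRECTIONALITY_EUROPEAN_NUMBER_SEPARATOR.
    static List<Integer> europeanSeparatorOffsets(String text) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            // Read a full code point so supplementary characters work.
            int cp = text.codePointAt(i);
            if (Character.getDirectionality(cp)
                    == Character.DIRECTIONALITY_EUROPEAN_NUMBER_SEPARATOR) {
                offsets.add(i);
            }
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(cp);
        }
        return offsets;
    }

    public static void main(String[] args) {
        System.out.println(europeanSeparatorOffsets("3+4-5"));
    }
}
```

It is more typing than \p{Bc=ES} would be, but it does get at the same
property data that the platform already has.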
With respect to the standard Java class Character alone --
without regard to regular expressions at all -- please
compare the Unicode functionality provided by
http://icu-project.org/apiref/icu4j/com/ibm/icu/lang/UCharacter.html
with that provided by the (soon to be) standard
http://download.java.net/jdk7/docs/api/java/lang/Character.html
Because the ICU library has post-3.0 Unicode support not found in Java
proper, it is especially worth looking at closely. You "just" have to use
their UCharacter class to get it, not the standard Character class.
I may be wrong about all this--I really wish I were!--but looking over
the many significant improvements in ICU's UCharacter class over the
standard Java Character class, it really and truly looks to me like
Java last fully considered Unicode way, way back at its UCD 3.0
release in the year 2000. That is a *very* long time ago in so-called
"Internet generations"!
I do not mean to give any offence in saying this: it's just what the
situation seems to be. Look it over and see whether you don't come to the
same conclusion. As I said, I wish I were wrong. I can point out specific
differences if you would like, but I think folks familiar with the
problem-space will spot them on their own readily enough.
What I would most dearly love to see is Java brought fully up to date
so that its basic Character class supports whatever the current Unicode
release happens to be. Wouldn't that be great?
I do understand that this is much too much work to be done by one person
alone, or in a short timespan: I certainly don't think it should be
rushed. I believe it should be a *goal*, albeit in my humble opinion an
important goal. Time is marching on, and it will be easier to catch up
to future Unicode releases once Java catches up to whatever the current
Unicode release is.
That is, I understand that Unicode 3.0 -> 6.0 is a big jump, one requiring
quite a bit of real work. But once that happens, something like Unicode
6.0 -> 6.1 should be much easier.
--tom
PS: I'm trying to keep these messages to under 100 lines each.