<i18n dev> RL1.1 Hex Notation (part 3 of 3)

Tom Christiansen tchrist at perl.com
Sat Jan 22 21:41:36 PST 2011


Sherman wrote:

> As of the Unicode support in j.l.Character class,

>> What I would most dearly love to see is Java brought fully up to date
>> so that its basic Character class supports whatever the current Unicode
>> release happens to be.  Wouldn't that be great?

> The Java Language Specification clearly specifies in [2] that the Java
> platform tracks the Unicode specification as it evolves. The upcoming JDK7
> will base its character data on Unicode 6.0.  So the Java platform IS fully
> up to date with the Unicode Standard, as its specification requires, but
> that does not necessarily mean it has to support "whatever" Unicode offers
> in new releases.

Your point there is one of those things that I find tricky to understand.

Being "fully up to date with the Unicode standard" appears to mean
different things to the two of us.  What, quite specifically, does
it mean to you?

I've just spent most of the day reading up on conformance issues.
There is quite a bit in the Unicode 6.0 conformance chapter:

    http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf

One thing I noticed there is that the first requirement in
§3.2 Conformance Requirements is something I brought up in part 2:

    C1 A process shall not interpret a high-surrogate code point 
       or a low-surrogate code point as an abstract character.

That *seems* to agree with me that it is incorrect for the regex engine to
allow either a high or a low surrogate in isolation to match as though it
were an abstract character.  If you recall, my example was that I did not
feel that Java should allow "\uD83D" to match "^.$".  
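
The behaviour in question can be checked with a minimal sketch like the
one below.  The class and method names (LoneSurrogateDemo,
matchesSingleDot) are made up for illustration; only java.util.regex
and Character.isSurrogate come from the JDK (the latter needs Java 7 or
later).  As described above, the lone high surrogate is expected to
match, just as a well-formed surrogate pair does.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of any JDK or of the original message.
class LoneSurrogateDemo {
    // "." matches one code point other than a line terminator.
    static final Pattern SINGLE = Pattern.compile("^.$");

    // True if the whole input is matched by "^.$".
    static boolean matchesSingleDot(String s) {
        return SINGLE.matcher(s).matches();
    }

    public static void main(String[] args) {
        String loneHigh = "\uD83D";       // unpaired high surrogate
        String pair = "\uD83D\uDE00";     // well-formed pair for U+1F600

        System.out.println("lone high matches: " + matchesSingleDot(loneHigh));
        System.out.println("pair matches:      " + matchesSingleDot(pair));
        System.out.println("is surrogate:      "
                + Character.isSurrogate(loneHigh.charAt(0)));
    }
}
```

Whether the first result should be allowed is exactly the C1 question;
the sketch only makes the behaviour easy to observe.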

However, there may be *some* wiggle-room here.  At least, it is not
completely obvious to me.  See the long discussion under C10 about what
to do with code unit sequences that are ill-formed for a particular
encoding form.
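
For a Java String, "ill-formed for the encoding form" comes down to
unpaired surrogates in the UTF-16 code units.  The sketch below shows
one way a process might detect that condition before deciding what to
do about it.  The names (SurrogateCheck, isWellFormedUtf16) are
hypothetical, and the sketch takes no position on what policy C10
calls for once an ill-formed sequence is found.

```java
// Hypothetical helper; only the Character.is*Surrogate calls are JDK API.
class SurrogateCheck {
    // True if every high surrogate is immediately followed by a low
    // surrogate and no low surrogate appears on its own.
    static boolean isWellFormedUtf16(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                boolean paired = i + 1 < s.length()
                        && Character.isLowSurrogate(s.charAt(i + 1));
                if (!paired) {
                    return false;   // high surrogate without its low half
                }
                i++;                // skip the low half of the pair
            } else if (Character.isLowSurrogate(c)) {
                return false;       // low surrogate with no preceding high
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isWellFormedUtf16("\uD83D"));        // lone high
        System.out.println(isWellFormedUtf16("\uD83D\uDE00"));  // valid pair
    }
}
```

A regex engine could apply a check of this kind to its input, or to
each position it examines, and then treat a failure however its
reading of the conformance clauses requires.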

In any event, we're getting to the important part now: what the Unicode
Standard really means.  There is certainly a heck of a lot of stuff in
Unicode, and just because a platform doesn't implement every little bit of
it does *not* mean that that platform is somehow non-conformant.  Perhaps
that's what you were saying, Sherman.

The next message will be mostly about tr18's RL1.2 Properties, which is 
where I feel the most serious problems exist.  There are two main problems:

  #1: Several key properties required for a conforming
      implementation are missing.

  #2: You use certain Unicode-defined property names but assign them
      meanings different than what Unicode says you must give them.

--tom


