<i18n dev> RL1.1 Hex Notation

Xueming Shen xueming.shen at oracle.com
Fri Jan 21 12:25:14 PST 2011


Tom,

Introducing the new Perl-style \x{...} as a hexadecimal notation appears to
be a nice-to-have enhancement (I will file an RFE to put this request on
record). But I don't think you can simply claim that the Java Unicode escape
sequences for UTF-16 [1] are NOT a "mechanism"/notation for specifying any
Unicode code point in Java RegEx, in which two consecutive Unicode escapes
that represent a legal UTF-16 surrogate pair are interpreted as the
corresponding supplementary code point.
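
To make the surrogate-pair behaviour concrete, here is a small sketch. The
class and method names are made up for illustration, and it assumes a release
in which the regex parser combines two consecutive escapes into one code
point (7 and later, as far as I know). It checks that the pair of escapes for
U+1F47E acts as a single member of a character class, and that "." consumes
the whole pair.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only Pattern and Character come from the JDK.
final class SurrogateEscapeDemo {
    // U+1F47E ALIEN MONSTER, stored in a String as a UTF-16 surrogate pair.
    static final String ALIEN = new String(Character.toChars(0x1F47E));

    // The double backslashes hand the escapes to the regex parser instead of
    // javac. Inside a character class the pair must count as ONE member for
    // the whole two-char input to match.
    static boolean escapesFormOneCodePoint() {
        return Pattern.matches("[\\uD83D\\uDC7E]", ALIEN);
    }

    // "." matches one code point, so it consumes both surrogates.
    static boolean dotConsumesWholePair() {
        return Pattern.matches(".", ALIEN);
    }

    public static void main(String[] args) {
        System.out.println(escapesFormOneCodePoint());
        System.out.println(dotConsumesWholePair());
    }
}
```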

TR#18 explains the purpose of the hex notation requirement as "The character
set used by the regular expression writer may not be Unicode, or may not have
the ability to input all Unicode code points from a keyboard." As long as the
notation mechanism provided by Java RegEx can serve this purpose, even if it
is not as perfect/direct in some cases as you would prefer, I would not
conclude that Java RegEx cannot claim "conformance" to the TR.

Regarding your comment
-------------------------------------------------

But that is in clear violation of what Level 1 must provide:

     Level 1: Basic Unicode Support. At this level, the regular expression
     engine provides support for Unicode characters as basic logical units.
     (This is independent of the actual serialization of Unicode as UTF-8,
     UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE.) This is a minimal level for
     useful Unicode support.

---------------------------------------------------

My interpretation of the above note is that, in order to claim "basic Unicode
support", the regex engine needs to handle each Unicode character as a basic
logical unit (code point), no matter what its underlying/internal
representation is. In the case of UTF-16, which Java String uses as its
internal form, this means the regex engine needs to work on a surrogate pair
as one supplementary character, instead of treating it as two separate
surrogates. This is what the Java RegEx engine does. In fact, the "first
thing" the engine does (after normalizing the pattern, if required) is to
"translate" the input regex pattern from String (UTF-16) into code point form
in an int[], where each int in the array represents a Unicode code point
value; internally the engine works on code point values. (If you use a double
backslash to bypass the javac compiler's interpretation, the surrogate pair to
code point conversion happens a little later, at the node-tree build stage.
We might have a bug in earlier releases, but it should have been fixed in 7,
if not in JDK 6.) So yes, the Java RegEx engine works on Unicode code points
(as the logical unit), not UTF-16 code units.
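
As an illustration of that translation step (not the engine's actual code;
the class and method names below are made up), a String can be turned into an
int[] of code points with the public String and Character APIs roughly like
this:

```java
// Hypothetical helper showing a UTF-16 to code point translation.
final class CodePointView {
    // Returns one int per Unicode code point in s; a valid surrogate pair
    // becomes a single supplementary code point value.
    static int[] toCodePoints(String s) {
        int[] cps = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            cps[n++] = cp;
            i += Character.charCount(cp); // advance 1 or 2 UTF-16 code units
        }
        return cps;
    }

    public static void main(String[] args) {
        // 'a', U+1F47E (as a surrogate pair), 'b'
        String s = "a" + new String(Character.toChars(0x1F47E)) + "b";
        int[] cps = toCodePoints(s);
        System.out.println(s.length() + " code units, " + cps.length + " code points");
    }
}
```

The input here has four UTF-16 code units but only three code points, which
is the view the engine works with.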

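As for the Unicode support in the j.l.Character class, regarding your comment
-------------------------------------------------

What I most dearly love to see Java would be brought fully up to date
so that its basic Character class supports whatever the current Unicode
release happens to be.  Wouldn't that be great?

---------------------------------------------------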

The Java Language Specification clearly specifies in [2] that the Java
platform tracks the Unicode specification as it evolves. The upcoming JDK 7
will base its character data on Unicode 6.0. So the Java platform IS fully up
to date with the Unicode Standard, as its specification requires, but that
does not necessarily mean it has to support "whatever" Unicode offers or adds
in new releases. The j.l.Character class has been evolving over the years to
add more and more of the Unicode support that people agreed is most useful
for Java developers (the Java API doc has a "since x.y" notation for each
method to indicate the release in which it was added). But again, that does
not mean we are going to add "all" as you suggested; lots of factors need to
be evaluated to decide whether something should be added to the core library
or is better left to a third-party/specialist package. If there is anything
specific you believe should be in j.l.Character but is not there yet, please
file an RFE and we can start from there.

-Sherman


[1] http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#100850
[2] http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#95413

On 1-21-2011 7:16 AM, Tom Christiansen wrote:
> Here's the first requirement that must be met to claim Level 1 compliance.
>
> Java does not yet meet this requirement, but it could easily do so: indeed,
> my own regex-rewriting library implements this requirement.  It takes very
> little code at all and is *completely* backwards compatible because it uses
> a syntax that was previously illegal.
>
> This guarantees there can have been no existing valid Java regex whose
> behaviour would be altered by this particular implementation of the fix.
>
> I personally believe this property of not changing how existing code
> behaves to be critical--although I also recognize alternate viewpoints that
> either consider it to be merely "very important" instead of "critical",
> or indeed which instead hold it to be an "indispensable requirement".
>
> Details below.
>
>     http://www.unicode.org/reports/tr18/#Hex_notation
>
>     +--------------------------
>     | RL1.1 Hex Notation
>     |
>     | To meet this requirement, an implementation shall supply a mechanism
>     | for specifying any Unicode code point (from U+0000 to U+10FFFF)
>     +--------------------------
>
> Java allows you to specify code points from U+0000 to U+FFFF, not to
> U+10FFFF.  You can specify code points only from Plane 0, not code points
> from Plane 1-16.  This is true of both the lexical substitution prepass
> in which "\uXXXX" is replaced by the corresponding code point as well
> as during the regex compilation itself, during which "\\uXXXX" is recognized
> as a code point from Plane 0.
>
> Note that the first of those two, the lexical analysis phase, is insufficient
> to meet the requirement even for Plane 0, and so cannot be counted.  This
> is because one must be able to specify *ANY* Unicode code point, but you
> cannot do that with \uXXXX even in Plane 0 because certain code points
> are forbidden, such as \u000A for line feed, \u0022 for double quote, and
> \u005C for backslash.  Those are all treated as syntactic elements of Java
> itself, not as the listed Plane 0 code point.
>
> Which leaves us with the \\uXXXX notation.  That indeed allows
> you to specify any code point *IN PLANE 0*.  It does not permit
> you to specify any  Unicode code point directly per the requirement.
>
> Specifying single Unicode code points using indirect UTF-16 notation no
> more suffices for this requirement than doing so using indirect UTF-8
> notation.  Imagine if one had to specify Unicode code points using UTF-8
> instead of UTF-16; I'll use the "\xXX" notation for this.  That means that
> U+A3, POUND SIGN, would need to be specified as "\xC2\xA3".
>
> But that is in clear violation of what Level 1 must provide:
>
>      Level 1: Basic Unicode Support. At this level, the regular expression
>      engine provides support for Unicode characters as basic logical units.
>      (This is independent of the actual serialization of Unicode as UTF-8,
>      UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE.) This is a minimal level for
>      useful Unicode support.
>
> This hypothetical example fails to meet that requirement
> because you are no longer dealing with Unicode characters as
> basic logical units.  You are forced to deal with serialization
> issues of UTF-8.
>
> I believe that hypothetical example is a clear violation of the
> most basic Level 1 requirement, so I believe it necessarily
> therefore follows that changing UTF-8 to UTF-16 must also fail to
> meet that requirement.
>
> For example, consider U+1F47E, ALIEN MONSTER.  Java requires you to write
> that as "\uD83D\uDC7E" for the preprocessing step or as "\\uD83D\\uDC7E" for
> the regex compiler.  That breaks the requirement just as surely as does
> the equivalent UTF-8 encoding, "\xF0\x9F\x91\xBE", because both make you
> consider serialization issues instead of logical code points.
>
> Please do not be distracted that this example uses an Emoji code point from
> Unicode 6.0.  I did choose that one for its cuteness effect, but the same
> problem applies to all non-Plane 0 code points, not just for the silly
> ones or the new ones.  For example, Unicode 4.0 introduced code point
> U+1033C, GOTHIC LETTER MANNA, which is not a "silly" or "cute" code point;
> there are *very* many other such, too.
>
> How to fix this to bring Java into compliance?  It's actually quite easy.
> However, you will *not* be able to fulfill this requirement by adopting
> the syntax UXXXXX or \UXXXXXXX, because that syntax is already taken by
> the regex compiler for toupper() case translation.
>
> The standard mentions that as an alternative to \uXXXX or \UXXXXXXX, the
> notation \x{XXXX}, and this is what I have elected to implement in my
> regex-rewriting library.  Even though I show four X's there, you may use
> any number of them, thus allowing you to write \x{A3}, \x{2019}, \x{1033C},
> or even \x{10FFFF}.  This now meets the requirement of being able to specify
> any Unicode code point between U+0000 and U+10FFFF.
>
> --tom
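
For reference, the \x{...} rewriting described in the quoted message could
look roughly like the sketch below. This is not Tom's library, whose code is
not shown in this thread; the class name HexEscapeRewriter and the method
rewrite are made up for illustration. It assumes that turning each \x{H...}
into one UTF-16 escape (or two, for a supplementary code point) is an
acceptable translation, and it deliberately ignores \Q...\E quoting.

```java
// Hypothetical rewriter: a sketch, not an existing library or JDK API.
final class HexEscapeRewriter {
    // Rewrites every \x{H...} in a regex into backslash-u escapes that the
    // regex parser already understands. All other text is copied unchanged.
    static String rewrite(String regex) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < regex.length()) {
            char c = regex.charAt(i);
            if (c != '\\' || i + 1 >= regex.length()) {
                out.append(c); // ordinary char, or a lone trailing backslash
                i++;
                continue;
            }
            int close = regex.indexOf('}', i + 3);
            if (regex.startsWith("x{", i + 1) && close > i + 3) {
                // Parse the hex digits between the braces.
                int cp = Integer.parseInt(regex.substring(i + 3, close), 16);
                // toChars rejects values outside U+0000..U+10FFFF and
                // yields one or two UTF-16 code units.
                for (char unit : Character.toChars(cp)) {
                    out.append(String.format("\\u%04X", (int) unit));
                }
                i = close + 1;
            } else {
                // Any other escape, including an escaped backslash, is
                // copied as a two-char unit so it is never misread.
                out.append(c).append(regex.charAt(i + 1));
                i += 2;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(rewrite("\\x{A3}"));
        System.out.println(rewrite("\\x{1F47E}"));
    }
}
```

The rewritten string can then be passed to Pattern.compile as usual. Emitting
escapes rather than the literal characters keeps a code point such as
\x{2A} ("*") from being read as a metacharacter.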


