<i18n dev> RL1.1 Hex Notation

Tom Christiansen tchrist at perl.com
Fri Jan 21 07:16:29 PST 2011


Here's the first requirement that must be met to claim Level 1 compliance.

Java does not yet meet this requirement, but it could easily do so: indeed,
my own regex-rewriting library implements this requirement.  It takes very
little code at all and is *completely* backwards compatible because it uses
a syntax that was previously illegal.

This guarantees there can have been no existing valid Java regex whose
behaviour would be altered by this particular implementation of the fix.

I personally believe this property of not changing how existing code
behaves to be critical--although I also recognize alternate viewpoints that
either consider it to be merely "very important" instead of "critical",
or indeed which instead hold it to be an "indispensable requirement".

Details below.

   http://www.unicode.org/reports/tr18/#Hex_notation

   +--------------------------
   | RL1.1 Hex Notation
   |
   | To meet this requirement, an implementation shall supply a mechanism
   | for specifying any Unicode code point (from U+0000 to U+10FFFF)
   +--------------------------

Java allows you to specify code points from U+0000 to U+FFFF, not to
U+10FFFF.  You can specify code points only from Plane 0, not code points
from Planes 1-16.  This is true both of the lexical substitution prepass,
in which "\uXXXX" is replaced by the corresponding code point, and of
the regex compilation itself, during which "\\uXXXX" is recognized
as a code point from Plane 0.

Note that the first of those two, the lexical analysis phase, is insufficient
to meet the requirement even for Plane 0, and so cannot be counted.  This
is because one must be able to specify *ANY* Unicode code point, but you
cannot do that with \uXXXX even in Plane 0 because certain code points
are forbidden, such as \u000A for line feed, \u0022 for double quote, and
\u005C for backslash.  Those are all treated as syntactic elements of Java
itself, not as the listed Plane 0 code point.

Which leaves us with the \\uXXXX notation.  That indeed allows
you to specify any code point *IN PLANE 0*.  It does not permit
you to specify any Unicode code point directly per the requirement.
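
A minimal Java sketch of that limit, using only java.util.regex from the
standard library.  The class name Plane0Escapes and the helper matches()
are made up for illustration.  It shows that the regex-level escape does
cope with the code points the lexical prepass cannot express, but that it
reads exactly four hex digits and so cannot reach past Plane 0.

```java
import java.util.regex.Pattern;

public class Plane0Escapes {
    // True if the regex (given as raw pattern text) matches the whole input.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Regex-level escapes: the Java compiler sees a doubled backslash,
        // so the regex parser, not the lexer, decodes the hex digits.
        System.out.println(matches("\\u000A", "\n"));   // line feed
        System.out.println(matches("\\u0022", "\""));   // double quote
        System.out.println(matches("\\u005C", "\\"));   // backslash

        // Only four hex digits are consumed, so this pattern means
        // U+1033 followed by a literal 'C', not U+1033C.
        String gothicManna = new String(Character.toChars(0x1033C));
        System.out.println(matches("\\u1033C", gothicManna));
    }
}
```

Run as written, this should print true three times and then false.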

Specifying single Unicode code points using indirect UTF-16 notation no
more suffices for this requirement than doing so using indirect UTF-8
notation.  Imagine if one had to specify Unicode code points using UTF-8
instead of UTF-16; I'll use the "\xXX" notation for this.  That means that
U+A3, POUND SIGN, would need to be specified as "\xC2\xA3".

But that is in clear violation of what Level 1 must provide:

    Level 1: Basic Unicode Support. At this level, the regular expression
    engine provides support for Unicode characters as basic logical units.
    (This is independent of the actual serialization of Unicode as UTF-8,
    UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE.) This is a minimal level for
    useful Unicode support.

This hypothetical example fails to meet that requirement
because you are no longer dealing with Unicode characters as
basic logical units.  You are forced to deal with serialization
issues of UTF-8.

I believe that hypothetical example is a clear violation of the
most basic Level 1 requirement, so it necessarily follows that
swapping UTF-8 for UTF-16 must also fail to meet that requirement.

For example, consider U+1F47E, ALIEN MONSTER.  Java requires you to write
that as "\uD83D\uDC7E" for the preprocessing step or as "\\uD83D\\uDC7E" for
the regex compiler.  That breaks the requirement just as surely as does
the equivalent UTF-8 encoding, "\xF0\x9F\x91\xBE", because both make you
consider serialization issues instead of logical code points.
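
Both serialized forms can be derived mechanically, which is rather the
point: they are encodings, not code points.  Here is a small Java sketch
that prints them, assuming only the standard library.  The class and
method names (SerializationForms, utf16Escapes, utf8Escapes) are
hypothetical, chosen just for this illustration.

```java
import java.nio.charset.StandardCharsets;

public class SerializationForms {
    // UTF-16 code units of one code point, written as regex-style escapes.
    static String utf16Escapes(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }

    // UTF-8 bytes of one code point, written as hex byte escapes.
    static String utf8Escapes(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("\\x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int alien = 0x1F47E;                        // ALIEN MONSTER
        System.out.println(utf16Escapes(alien));    // two UTF-16 code units
        System.out.println(utf8Escapes(alien));     // four UTF-8 bytes
        System.out.println(utf8Escapes(0xA3));      // POUND SIGN, two bytes
    }
}
```

In neither output does the number 1F47E itself appear, which is exactly
what the writer of the regex is being asked to work around.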

Please do not be distracted that this example uses an Emoji code point from
Unicode 6.0.  I did choose that one for its cuteness effect, but the same
problem applies to all non-Plane 0 code points, not just to the silly
ones or the new ones.  For example, Unicode 3.1 introduced code point
U+1033C, GOTHIC LETTER MANNA, which is not a "silly" or "cute" code point;
there are *very* many other such, too.

How to fix this to bring Java into compliance?  It's actually quite easy.
However, you will *not* be able to fulfill this requirement by adopting
the syntax \UXXXXX or \UXXXXXXX, because that syntax is already taken by
the regex compiler for toupper() case translation.

The standard mentions the notation \x{XXXX} as an alternative to \uXXXX
or \UXXXXXXX, and this is what I have elected to implement in my
regex-rewriting library.  Even though I show four X's there, you may use
any number of them, thus allowing you to write \x{A3}, \x{2019}, \x{1033C},
or even \x{10FFFF}.  This now meets the requirement of being able to specify
any Unicode code point between U+0000 and U+10FFFF.
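
To give a feel for how little code this takes, here is a rough Java sketch
of such a rewriting step.  It is not the code from my library: the class
name HexBraceRewriter and the method rewrite() are hypothetical, and it
rests on some simplifying assumptions.  It rewrites each \x{...} into the
four-digit escapes the regex compiler already accepts, using a surrogate
pair for anything beyond Plane 0, and it makes no attempt to respect
\Q...\E quoting.

```java
public class HexBraceRewriter {
    // Rewrites each \x{H...H} escape in raw regex text into the four-digit
    // escapes that java.util.regex already accepts. Hypothetical helper;
    // \Q...\E quoting is not handled.
    static String rewrite(String regex) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < regex.length()) {
            char c = regex.charAt(i);
            if (c != '\\' || i + 1 >= regex.length()) {
                out.append(c);          // ordinary character
                i++;
                continue;
            }
            int close = regex.indexOf('}', i + 3);
            if (regex.startsWith("x{", i + 1) && close != -1) {
                String hex = regex.substring(i + 3, close);
                // Any number of leading zeros, then up to six hex digits.
                if (!hex.matches("0*[0-9A-Fa-f]{1,6}")) {
                    throw new IllegalArgumentException("bad hex: " + hex);
                }
                int cp = Integer.parseInt(hex, 16);
                if (cp > Character.MAX_CODE_POINT) {
                    throw new IllegalArgumentException("beyond U+10FFFF: " + hex);
                }
                // One code unit for Plane 0, a surrogate pair otherwise.
                for (char unit : Character.toChars(cp)) {
                    out.append(String.format("\\u%04X", (int) unit));
                }
                i = close + 1;
            } else {
                // Any other escape: copy the backslash and the escaped char.
                out.append(c).append(regex.charAt(i + 1));
                i += 2;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String rewritten = rewrite("\\x{1F47E}\\x{A3}");
        System.out.println(rewritten);
        String input = new String(Character.toChars(0x1F47E)) + (char) 0xA3;
        System.out.println(java.util.regex.Pattern.matches(rewritten, input));
    }
}
```

Because an escaped backslash is copied through as a pair, text such as
\\x{41} (a literal backslash followed by x{41}) is left alone rather than
being mistaken for a hex escape.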

--tom

