<i18n dev> RL1.1 Hex Notation (part 2 of 3)

Mark Davis ☕ mark at macchiato.com
Tue Jan 25 17:09:41 PST 2011

The Unicode Standard distinguishes between Unicode Strings (16-bit) and
UTF-16. In the former, which is often the form used in programming
languages, a singleton value of 0xD800..0xDFFF is allowed, and is treated as
if it were a reserved code point.

So you do get some funny cases, because

   1. 0xD800 - 1 code point (degenerate surrogate)
   2. 0xDC00 - 1 code point (degenerate surrogate)
   3. 0xD800 0xDC00 - 1 code point (surrogate pair)
   4. 0xDC00 0xD800 - 2 code points (2 successive degenerate surrogates).

If you are working in UTF-8 or in UTF-32, then these cases wouldn't occur.
They can't happen in UTF-8, and in UTF-32 both cases 3 and 4 are 2
successive degenerate surrogates.
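
The four cases can be checked with a minimal Java sketch, assuming Java's
String as the 16-bit Unicode string form. The class name SurrogateCounts is
illustrative only.

```java
class SurrogateCounts {
    // Count code points the way Java's 16-bit string model does: a
    // well-formed surrogate pair is one code point, and an unpaired
    // surrogate is also counted as one code point.
    static int count(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(count("\uD800"));        // case 1: lone high surrogate
        System.out.println(count("\uDC00"));        // case 2: lone low surrogate
        System.out.println(count("\uD800\uDC00"));  // case 3: surrogate pair
        System.out.println(count("\uDC00\uD800"));  // case 4: reversed order
    }
}
```

This prints 1, 1, 1 and 2, in that order.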


*— Il meglio è l’inimico del bene —*

On Sat, Jan 22, 2011 at 19:54, Tom Christiansen <tchrist at perl.com> wrote:

> Sherman,
> In part 1, I outlined my thinking on why making end-users think about
> representation issues in regexes goes against, if not the letter, then
> to my mind certainly the spirit of UTS(tr)#18 when it says that
> "the regular expression engine provides support for Unicode characters as
> basic logical units."
> Please understand that I don't think that is much of a big deal -- it's a
> rather low priority bug at worst -- because when you look at it from a
> particular perspective, it appears to be a surface-level matter only.
> (Also because it is easily addressed just by adding \x{XXX}, which is
> both simple and safe.)
> It's not that big of a deal because as you yourself point out, Sherman, you
> can still specify any code point, although you have to bend over sideways
> to do it.  But it doesn't at all affect behaviour, which is by far the more
> important matter.
> These next two serialization concerns, however, are different. This time
> they are not just surface issues.  They are actual behavioral problems in
> regexes that derive from the actual internal implementation of characters
> in Java:
>  **  Surrogate Bugs in Regexes
>  **  CANON_EQ Bugs in Regexes with \\uXXXX
> I don't think users should have to know about those implementational
> details, but if they don't, they will get several sorts of anomalous
> behaviour.  I therefore believe those two are both genuine bugs.
> I know exactly what is causing the second one (code included), but
> fixing it is going to require some code rearrangement and reworking.
> ===========================
>  Surrogate Bugs in Regexes
> ===========================
> Here is one of them:
>    Unicode       UTF-16
>   Code Point     String            Pattern     Result
>   =========   ==============      =========    ======
>    U+1F47E    "\uD83D\uDC7E"       /^.$/       true
>     n/a       "\uD83D"             /^.$/       TRUE!
> I do not understand how that same pattern--which says to match
> strings containing a single Unicode code point only--can test true on
> both of those strings.  That's why I believe the TRUE! result is an error.
> Don't you?
> I understand that it brings up some tricky stuff.  Consider:
>    If you have a string "HL" where H is a high surrogate and L a low
>    surrogate, Java's regex engine correctly concludes that that string
>    "HL" exactly matches the pattern "^.$" in its entirety; it has just one
>    logical character in it.   This is correct.  It fails to match "^..$",
>    which is also correct and for the same reason.
>    However, if you flip those around to get string "LH", it now exactly
>    matches the pattern "^..$" in its entirety, thus claiming it holds
>    exactly two characters even though there are no legal
>    code points there!
>    If you have just one of the two surrogates, either "H" or "L", both of
>    those will also match "^.$" just as "HL" does.  That says that a single
>    surrogate is just as much a single logical character as a proper pair
>    of them together is just a single logical character.
> But that makes no sense at all.  How can both be correct?  Surely that
> *must* be a bug?  What am I not understanding here?
> I really think that rather than returning true for something that
> isn't even a legal Unicode code point, it should instead either
>    1: raise an exception
> and/or
>    2: admit some pattern flag to deal with such cases
> I say this because you are not supposed to have to deal with representation
> and serialization issues in regexes, and this makes you think about them.
> It also gives you bizarre answers even when you do think about them.
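
The surrogate cases described above can be tried with a minimal Java sketch.
The class and helper names are hypothetical, and isWellFormedUtf16 is only one
possible pre-check a caller could run before matching; it is not something
java.util.regex provides.

```java
import java.util.regex.Pattern;

class SurrogateRegexDemo {
    // True if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Hypothetical pre-check: true only if every surrogate in s belongs to
    // a high+low pair, i.e. s contains no unpaired surrogates.
    static boolean isWellFormedUtf16(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // skip the low half of a valid pair
                } else {
                    return false; // high surrogate with no low half after it
                }
            } else if (Character.isLowSurrogate(c)) {
                return false; // low surrogate with no high half before it
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String h = "\uD83D"; // high surrogate of U+1F47E
        String l = "\uDC7E"; // low surrogate of U+1F47E
        System.out.println("HL ^.$  : " + matches("^.$", h + l));
        System.out.println("HL ^..$ : " + matches("^..$", h + l));
        System.out.println("LH ^..$ : " + matches("^..$", l + h));
        System.out.println("H  ^.$  : " + matches("^.$", h));
        System.out.println("L  ^.$  : " + matches("^.$", l));
        System.out.println("LH well-formed: " + isWellFormedUtf16(l + h));
    }
}
```

A caller that wants the "raise an exception" behaviour could run such a check
and reject ill-formed input itself before handing it to the matcher.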
> =======================================
>  CANON_EQ Bugs in Regexes with \\uXXXX
> =======================================
> Another place where you are forced to think about the internal
> representation in Java regexes, is that they can behave differently if
> you pass things in as "\\uXXXX" instead of as "\uXXXX".  I don't think
> that can be correct behaviour, either.
> The problem is that CANON_EQ can no longer be trusted.  If you compile
> up these patterns with CANON_EQ, then it makes a difference whether you've
> used a literal or a \\uXXXX form.  Please consider these, as I believe that
> the FALSE! results below are all in error:
>        String          Pattern
>                       w/CANON_EQ           Result
>        =========     ============        =========
>     A : "\u00E9"       "^\u00E9$"          true
>     B : "\u00E9"       "^e\u0301$"         true
>     A': "\u00E9"       "^\\u00E9$"         true
>     B': "\u00E9"       "^e\\u0301$"        FALSE!
>     C : "e\u0301"      "^\u00E9$"          true
>     D : "e\u0301"      "^e\u0301$"         true
>     C': "e\u0301"      "^\\u00E9$"         FALSE!
>     D': "e\u0301"      "^e\\u0301$"        true
> The ABCD versions all use literals converted during the lexical
> substitution phase, whereas the prime versions use UTF-16 code
> units that get passed into the regex compiler for it to consider.
> (This second mechanism is indispensable to meet the requirement
> of being able to code up any code point, and to facilitate reading
> patterns written in ASCII but specifying trans-ASCII code points.)
> You get the same problem with octal notation: you can specify U+E9 as
> "\351" for the prepass literal (which works), or as "\\0351" for the
> regex engine to see (which fails just as \\u did):
>        String          Pattern
>                       w/CANON_EQ           Result
>        =========     ============        =========
>     a : "\u00E9"       "^\351$"            true
>     a': "\u00E9"       "^\\0351$"          true
>     c : "e\u0301"      "^\351$"            true
>     c': "e\u0301"      "^\\0351$"          FALSE!
> As you might predict, using UTF-8 directly in your code and compiling with
> "javac -encoding UTF-8" behaves exactly as the non-prime "\uXXXX" versions
> do, which can differ from how the prime "\\uXXXX" versions behave.
> From looking at the code, I am sure I can reproduce this with \xXX escapes
> as well.  That's because you do the normalization reshuffle before you
> actually compile the pattern, so you won't see the octal or hex escapes
> when you're doing the normalization.  The bug is right here in this code,
> from around line 1500 of jdk1.7.0/java/util/regex/Pattern.java:
>    /**
>     * Copies regular expression to an int array and invokes the parsing
>     * of the expression which will create the object tree.
>     */
>    private void compile() {
>        // Handle canonical equivalences
>        if (has(CANON_EQ) && !has(LITERAL)) {
>            normalize();
>        } else {
>            normalizedPattern = pattern;
>        }
>        patternLength = normalizedPattern.length();
>        // Copy pattern to int array for convenience
>        // Use double zero to terminate pattern
>        temp = new int[patternLength + 2];
> Because things like \cC and \0XXX and \xXX and \uXXXX all get handled
> *after* that point in the code, they are *not* the same as literals with
> those values.  This is a genuine problem.
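
The first CANON_EQ table above can be re-run with a small Java sketch. The
class and method names are hypothetical, and no results are claimed here
because they may depend on the JDK release. The canonicallyEquivalent helper
uses java.text.Normalizer only to confirm that the two input strings are
canonically equivalent, which is why all eight rows would be expected to agree.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

class CanonEqDemo {
    // True if a and b have the same canonical decomposition (NFD).
    static boolean canonicallyEquivalent(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFD)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFD));
    }

    // Full-string match with the CANON_EQ flag set.
    static boolean matchesCanonEq(String regex, String input) {
        return Pattern.compile(regex, Pattern.CANON_EQ).matcher(input).matches();
    }

    public static void main(String[] args) {
        String composed = "\u00E9";    // precomposed e-acute
        String decomposed = "e\u0301"; // e followed by combining acute
        System.out.println("equivalent inputs: "
                + canonicallyEquivalent(composed, decomposed));

        // The first two patterns hold literal characters (the escape is
        // resolved by the Java compiler); the last two hand a
        // backslash-u escape to the regex compiler.
        String[] patterns = { "^\u00E9$", "^e\u0301$", "^\\u00E9$", "^e\\u0301$" };
        String[][] labels = { { "A", "B", "A'", "B'" }, { "C", "D", "C'", "D'" } };
        String[] inputs = { composed, decomposed };
        for (int i = 0; i < inputs.length; i++) {
            for (int j = 0; j < patterns.length; j++) {
                System.out.println(labels[i][j] + ": "
                        + matchesCanonEq(patterns[j], inputs[i]));
            }
        }
    }
}
```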
> So again we have to think about how things are stored.  It means that
> you cannot just read in patterns that have had their non-ASCII characters
> converted into \uXXXX escapes and have them work the same as having the
> literals in there.  Those are supposed to be the same as the literals, but
> they're not.
> This is quite apart from the--um, "syntactic infelicity"?--of the mismatch
> between how octal escapes are specified in the lexical substitution pass
> versus how they're specified in the regex engine.  That, I wouldn't quite
> call a bug so much as an unexpected wrinkle.  I do fix this in my regex
> rewriter, BTW.
>    (There are "syntactic infelicities" with \cC, too.  It is a bit too
>     undiscerning, producing things that aren't guaranteed to be control
>     characters because it blindly xors whatever follows it with 64.  For
>     example, \c} is = and \c= is }, \cé is © and \c© is é, etc. )
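
The xor-with-64 arithmetic in the examples above can be checked with a short
Java sketch. The helper controlOf is hypothetical and only reproduces the
mapping as described; it is not the regex engine's own code.

```java
import java.util.regex.Pattern;

class ControlEscapeDemo {
    // The mapping described above: xor the character that follows
    // backslash-c with 64, whether or not the result is a control character.
    static char controlOf(char c) {
        return (char) (c ^ 64);
    }

    public static void main(String[] args) {
        System.out.println((int) controlOf('C'));             // 3, i.e. Ctrl-C
        System.out.println(controlOf('}'));                   // =
        System.out.println(controlOf('='));                   // }
        System.out.println(controlOf('\u00E9') == '\u00A9');  // true

        // What the regex engine itself does with such an escape can be
        // observed directly; the result is printed rather than assumed.
        System.out.println(Pattern.matches("\\c}", "="));
    }
}
```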
> This message is far too long again, so I will discuss your comments
> regarding the j.l.Character class in part 3 of 3, to be sent later on.
> Thanks again!
> --tom
